2023-06-17 16:37:24,872 INFO [train.py:1064] (3/4) Training started
2023-06-17 16:37:24,873 INFO [train.py:1074] (3/4) Device: cuda:3
2023-06-17 16:37:27,143 INFO [lexicon.py:168] (3/4) Loading pre-compiled data/lang_char/Linv.pt
2023-06-17 16:37:27,407 INFO [train.py:1085] (3/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '802bf98-dirty', 'icefall-git-date': 'Fri Jun 16 18:26:55 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-7-1218101249-5d97868c7c-v8ngc', 'IP address': '10.177.77.18'}, 'world_size': 4, 'master_port': 12536, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2023-06-17 16:37:27,407 INFO [train.py:1087] (3/4) About to create model
2023-06-17 16:37:27,971 INFO [train.py:1091] (3/4) Number of model parameters: 32327030
2023-06-17 16:37:34,605 INFO [train.py:1106] (3/4) Using DDP
2023-06-17 16:37:34,875 INFO [asr_datamodule.py:390] (3/4) About to get train cuts
2023-06-17 16:37:34,899 INFO [asr_datamodule.py:398] (3/4) About to get dev cuts
2023-06-17 16:37:34,902 INFO [asr_datamodule.py:211] (3/4) About to get Musan cuts
2023-06-17 16:37:37,663 INFO [asr_datamodule.py:216] (3/4) Enable MUSAN
2023-06-17 16:37:37,663 INFO [asr_datamodule.py:239] (3/4) Enable SpecAugment
2023-06-17 16:37:37,664 INFO [asr_datamodule.py:240] (3/4) Time warp factor: 80
2023-06-17 16:37:37,665 INFO [asr_datamodule.py:250] (3/4) Num frame mask: 10
2023-06-17 16:37:37,666 INFO [asr_datamodule.py:263] (3/4) About to create train dataset
2023-06-17 16:37:37,667 INFO [asr_datamodule.py:289] (3/4) Using DynamicBucketingSampler.
2023-06-17 16:37:41,998 INFO [asr_datamodule.py:305] (3/4) About to create train dataloader
2023-06-17 16:37:41,999 INFO [asr_datamodule.py:336] (3/4) About to create dev dataset
2023-06-17 16:37:42,664 INFO [asr_datamodule.py:354] (3/4) About to create dev dataloader
2023-06-17 16:39:50,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.16 vs. limit=5.0
2023-06-17 16:39:59,891 INFO [train.py:996] (3/4) Epoch 1, batch 0, loss[loss=10.45, simple_loss=9.494, pruned_loss=9.521, over 21848.00 frames. ], tot_loss[loss=10.45, simple_loss=9.494, pruned_loss=9.521, over 21848.00 frames. ], batch size: 98, lr: 2.25e-02, grad_scale: 1.0
2023-06-17 16:39:59,892 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-17 16:40:52,888 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=10.49, simple_loss=9.517, pruned_loss=9.679, over 1796401.00 frames.
2023-06-17 16:40:52,890 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 22433MB
2023-06-17 16:41:04,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=0.0, ans=0.9
2023-06-17 16:41:05,350 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=7.5
2023-06-17 16:41:12,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=60.0, ans=0.8979
2023-06-17 16:41:15,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=60.0, ans=0.09865
2023-06-17 16:42:31,418 INFO [train.py:996] (3/4) Epoch 1, batch 50, loss[loss=1.42, simple_loss=1.275, pruned_loss=1.318, over 21479.00 frames. ], tot_loss[loss=4.159, simple_loss=3.849, pruned_loss=3.062, over 961062.51 frames. ], batch size: 211, lr: 2.48e-02, grad_scale: 0.5
2023-06-17 16:43:01,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=29.51 vs. limit=4.12
2023-06-17 16:43:12,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=194.13 vs. limit=7.77
2023-06-17 16:43:15,734 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=48.94 vs. limit=4.144
2023-06-17 16:43:18,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=360.0, ans=0.8874000000000001
2023-06-17 16:43:24,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=234.35 vs. limit=7.6575
2023-06-17 16:43:29,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=420.0, ans=0.4803125
2023-06-17 16:43:46,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=48.53 vs. limit=7.815
2023-06-17 16:43:49,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=4.168
2023-06-17 16:43:51,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=242.11 vs. limit=7.68
2023-06-17 16:43:54,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=480.0, ans=0.4775
2023-06-17 16:44:04,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=85.00 vs. limit=7.68
2023-06-17 16:44:36,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=540.0, ans=0.17975000000000002
2023-06-17 16:44:37,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=4.216
2023-06-17 16:44:39,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=540.0, ans=0.4746875
2023-06-17 16:44:40,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=7.7025
2023-06-17 16:44:45,951 INFO [train.py:996] (3/4) Epoch 1, batch 100, loss[loss=1.395, simple_loss=1.212, pruned_loss=1.465, over 21759.00 frames. ], tot_loss[loss=2.614, simple_loss=2.386, pruned_loss=2.121, over 1693739.24 frames. ], batch size: 351, lr: 2.70e-02, grad_scale: 1.0
2023-06-17 16:44:49,125 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 2.605e+02 7.361e+02 5.108e+03 2.907e+04, threshold=1.472e+03, percent-clipped=0.0
2023-06-17 16:44:57,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=600.0, ans=5.375
2023-06-17 16:45:00,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=660.0, ans=0.2934
2023-06-17 16:45:26,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=106.21 vs. limit=7.7475
2023-06-17 16:45:41,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=76.31 vs. limit=7.77
2023-06-17 16:45:42,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=720.0, ans=0.46625
2023-06-17 16:45:46,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=16.97 vs. limit=4.288
2023-06-17 16:46:21,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=77.85 vs. limit=7.7925
2023-06-17 16:46:45,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=840.0, ans=0.460625
2023-06-17 16:46:50,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=840.0, ans=0.5
2023-06-17 16:46:53,214 INFO [train.py:996] (3/4) Epoch 1, batch 150, loss[loss=1.189, simple_loss=1.012, pruned_loss=1.277, over 21610.00 frames. ], tot_loss[loss=2.007, simple_loss=1.807, pruned_loss=1.748, over 2270275.71 frames. ], batch size: 230, lr: 2.93e-02, grad_scale: 1.0
2023-06-17 16:46:55,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=8.175
2023-06-17 16:47:23,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.69 vs. limit=8.22
2023-06-17 16:47:31,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=33.62 vs. limit=7.8825
2023-06-17 16:47:53,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=59.44 vs. limit=7.8825
2023-06-17 16:48:21,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1080.0, ans=0.046625
2023-06-17 16:48:21,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.55 vs. limit=5.54
2023-06-17 16:48:39,819 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=4.432
2023-06-17 16:48:46,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.89 vs. limit=5.57
2023-06-17 16:48:47,649 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. limit=5.285
2023-06-17 16:48:57,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=72.46 vs. limit=7.95
2023-06-17 16:48:58,450 INFO [train.py:996] (3/4) Epoch 1, batch 200, loss[loss=0.8853, simple_loss=0.7558, pruned_loss=0.8793, over 15594.00 frames. ], tot_loss[loss=1.684, simple_loss=1.501, pruned_loss=1.514, over 2698189.69 frames. ], batch size: 60, lr: 3.15e-02, grad_scale: 2.0
2023-06-17 16:49:01,468 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.811e+01 1.173e+02 1.419e+02 1.881e+02 2.743e+02, threshold=2.839e+02, percent-clipped=0.0
2023-06-17 16:49:08,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=114.39 vs. limit=7.95
2023-06-17 16:49:38,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.85 vs. limit=5.63
2023-06-17 16:49:45,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=156.47 vs. limit=7.9725
2023-06-17 16:50:38,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=4.5280000000000005
2023-06-17 16:50:57,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1380.0, ans=0.4353125
2023-06-17 16:51:13,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.90 vs. limit=8.58
2023-06-17 16:51:18,146 INFO [train.py:996] (3/4) Epoch 1, batch 250, loss[loss=1.007, simple_loss=0.8587, pruned_loss=0.9486, over 21609.00 frames. ], tot_loss[loss=1.478, simple_loss=1.306, pruned_loss=1.343, over 3041837.37 frames. ], batch size: 263, lr: 3.38e-02, grad_scale: 2.0
2023-06-17 16:51:24,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1500.0, ans=0.4296875
2023-06-17 16:51:49,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1560.0, ans=0.426875
2023-06-17 16:51:59,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=8.67
2023-06-17 16:52:20,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1620.0, ans=0.4240625
2023-06-17 16:52:38,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=42.26 vs. limit=8.1075
2023-06-17 16:52:43,184 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=56.40 vs. limit=8.13
2023-06-17 16:53:03,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1680.0, ans=0.42125
2023-06-17 16:53:21,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1740.0, ans=0.13474999999999998
2023-06-17 16:53:22,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=8.1525
2023-06-17 16:53:28,112 INFO [train.py:996] (3/4) Epoch 1, batch 300, loss[loss=0.806, simple_loss=0.6833, pruned_loss=0.7394, over 21723.00 frames. ], tot_loss[loss=1.329, simple_loss=1.166, pruned_loss=1.21, over 3313218.17 frames. ], batch size: 124, lr: 3.60e-02, grad_scale: 4.0
2023-06-17 16:53:31,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 7.248e+01 1.102e+02 1.349e+02 1.694e+02 3.595e+02, threshold=2.697e+02, percent-clipped=2.0
2023-06-17 16:54:48,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.75 vs. limit=5.96
2023-06-17 16:55:00,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=27.09 vs. limit=8.2425
2023-06-17 16:55:12,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=8.2425
2023-06-17 16:55:13,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=8.985
2023-06-17 16:55:14,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1980.0, ans=0.4071875
2023-06-17 16:55:24,905 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=9.03
2023-06-17 16:55:38,320 INFO [train.py:996] (3/4) Epoch 1, batch 350, loss[loss=0.8637, simple_loss=0.7321, pruned_loss=0.7592, over 21812.00 frames. ], tot_loss[loss=1.215, simple_loss=1.059, pruned_loss=1.102, over 3527100.02 frames. ], batch size: 352, lr: 3.83e-02, grad_scale: 4.0
2023-06-17 16:55:42,506 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=3.315
2023-06-17 16:55:45,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2100.0, ans=0.4015625
2023-06-17 16:55:52,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=144.23 vs. limit=8.2875
2023-06-17 16:55:53,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=9.120000000000001
2023-06-17 16:55:56,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=8.31
2023-06-17 16:56:27,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=9.165
2023-06-17 16:56:27,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=24.70 vs. limit=9.165
2023-06-17 16:56:34,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2220.0, ans=0.2778
2023-06-17 16:56:36,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.52 vs. limit=8.3325
2023-06-17 16:56:51,801 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=10.78 vs. limit=6.11
2023-06-17 16:57:03,169 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.85 vs. limit=9.21
2023-06-17 16:57:10,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=90.97 vs. limit=8.355
2023-06-17 16:57:14,615 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=8.355
2023-06-17 16:57:29,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=25.29 vs. limit=8.3775
2023-06-17 16:57:43,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2340.0, ans=0.3903125
2023-06-17 16:57:46,853 INFO [train.py:996] (3/4) Epoch 1, batch 400, loss[loss=0.8428, simple_loss=0.7092, pruned_loss=0.7297, over 21428.00 frames. ], tot_loss[loss=1.13, simple_loss=0.9785, pruned_loss=1.017, over 3691888.38 frames. ], batch size: 389, lr: 4.05e-02, grad_scale: 8.0
2023-06-17 16:57:50,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 9.350e+01 1.224e+02 1.536e+02 2.025e+02 4.442e+02, threshold=3.072e+02, percent-clipped=8.0
2023-06-17 16:57:52,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.14 vs. limit=6.2
2023-06-17 16:57:53,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2400.0, ans=0.8160000000000001
2023-06-17 16:57:58,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2400.0, ans=0.3875
2023-06-17 16:58:01,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2460.0, ans=0.0423125
2023-06-17 16:58:59,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2580.0, ans=0.04195
2023-06-17 16:59:12,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.17 vs. limit=5.645
2023-06-17 16:59:28,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2580.0, ans=0.37906249999999997
2023-06-17 16:59:32,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2640.0, ans=0.8076000000000001
2023-06-17 16:59:50,787 INFO [train.py:996] (3/4) Epoch 1, batch 450, loss[loss=1.2, simple_loss=1.008, pruned_loss=1.011, over 21757.00 frames. ], tot_loss[loss=1.068, simple_loss=0.9194, pruned_loss=0.9518, over 3821739.15 frames. ], batch size: 351, lr: 4.28e-02, grad_scale: 8.0
2023-06-17 17:00:01,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=27.81 vs. limit=8.5125
2023-06-17 17:00:03,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.73 vs. limit=5.675
2023-06-17 17:00:22,065 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=22.56 vs. limit=8.535
2023-06-17 17:00:37,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2820.0, ans=0.8013
2023-06-17 17:01:17,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=9.66
2023-06-17 17:01:28,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=5.176
2023-06-17 17:01:30,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.96 vs. limit=8.6025
2023-06-17 17:01:32,247 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.12 vs. limit=8.6025
2023-06-17 17:01:40,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=8.6025
2023-06-17 17:01:45,398 INFO [train.py:996] (3/4) Epoch 1, batch 500, loss[loss=0.7629, simple_loss=0.6424, pruned_loss=0.6175, over 21849.00 frames. ], tot_loss[loss=1.032, simple_loss=0.884, pruned_loss=0.9058, over 3920434.13 frames. ], batch size: 107, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:01:48,417 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 9.420e+01 1.754e+02 2.624e+02 3.522e+02 8.349e+02, threshold=5.248e+02, percent-clipped=35.0
2023-06-17 17:02:04,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=28.50 vs. limit=8.625
2023-06-17 17:02:53,405 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.74 vs. limit=6.53
2023-06-17 17:03:04,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=3120.0, ans=0.7908000000000001
2023-06-17 17:03:04,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=3120.0, ans=0.35375
2023-06-17 17:03:45,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=8.715
2023-06-17 17:03:58,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=8.715
2023-06-17 17:04:00,014 INFO [train.py:996] (3/4) Epoch 1, batch 550, loss[loss=0.8554, simple_loss=0.7229, pruned_loss=0.667, over 21887.00 frames. ], tot_loss[loss=1.002, simple_loss=0.8552, pruned_loss=0.8599, over 4006739.15 frames. ], batch size: 332, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:04:30,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.04 vs. limit=9.975
2023-06-17 17:04:31,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.12 vs. limit=6.65
2023-06-17 17:04:37,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=3360.0, ans=0.3425
2023-06-17 17:04:47,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=10.02
2023-06-17 17:05:35,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=10.11
2023-06-17 17:05:44,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=3540.0, ans=0.06724999999999998
2023-06-17 17:05:59,260 INFO [train.py:996] (3/4) Epoch 1, batch 600, loss[loss=1.062, simple_loss=0.9109, pruned_loss=0.7782, over 21215.00 frames. ], tot_loss[loss=0.9646, simple_loss=0.8237, pruned_loss=0.8065, over 4068106.12 frames. ], batch size: 548, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:06:03,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 3.057e+02 4.199e+02 5.888e+02 1.512e+03, threshold=8.399e+02, percent-clipped=32.0
2023-06-17 17:06:03,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=3600.0, ans=0.33125
2023-06-17 17:06:05,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=8.85
2023-06-17 17:07:03,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.56 vs. limit=3.558
2023-06-17 17:07:06,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=3720.0, ans=0.7872
2023-06-17 17:07:06,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.83 vs. limit=5.93
2023-06-17 17:07:32,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=3780.0, ans=0.25670000000000004
2023-06-17 17:07:56,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=3840.0, ans=0.013600000000000001
2023-06-17 17:08:07,410 INFO [train.py:996] (3/4) Epoch 1, batch 650, loss[loss=0.7375, simple_loss=0.6319, pruned_loss=0.5304, over 21704.00 frames. ], tot_loss[loss=0.9249, simple_loss=0.7912, pruned_loss=0.7509, over 4121635.07 frames. ], batch size: 298, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:08:09,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=10.425
2023-06-17 17:08:41,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=3960.0, ans=0.31437499999999996
2023-06-17 17:09:19,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=10.515
2023-06-17 17:09:22,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=4020.0, ans=0.31156249999999996
2023-06-17 17:09:50,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4080.0, ans=0.2592
2023-06-17 17:09:52,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=4080.0, ans=0.009982608695652173
2023-06-17 17:09:56,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=9.03
2023-06-17 17:10:02,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=4140.0, ans=10.605
2023-06-17 17:10:13,700 INFO [train.py:996] (3/4) Epoch 1, batch 700, loss[loss=0.7301, simple_loss=0.6243, pruned_loss=0.5181, over 21371.00 frames. ], tot_loss[loss=0.8822, simple_loss=0.7565, pruned_loss=0.6963, over 4162392.43 frames. ], batch size: 471, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:10:15,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=4200.0, ans=0.753
2023-06-17 17:10:16,592 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 4.036e+02 7.786e+02 1.089e+03 2.394e+03, threshold=1.557e+03, percent-clipped=44.0
2023-06-17 17:10:23,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=4200.0, ans=7.625
2023-06-17 17:10:40,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=10.65
2023-06-17 17:11:49,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.97 vs. limit=6.095
2023-06-17 17:12:22,075 INFO [train.py:996] (3/4) Epoch 1, batch 750, loss[loss=0.6261, simple_loss=0.5425, pruned_loss=0.4235, over 15325.00 frames. ], tot_loss[loss=0.8445, simple_loss=0.7265, pruned_loss=0.6476, over 4180662.06 frames. ], batch size: 63, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:12:41,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=4500.0, ans=0.2890625
2023-06-17 17:13:43,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=4620.0, ans=0.2038
2023-06-17 17:14:15,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=4740.0, ans=0.2711
2023-06-17 17:14:21,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=4740.0, ans=0.7341
2023-06-17 17:14:30,910 INFO [train.py:996] (3/4) Epoch 1, batch 800, loss[loss=0.5891, simple_loss=0.5254, pruned_loss=0.3668, over 21755.00 frames. ], tot_loss[loss=0.803, simple_loss=0.6935, pruned_loss=0.5982, over 4200026.64 frames. ], batch size: 112, lr: 4.49e-02, grad_scale: 16.0
2023-06-17 17:14:31,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=4800.0, ans=0.252
2023-06-17 17:14:33,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 4.647e+02 7.147e+02 1.104e+03 3.003e+03, threshold=1.429e+03, percent-clipped=10.0
2023-06-17 17:14:37,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4800.0, ans=0.275
2023-06-17 17:16:06,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=4980.0, ans=0.26656250000000004
2023-06-17 17:16:15,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=5040.0, ans=0.0
2023-06-17 17:16:21,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=3.7560000000000002
2023-06-17 17:16:25,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=5040.0, ans=0.04566666666666667
2023-06-17 17:16:25,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=5040.0, ans=0.04566666666666667
2023-06-17 17:16:26,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=5040.0, ans=0.26375000000000004
2023-06-17 17:16:32,237 INFO [train.py:996] (3/4) Epoch 1, batch 850, loss[loss=0.6645, simple_loss=0.5706, pruned_loss=0.4459, over 21647.00 frames. ], tot_loss[loss=0.7633, simple_loss=0.6617, pruned_loss=0.5536, over 4219690.15 frames. ], batch size: 508, lr: 4.49e-02, grad_scale: 4.0
2023-06-17 17:16:39,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=11.325
2023-06-17 17:16:43,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=5100.0, ans=0.04541666666666667
2023-06-17 17:17:02,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=5.032
2023-06-17 17:18:04,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=5280.0, ans=0.2525
2023-06-17 17:18:49,064 INFO [train.py:996] (3/4) Epoch 1, batch 900, loss[loss=0.5864, simple_loss=0.5221, pruned_loss=0.36, over 21752.00 frames. ], tot_loss[loss=0.7297, simple_loss=0.6357, pruned_loss=0.5142, over 4238365.75 frames. ], batch size: 247, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:18:55,065 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 5.086e+02 7.748e+02 1.151e+03 3.891e+03, threshold=1.550e+03, percent-clipped=18.0
2023-06-17 17:19:03,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=5460.0, ans=0.24406250000000002
2023-06-17 17:19:11,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=5460.0, ans=0.24406250000000002
2023-06-17 17:19:30,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=5520.0, ans=0.009669565217391304
2023-06-17 17:20:05,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5580.0, ans=0.24419999999999997
2023-06-17 17:20:11,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=11.684999999999999
2023-06-17 17:20:56,019 INFO [train.py:996] (3/4) Epoch 1, batch 950, loss[loss=0.5771, simple_loss=0.5168, pruned_loss=0.347, over 21290.00 frames. ], tot_loss[loss=0.6984, simple_loss=0.6116, pruned_loss=0.4788, over 4256711.95 frames. ], batch size: 176, lr: 4.48e-02, grad_scale: 4.0
2023-06-17 17:21:39,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=5760.0, ans=0.22999999999999998
2023-06-17 17:21:58,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=5760.0, ans=0.04949747468305833
2023-06-17 17:22:03,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=9.682500000000001
2023-06-17 17:22:52,033 INFO [train.py:996] (3/4) Epoch 1, batch 1000, loss[loss=0.5279, simple_loss=0.4737, pruned_loss=0.3137, over 21464.00 frames. ], tot_loss[loss=0.6723, simple_loss=0.5914, pruned_loss=0.4495, over 4270305.19 frames. ], batch size: 212, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:22:53,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=6000.0, ans=8.75
2023-06-17 17:23:13,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 4.662e+02 7.435e+02 1.273e+03 3.855e+03, threshold=1.487e+03, percent-clipped=17.0
2023-06-17 17:23:47,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=9.7725
2023-06-17 17:24:32,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=6180.0, ans=0.2103125
2023-06-17 17:24:59,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=6240.0, ans=0.0
2023-06-17 17:25:17,337 INFO [train.py:996] (3/4) Epoch 1, batch 1050, loss[loss=0.5337, simple_loss=0.4876, pruned_loss=0.304, over 21265.00 frames. ], tot_loss[loss=0.6504, simple_loss=0.5749, pruned_loss=0.4245, over 4274115.67 frames. ], batch size: 159, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:25:17,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=6300.0, ans=0.6795
2023-06-17 17:26:03,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=6360.0, ans=0.20187500000000003
2023-06-17 17:26:07,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=6360.0, ans=0.04016666666666667
2023-06-17 17:26:57,555 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-17 17:27:45,202 INFO [train.py:996] (3/4) Epoch 1, batch 1100, loss[loss=0.5143, simple_loss=0.4752, pruned_loss=0.2854, over 21871.00 frames. ], tot_loss[loss=0.6349, simple_loss=0.5646, pruned_loss=0.4041, over 4286646.14 frames. ], batch size: 118, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:27:47,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=6600.0, ans=0.669
2023-06-17 17:28:00,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 3.328e+02 5.726e+02 1.108e+03 4.215e+03, threshold=1.145e+03, percent-clipped=17.0
2023-06-17 17:28:16,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=6660.0, ans=0.1878125
2023-06-17 17:28:37,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=6720.0, ans=0.04949747468305833
2023-06-17 17:29:44,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=6840.0, ans=0.028625
2023-06-17 17:30:10,958 INFO [train.py:996] (3/4) Epoch 1, batch 1150, loss[loss=0.5306, simple_loss=0.504, pruned_loss=0.2778, over 21569.00 frames. ], tot_loss[loss=0.6197, simple_loss=0.5538, pruned_loss=0.386, over 4288717.40 frames. ], batch size: 230, lr: 4.47e-02, grad_scale: 4.0
2023-06-17 17:30:25,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=6960.0, ans=0.17375000000000002
2023-06-17 17:30:44,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=6960.0, ans=0.17375000000000002
2023-06-17 17:30:45,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=7020.0, ans=9.3875
2023-06-17 17:32:13,928 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.57 vs. limit=4.071
2023-06-17 17:32:20,710 INFO [train.py:996] (3/4) Epoch 1, batch 1200, loss[loss=0.4794, simple_loss=0.4276, pruned_loss=0.2813, over 20269.00 frames. ], tot_loss[loss=0.6086, simple_loss=0.5466, pruned_loss=0.3713, over 4287082.28 frames. ], batch size: 703, lr: 4.47e-02, grad_scale: 8.0
2023-06-17 17:32:21,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7200.0, ans=0.22799999999999998
2023-06-17 17:32:26,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.15 vs. limit=8.6
2023-06-17 17:32:36,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.515e+02 4.923e+02 7.154e+02 1.207e+03 2.545e+03, threshold=1.431e+03, percent-clipped=26.0
2023-06-17 17:32:37,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=7200.0, ans=0.16249999999999998
2023-06-17 17:34:18,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=7440.0, ans=0.035666666666666666
2023-06-17 17:34:27,346 INFO [train.py:996] (3/4) Epoch 1, batch 1250, loss[loss=0.6989, simple_loss=0.6218, pruned_loss=0.4096, over 21532.00 frames. ], tot_loss[loss=0.6029, simple_loss=0.5433, pruned_loss=0.3619, over 4292748.11 frames. ], batch size: 509, lr: 4.47e-02, grad_scale: 8.0
2023-06-17 17:35:03,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=7560.0, ans=0.6354
2023-06-17 17:36:45,489 INFO [train.py:996] (3/4) Epoch 1, batch 1300, loss[loss=0.4553, simple_loss=0.432, pruned_loss=0.2389, over 21405.00 frames. ], tot_loss[loss=0.5887, simple_loss=0.5338, pruned_loss=0.3467, over 4287857.57 frames. ], batch size: 131, lr: 4.47e-02, grad_scale: 8.0
2023-06-17 17:36:46,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7800.0, ans=0.222
2023-06-17 17:37:02,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 4.002e+02 7.251e+02 1.294e+03 4.242e+03, threshold=1.450e+03, percent-clipped=21.0
2023-06-17 17:37:02,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=7800.0, ans=0.034166666666666665
2023-06-17 17:38:54,694 INFO [train.py:996] (3/4) Epoch 1, batch 1350, loss[loss=0.5125, simple_loss=0.4849, pruned_loss=0.2703, over 21229.00 frames. ], tot_loss[loss=0.5829, simple_loss=0.5306, pruned_loss=0.3384, over 4293083.69 frames. ], batch size: 176, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:39:45,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=8220.0, ans=0.07
2023-06-17 17:40:59,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8340.0, ans=0.21660000000000001
2023-06-17 17:41:08,247 INFO [train.py:996] (3/4) Epoch 1, batch 1400, loss[loss=0.6016, simple_loss=0.5544, pruned_loss=0.3305, over 21705.00 frames. ], tot_loss[loss=0.5689, simple_loss=0.5201, pruned_loss=0.3259, over 4286308.27 frames. ], batch size: 441, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:41:24,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 4.733e+02 7.957e+02 1.163e+03 2.485e+03, threshold=1.591e+03, percent-clipped=13.0
2023-06-17 17:41:26,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.10 vs. limit=4.26
2023-06-17 17:41:58,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=8460.0, ans=0.125
2023-06-17 17:42:09,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8520.0, ans=0.2148
2023-06-17 17:42:09,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=10.695
2023-06-17 17:42:10,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=8520.0, ans=0.125
2023-06-17 17:42:56,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=8580.0, ans=0.0
2023-06-17 17:43:12,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=8640.0, ans=0.008991304347826088
2023-06-17 17:43:23,578 INFO [train.py:996] (3/4) Epoch 1, batch 1450, loss[loss=0.553, simple_loss=0.5187, pruned_loss=0.2955, over 21246.00 frames. ], tot_loss[loss=0.5639, simple_loss=0.5167, pruned_loss=0.3198, over 4293180.22 frames. ], batch size: 143, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:43:47,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8700.0, ans=0.213
2023-06-17 17:43:51,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=10.785
2023-06-17 17:44:07,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=8760.0, ans=0.125
2023-06-17 17:44:39,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.11 vs. limit=9.41
2023-06-17 17:45:34,295 INFO [train.py:996] (3/4) Epoch 1, batch 1500, loss[loss=0.5178, simple_loss=0.4835, pruned_loss=0.2782, over 21480.00 frames. ], tot_loss[loss=0.5567, simple_loss=0.5125, pruned_loss=0.312, over 4294663.93 frames. ], batch size: 131, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:46:08,689 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.531e+02 4.868e+02 8.441e+02 1.240e+03 3.321e+03, threshold=1.688e+03, percent-clipped=12.0
2023-06-17 17:46:26,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=10.8975
2023-06-17 17:46:28,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=10.92
2023-06-17 17:46:29,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=9120.0, ans=0.02866666666666667
2023-06-17 17:47:29,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.10 vs. limit=9.59
2023-06-17 17:47:30,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=9180.0, ans=0.5787
2023-06-17 17:47:57,232 INFO [train.py:996] (3/4) Epoch 1, batch 1550, loss[loss=0.5425, simple_loss=0.5099, pruned_loss=0.2886, over 21541.00 frames. ], tot_loss[loss=0.5438, simple_loss=0.5044, pruned_loss=0.3003, over 4294372.59 frames. ], batch size: 414, lr: 4.45e-02, grad_scale: 8.0
2023-06-17 17:47:57,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=9300.0, ans=0.5745
2023-06-17 17:47:59,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=9300.0, ans=0.125
2023-06-17 17:48:10,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=9300.0, ans=0.5745
2023-06-17 17:48:51,732 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.40 vs. limit=14.52
2023-06-17 17:48:55,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=9360.0, ans=0.008834782608695652
2023-06-17 17:48:58,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=9420.0, ans=0.008821739130434783
2023-06-17 17:49:22,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=9420.0, ans=0.025
2023-06-17 17:49:29,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=11.0325
2023-06-17 17:49:30,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=11.055
2023-06-17 17:49:39,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=9480.0, ans=0.125
2023-06-17 17:49:39,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=9480.0, ans=0.04949747468305833
2023-06-17 17:50:24,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=9540.0, ans=0.5661
2023-06-17 17:50:25,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9600.0, ans=0.20400000000000001
2023-06-17 17:50:26,908 INFO [train.py:996] (3/4) Epoch 1, batch 1600, loss[loss=0.4097, simple_loss=0.4064, pruned_loss=0.2023, over 21266.00 frames. ], tot_loss[loss=0.5336, simple_loss=0.4981, pruned_loss=0.2911, over 4288703.39 frames. ], batch size: 176, lr: 4.45e-02, grad_scale: 16.0
2023-06-17 17:50:36,896 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.850e+02 5.686e+02 1.025e+03 3.086e+03, threshold=1.137e+03, percent-clipped=9.0
2023-06-17 17:51:12,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=14.745000000000001
2023-06-17 17:51:21,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=9660.0, ans=0.125
2023-06-17 17:51:44,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=9720.0, ans=0.5598000000000001
2023-06-17 17:52:43,630 INFO [train.py:996] (3/4) Epoch 1, batch 1650, loss[loss=0.5698, simple_loss=0.5362, pruned_loss=0.3023, over 21698.00 frames. ], tot_loss[loss=0.5273, simple_loss=0.4951, pruned_loss=0.2846, over 4282994.27 frames. ], batch size: 414, lr: 4.45e-02, grad_scale: 16.0
2023-06-17 17:52:58,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=9900.0, ans=0.34850000000000003
2023-06-17 17:53:16,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9960.0, ans=0.2004
2023-06-17 17:53:27,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=11.235
2023-06-17 17:54:30,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.66 vs. limit=7.52
2023-06-17 17:54:31,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=10080.0, ans=0.5472
2023-06-17 17:55:02,456 INFO [train.py:996] (3/4) Epoch 1, batch 1700, loss[loss=0.4857, simple_loss=0.4691, pruned_loss=0.2496, over 21063.00 frames. ], tot_loss[loss=0.5259, simple_loss=0.4961, pruned_loss=0.2814, over 4282614.43 frames. ], batch size: 608, lr: 4.44e-02, grad_scale: 8.0
2023-06-17 17:55:31,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.673e+02 4.663e+02 7.889e+02 1.170e+03 3.370e+03, threshold=1.578e+03, percent-clipped=25.0
2023-06-17 17:56:38,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=10380.0, ans=0.05
2023-06-17 17:57:01,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10380.0, ans=0.19619999999999999
2023-06-17 17:57:13,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=10440.0, ans=0.0
2023-06-17 17:57:23,251 INFO [train.py:996] (3/4) Epoch 1, batch 1750, loss[loss=0.5113, simple_loss=0.4973, pruned_loss=0.261, over 21458.00 frames. ], tot_loss[loss=0.5161, simple_loss=0.4917, pruned_loss=0.2723, over 4275231.89 frames. ], batch size: 548, lr: 4.44e-02, grad_scale: 8.0
2023-06-17 18:00:11,319 INFO [train.py:996] (3/4) Epoch 1, batch 1800, loss[loss=0.4406, simple_loss=0.427, pruned_loss=0.2261, over 21541.00 frames. ], tot_loss[loss=0.5009, simple_loss=0.4809, pruned_loss=0.2616, over 4278252.40 frames. ], batch size: 263, lr: 4.44e-02, grad_scale: 8.0
2023-06-17 18:00:28,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 4.588e+02 7.695e+02 1.107e+03 4.356e+03, threshold=1.539e+03, percent-clipped=16.0
2023-06-17 18:01:00,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.14 vs. limit=15.645
2023-06-17 18:01:08,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=10860.0, ans=0.02141666666666667
2023-06-17 18:01:08,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=10860.0, ans=0.125
2023-06-17 18:01:24,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=10920.0, ans=11.594999999999999
2023-06-17 18:01:32,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=11.594999999999999
2023-06-17 18:01:52,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10980.0, ans=0.19019999999999998
2023-06-17 18:01:55,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11040.0, ans=0.1896
2023-06-17 18:02:36,584 INFO [train.py:996] (3/4) Epoch 1, batch 1850, loss[loss=0.5312, simple_loss=0.499, pruned_loss=0.282, over 21662.00 frames. ], tot_loss[loss=0.4913, simple_loss=0.4766, pruned_loss=0.2534, over 4271918.25 frames. ], batch size: 263, lr: 4.43e-02, grad_scale: 8.0
2023-06-17 18:04:08,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=11280.0, ans=0.125
2023-06-17 18:04:38,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=11340.0, ans=0.0
2023-06-17 18:04:53,652 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-17 18:04:54,568 INFO [train.py:996] (3/4) Epoch 1, batch 1900, loss[loss=0.5283, simple_loss=0.4914, pruned_loss=0.2831, over 21730.00 frames. ], tot_loss[loss=0.4841, simple_loss=0.4715, pruned_loss=0.2484, over 4274568.75 frames. ], batch size: 473, lr: 4.43e-02, grad_scale: 8.0
2023-06-17 18:05:08,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=11400.0, ans=0.01916666666666667
2023-06-17 18:05:12,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 4.389e+02 6.592e+02 1.010e+03 2.305e+03, threshold=1.318e+03, percent-clipped=4.0
2023-06-17 18:05:16,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=11460.0, ans=0.49890000000000007
2023-06-17 18:05:18,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=11460.0, ans=0.01891666666666667
2023-06-17 18:05:42,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=11520.0, ans=0.01866666666666667
2023-06-17 18:07:06,307 INFO [train.py:996] (3/4) Epoch 1, batch 1950, loss[loss=0.5951, simple_loss=0.5555, pruned_loss=0.3175, over 21765.00 frames. ], tot_loss[loss=0.4767, simple_loss=0.4646, pruned_loss=0.2443, over 4276743.65 frames. ], batch size: 441, lr: 4.43e-02, grad_scale: 8.0
2023-06-17 18:07:34,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=11760.0, ans=0.00831304347826087
2023-06-17 18:07:41,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.18 vs. limit=4.764
2023-06-17 18:07:48,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=11.932500000000001
2023-06-17 18:08:47,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=11940.0, ans=0.125
2023-06-17 18:09:07,365 INFO [train.py:996] (3/4) Epoch 1, batch 2000, loss[loss=0.3773, simple_loss=0.4008, pruned_loss=0.177, over 21799.00 frames. ], tot_loss[loss=0.4672, simple_loss=0.4582, pruned_loss=0.2379, over 4283342.23 frames. ], batch size: 282, lr: 4.42e-02, grad_scale: 16.0
2023-06-17 18:09:09,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=12000.0, ans=0.0
2023-06-17 18:09:37,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 4.885e+02 7.905e+02 1.281e+03 2.485e+03, threshold=1.581e+03, percent-clipped=23.0
2023-06-17 18:10:09,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=12120.0, ans=0.4758
2023-06-17 18:10:20,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=12120.0, ans=0.125
2023-06-17 18:10:24,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=12120.0, ans=0.01616666666666667
2023-06-17 18:10:57,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=12180.0, ans=0.125
2023-06-17 18:11:10,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=12240.0, ans=0.125
2023-06-17 18:11:39,759 INFO [train.py:996] (3/4) Epoch 1, batch 2050, loss[loss=0.4217, simple_loss=0.4472, pruned_loss=0.1981, over 21605.00 frames. ], tot_loss[loss=0.4667, simple_loss=0.4604, pruned_loss=0.2364, over 4290937.23 frames. ], batch size: 263, lr: 4.42e-02, grad_scale: 8.0
2023-06-17 18:12:48,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=12420.0, ans=0.125
2023-06-17 18:13:28,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.28 vs. limit=4.881
2023-06-17 18:13:35,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12540.0, ans=0.17459999999999998
2023-06-17 18:13:44,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12540.0, ans=0.17459999999999998
2023-06-17 18:13:46,881 INFO [train.py:996] (3/4) Epoch 1, batch 2100, loss[loss=0.4631, simple_loss=0.4598, pruned_loss=0.2332, over 21758.00 frames. ], tot_loss[loss=0.4723, simple_loss=0.466, pruned_loss=0.2392, over 4279728.03 frames. ], batch size: 124, lr: 4.42e-02, grad_scale: 8.0
2023-06-17 18:13:51,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=12600.0, ans=0.125
2023-06-17 18:13:56,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=12600.0, ans=0.125
2023-06-17 18:13:59,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.516e+02 5.167e+02 7.622e+02 1.111e+03 2.066e+03, threshold=1.524e+03, percent-clipped=6.0
2023-06-17 18:14:58,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=12720.0, ans=0.125
2023-06-17 18:15:53,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=12840.0, ans=0.125
2023-06-17 18:16:06,965 INFO [train.py:996] (3/4) Epoch 1, batch 2150, loss[loss=0.4002, simple_loss=0.3992, pruned_loss=0.2006, over 21630.00 frames. ], tot_loss[loss=0.4693, simple_loss=0.4623, pruned_loss=0.2381, over 4285508.73 frames. ], batch size: 247, lr: 4.41e-02, grad_scale: 8.0
2023-06-17 18:16:30,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=12960.0, ans=0.125
2023-06-17 18:16:57,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.3825
2023-06-17 18:17:31,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=12.405000000000001
2023-06-17 18:18:29,032 INFO [train.py:996] (3/4) Epoch 1, batch 2200, loss[loss=0.3291, simple_loss=0.3684, pruned_loss=0.1449, over 21245.00 frames. ], tot_loss[loss=0.4656, simple_loss=0.4635, pruned_loss=0.2338, over 4286303.98 frames. ], batch size: 159, lr: 4.41e-02, grad_scale: 8.0
2023-06-17 18:18:44,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=13200.0, ans=10.0
2023-06-17 18:18:48,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 4.529e+02 5.924e+02 1.033e+03 2.265e+03, threshold=1.185e+03, percent-clipped=8.0
2023-06-17 18:18:54,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=13260.0, ans=0.07
2023-06-17 18:19:00,137 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-17 18:20:35,196 INFO [train.py:996] (3/4) Epoch 1, batch 2250, loss[loss=0.3586, simple_loss=0.3711, pruned_loss=0.1731, over 21841.00 frames. ], tot_loss[loss=0.4513, simple_loss=0.4536, pruned_loss=0.2244, over 4288714.99 frames. ], batch size: 98, lr: 4.40e-02, grad_scale: 8.0
2023-06-17 18:21:03,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=13500.0, ans=0.125
2023-06-17 18:21:12,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=13560.0, ans=0.02
2023-06-17 18:21:22,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=9.448
2023-06-17 18:21:39,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=13620.0, ans=0.4233
2023-06-17 18:21:55,437 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-17 18:22:27,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=13740.0, ans=0.125
2023-06-17 18:22:35,757 INFO [train.py:996] (3/4) Epoch 1, batch 2300, loss[loss=0.3997, simple_loss=0.405, pruned_loss=0.1971, over 21655.00 frames. ], tot_loss[loss=0.4426, simple_loss=0.4446, pruned_loss=0.2202, over 4276781.08 frames.
], batch size: 282, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:22:49,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=13800.0, ans=0.00916666666666667 2023-06-17 18:22:54,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 4.349e+02 7.156e+02 9.563e+02 2.862e+03, threshold=1.431e+03, percent-clipped=11.0 2023-06-17 18:24:50,687 INFO [train.py:996] (3/4) Epoch 1, batch 2350, loss[loss=0.3713, simple_loss=0.3747, pruned_loss=0.184, over 21485.00 frames. ], tot_loss[loss=0.4417, simple_loss=0.4428, pruned_loss=0.2203, over 4284931.45 frames. ], batch size: 195, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:24:59,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=12.7875 2023-06-17 18:25:12,603 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.06 vs. limit=8.54 2023-06-17 18:25:40,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=14220.0, ans=0.125 2023-06-17 18:26:24,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=14280.0, ans=0.007765217391304348 2023-06-17 18:26:24,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14280.0, ans=0.1572 2023-06-17 18:26:46,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.36 vs. limit=18.255000000000003 2023-06-17 18:27:03,567 INFO [train.py:996] (3/4) Epoch 1, batch 2400, loss[loss=0.5281, simple_loss=0.5166, pruned_loss=0.2698, over 21513.00 frames. ], tot_loss[loss=0.4513, simple_loss=0.4513, pruned_loss=0.2257, over 4278902.07 frames. ], batch size: 414, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:27:24,515 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.741e+02 4.803e+02 6.612e+02 1.169e+03 2.103e+03, threshold=1.322e+03, percent-clipped=15.0 2023-06-17 18:28:02,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.36 vs. limit=5.178 2023-06-17 18:28:03,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=14520.0, ans=0.125 2023-06-17 18:28:36,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=14580.0, ans=0.38970000000000005 2023-06-17 18:28:56,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=12.99 2023-06-17 18:29:04,436 INFO [train.py:996] (3/4) Epoch 1, batch 2450, loss[loss=0.4525, simple_loss=0.4541, pruned_loss=0.2254, over 21770.00 frames. ], tot_loss[loss=0.454, simple_loss=0.4552, pruned_loss=0.2263, over 4275603.53 frames. 
], batch size: 124, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:29:04,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=14700.0, ans=0.38550000000000006 2023-06-17 18:29:14,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.33 vs. limit=13.0125 2023-06-17 18:29:15,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14700.0, ans=0.15300000000000002 2023-06-17 18:29:27,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=14760.0, ans=0.04949747468305833 2023-06-17 18:30:30,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=14940.0, ans=0.125 2023-06-17 18:30:47,043 INFO [train.py:996] (3/4) Epoch 1, batch 2500, loss[loss=0.3823, simple_loss=0.4345, pruned_loss=0.165, over 21336.00 frames. ], tot_loss[loss=0.4447, simple_loss=0.448, pruned_loss=0.2207, over 4280128.31 frames. ], batch size: 176, lr: 4.38e-02, grad_scale: 16.0 2023-06-17 18:30:56,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15000.0, ans=0.15000000000000002 2023-06-17 18:31:09,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.980e+02 5.871e+02 7.826e+02 2.441e+03, threshold=1.174e+03, percent-clipped=5.0 2023-06-17 18:32:00,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15120.0, ans=0.14880000000000002 2023-06-17 18:32:00,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=15120.0, ans=0.0 2023-06-17 18:32:11,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=13.192499999999999 2023-06-17 18:32:56,398 INFO [train.py:996] (3/4) Epoch 1, batch 2550, loss[loss=0.3792, simple_loss=0.388, pruned_loss=0.1852, over 21669.00 frames. ], tot_loss[loss=0.4379, simple_loss=0.4445, pruned_loss=0.2156, over 4275223.69 frames. ], batch size: 283, lr: 4.38e-02, grad_scale: 16.0 2023-06-17 18:33:15,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=15360.0, ans=0.4304 2023-06-17 18:34:35,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=15480.0, ans=0.125 2023-06-17 18:34:38,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=15480.0, ans=0.125 2023-06-17 18:34:43,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=15540.0, ans=0.35609999999999997 2023-06-17 18:34:52,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=15540.0, ans=0.1 2023-06-17 18:35:02,573 INFO [train.py:996] (3/4) Epoch 1, batch 2600, loss[loss=0.4327, simple_loss=0.4377, pruned_loss=0.2139, over 21774.00 frames. ], tot_loss[loss=0.442, simple_loss=0.4476, pruned_loss=0.2182, over 4281414.12 frames. 
], batch size: 247, lr: 4.37e-02, grad_scale: 16.0 2023-06-17 18:35:10,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=15600.0, ans=0.125 2023-06-17 18:35:16,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.31 vs. limit=5.349 2023-06-17 18:35:16,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 4.497e+02 6.428e+02 1.038e+03 2.322e+03, threshold=1.286e+03, percent-clipped=17.0 2023-06-17 18:36:09,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=15720.0, ans=0.125 2023-06-17 18:36:45,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=15840.0, ans=0.09899494936611666 2023-06-17 18:37:03,977 INFO [train.py:996] (3/4) Epoch 1, batch 2650, loss[loss=0.4215, simple_loss=0.4363, pruned_loss=0.2034, over 21965.00 frames. ], tot_loss[loss=0.4423, simple_loss=0.4478, pruned_loss=0.2184, over 4287861.16 frames. ], batch size: 316, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:37:41,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=13.5075 2023-06-17 18:38:42,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.79 vs. limit=13.530000000000001 2023-06-17 18:38:49,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=16140.0, ans=0.050705 2023-06-17 18:38:51,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16140.0, ans=0.1386 2023-06-17 18:39:09,546 INFO [train.py:996] (3/4) Epoch 1, batch 2700, loss[loss=0.3114, simple_loss=0.3289, pruned_loss=0.147, over 21347.00 frames. ], tot_loss[loss=0.4347, simple_loss=0.4434, pruned_loss=0.213, over 4281627.44 frames. ], batch size: 131, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:39:14,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=16200.0, ans=0.125 2023-06-17 18:39:28,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 4.133e+02 5.896e+02 7.988e+02 2.040e+03, threshold=1.179e+03, percent-clipped=10.0 2023-06-17 18:40:50,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=13.665 2023-06-17 18:41:06,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=16440.0, ans=0.0 2023-06-17 18:41:13,164 INFO [train.py:996] (3/4) Epoch 1, batch 2750, loss[loss=0.4069, simple_loss=0.4419, pruned_loss=0.186, over 21450.00 frames. ], tot_loss[loss=0.4328, simple_loss=0.4422, pruned_loss=0.2117, over 4290356.53 frames. 
], batch size: 194, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:42:02,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=16560.0, ans=13.71 2023-06-17 18:42:09,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=16560.0, ans=0.125 2023-06-17 18:43:09,272 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.71 vs. limit=5.502 2023-06-17 18:43:32,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16740.0, ans=0.13260000000000002 2023-06-17 18:43:47,053 INFO [train.py:996] (3/4) Epoch 1, batch 2800, loss[loss=0.3901, simple_loss=0.3912, pruned_loss=0.1945, over 21295.00 frames. ], tot_loss[loss=0.4338, simple_loss=0.4445, pruned_loss=0.2115, over 4285901.46 frames. ], batch size: 551, lr: 4.36e-02, grad_scale: 16.0 2023-06-17 18:43:47,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=16800.0, ans=0.31200000000000006 2023-06-17 18:44:01,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=16800.0, ans=0.125 2023-06-17 18:44:11,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=16800.0, ans=0.0 2023-06-17 18:44:17,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16860.0, ans=0.13140000000000002 2023-06-17 18:44:18,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 4.703e+02 6.814e+02 1.223e+03 2.130e+03, threshold=1.363e+03, percent-clipped=25.0 2023-06-17 18:45:35,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=16980.0, ans=0.125 2023-06-17 18:46:05,393 INFO [train.py:996] (3/4) Epoch 1, batch 2850, loss[loss=0.4126, simple_loss=0.4368, pruned_loss=0.1943, over 21657.00 frames. ], tot_loss[loss=0.4322, simple_loss=0.4435, pruned_loss=0.2105, over 4284326.23 frames. ], batch size: 414, lr: 4.35e-02, grad_scale: 16.0 2023-06-17 18:46:39,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=10.84 2023-06-17 18:48:12,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17400.0, ans=0.126 2023-06-17 18:48:20,746 INFO [train.py:996] (3/4) Epoch 1, batch 2900, loss[loss=0.4043, simple_loss=0.4156, pruned_loss=0.1965, over 21892.00 frames. ], tot_loss[loss=0.4276, simple_loss=0.4393, pruned_loss=0.208, over 4286891.69 frames. 
], batch size: 107, lr: 4.35e-02, grad_scale: 16.0 2023-06-17 18:48:48,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 4.392e+02 5.988e+02 8.416e+02 1.775e+03, threshold=1.198e+03, percent-clipped=6.0 2023-06-17 18:49:38,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=17520.0, ans=0.007060869565217391 2023-06-17 18:49:45,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17580.0, ans=0.1242 2023-06-17 18:49:47,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=17580.0, ans=0.125 2023-06-17 18:50:57,614 INFO [train.py:996] (3/4) Epoch 1, batch 2950, loss[loss=0.3761, simple_loss=0.3925, pruned_loss=0.1798, over 21396.00 frames. ], tot_loss[loss=0.4248, simple_loss=0.4392, pruned_loss=0.2052, over 4293432.10 frames. ], batch size: 144, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:51:42,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=17820.0, ans=0.125 2023-06-17 18:51:52,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=17820.0, ans=0.125 2023-06-17 18:53:14,231 INFO [train.py:996] (3/4) Epoch 1, batch 3000, loss[loss=0.4604, simple_loss=0.4621, pruned_loss=0.2293, over 21534.00 frames. ], tot_loss[loss=0.4291, simple_loss=0.4441, pruned_loss=0.2071, over 4283172.56 frames. ], batch size: 194, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:53:14,233 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-17 18:54:05,131 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3426, simple_loss=0.4236, pruned_loss=0.1308, over 1796401.00 frames. 2023-06-17 18:54:05,133 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-17 18:54:11,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18000.0, ans=0.12000000000000002 2023-06-17 18:54:32,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18060.0, ans=0.1194 2023-06-17 18:54:33,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 4.493e+02 5.938e+02 7.860e+02 2.320e+03, threshold=1.188e+03, percent-clipped=8.0 2023-06-17 18:54:33,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=18060.0, ans=0.125 2023-06-17 18:56:05,089 INFO [train.py:996] (3/4) Epoch 1, batch 3050, loss[loss=0.4528, simple_loss=0.4598, pruned_loss=0.2229, over 21880.00 frames. ], tot_loss[loss=0.4281, simple_loss=0.4448, pruned_loss=0.2057, over 4285537.41 frames. 
], batch size: 371, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:56:15,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18300.0, ans=0.11700000000000002 2023-06-17 18:56:34,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=18360.0, ans=0.2574000000000001 2023-06-17 18:57:11,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=18420.0, ans=0.125 2023-06-17 18:57:18,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=14.407499999999999 2023-06-17 18:58:29,010 INFO [train.py:996] (3/4) Epoch 1, batch 3100, loss[loss=0.455, simple_loss=0.4496, pruned_loss=0.2302, over 21579.00 frames. ], tot_loss[loss=0.422, simple_loss=0.4403, pruned_loss=0.2019, over 4283397.29 frames. ], batch size: 548, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:58:46,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=18600.0, ans=0.249 2023-06-17 18:58:50,767 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.527e+02 3.860e+02 4.806e+02 7.218e+02 1.901e+03, threshold=9.611e+02, percent-clipped=6.0 2023-06-17 18:59:28,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=18720.0, ans=0.125 2023-06-17 18:59:55,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18780.0, ans=0.11220000000000002 2023-06-17 19:00:19,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=14.565000000000001 2023-06-17 19:00:55,768 INFO [train.py:996] (3/4) Epoch 1, batch 3150, loss[loss=0.4938, simple_loss=0.4956, pruned_loss=0.246, over 21484.00 frames. ], tot_loss[loss=0.4273, simple_loss=0.4448, pruned_loss=0.2049, over 4288414.26 frames. ], batch size: 131, lr: 4.32e-02, grad_scale: 8.0 2023-06-17 19:01:03,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=14.5875 2023-06-17 19:01:44,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=19020.0, ans=0.006734782608695652 2023-06-17 19:02:10,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=19020.0, ans=0.125 2023-06-17 19:03:01,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=19140.0, ans=0.125 2023-06-17 19:03:01,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=11.655999999999999 2023-06-17 19:03:25,394 INFO [train.py:996] (3/4) Epoch 1, batch 3200, loss[loss=0.4108, simple_loss=0.4481, pruned_loss=0.1867, over 21676.00 frames. ], tot_loss[loss=0.4244, simple_loss=0.4446, pruned_loss=0.2021, over 4290584.08 frames. 
], batch size: 414, lr: 4.32e-02, grad_scale: 16.0 2023-06-17 19:04:02,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 3.900e+02 5.632e+02 8.444e+02 2.494e+03, threshold=1.126e+03, percent-clipped=20.0 2023-06-17 19:05:52,667 INFO [train.py:996] (3/4) Epoch 1, batch 3250, loss[loss=0.3904, simple_loss=0.3996, pruned_loss=0.1906, over 21381.00 frames. ], tot_loss[loss=0.4252, simple_loss=0.4433, pruned_loss=0.2036, over 4289420.10 frames. ], batch size: 211, lr: 4.31e-02, grad_scale: 16.0 2023-06-17 19:06:09,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=19560.0, ans=0.0 2023-06-17 19:07:16,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=19680.0, ans=0.006591304347826087 2023-06-17 19:07:46,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=19740.0, ans=0.125 2023-06-17 19:07:55,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=14.9025 2023-06-17 19:07:59,357 INFO [train.py:996] (3/4) Epoch 1, batch 3300, loss[loss=0.3407, simple_loss=0.3961, pruned_loss=0.1426, over 21533.00 frames. ], tot_loss[loss=0.419, simple_loss=0.4366, pruned_loss=0.2007, over 4285533.99 frames. ], batch size: 230, lr: 4.31e-02, grad_scale: 16.0 2023-06-17 19:08:19,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=19860.0, ans=0.125 2023-06-17 19:08:20,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.802e+02 5.443e+02 8.160e+02 1.939e+03, threshold=1.089e+03, percent-clipped=11.0 2023-06-17 19:08:36,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=19860.0, ans=0.20489999999999997 2023-06-17 19:08:38,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=19860.0, ans=0.125 2023-06-17 19:09:03,555 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=11.968 2023-06-17 19:09:05,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19920.0, ans=0.1008 2023-06-17 19:09:25,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=19980.0, ans=0.125 2023-06-17 19:09:26,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.12 vs. limit=5.997 2023-06-17 19:10:20,570 INFO [train.py:996] (3/4) Epoch 1, batch 3350, loss[loss=0.4119, simple_loss=0.4391, pruned_loss=0.1924, over 21407.00 frames. ], tot_loss[loss=0.4191, simple_loss=0.4386, pruned_loss=0.1997, over 4285201.49 frames. 
], batch size: 548, lr: 4.30e-02, grad_scale: 8.0 2023-06-17 19:10:22,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=20100.0, ans=0.006500000000000001 2023-06-17 19:10:28,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=20100.0, ans=0.07 2023-06-17 19:11:32,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.08 vs. limit=10.0 2023-06-17 19:11:33,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=20220.0, ans=0.125 2023-06-17 19:11:57,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=20280.0, ans=0.2 2023-06-17 19:12:39,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.93 vs. limit=22.5 2023-06-17 19:12:42,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=20400.0, ans=0.09899494936611666 2023-06-17 19:12:43,815 INFO [train.py:996] (3/4) Epoch 1, batch 3400, loss[loss=0.4139, simple_loss=0.4295, pruned_loss=0.1991, over 21859.00 frames. ], tot_loss[loss=0.419, simple_loss=0.4379, pruned_loss=0.2001, over 4287055.09 frames. ], batch size: 124, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 19:12:57,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=20400.0, ans=0.2 2023-06-17 19:13:03,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=20460.0, ans=0.0 2023-06-17 19:13:07,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 4.511e+02 6.532e+02 8.905e+02 1.651e+03, threshold=1.306e+03, percent-clipped=8.0 2023-06-17 19:13:48,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=20520.0, ans=0.125 2023-06-17 19:13:56,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20580.0, ans=0.1 2023-06-17 19:14:09,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=20580.0, ans=0.006395652173913044 2023-06-17 19:14:58,843 INFO [train.py:996] (3/4) Epoch 1, batch 3450, loss[loss=0.6166, simple_loss=0.5822, pruned_loss=0.3255, over 21386.00 frames. ], tot_loss[loss=0.4143, simple_loss=0.4314, pruned_loss=0.1986, over 4282446.39 frames. ], batch size: 507, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 19:17:20,356 INFO [train.py:996] (3/4) Epoch 1, batch 3500, loss[loss=0.4414, simple_loss=0.4617, pruned_loss=0.2105, over 21805.00 frames. ], tot_loss[loss=0.4249, simple_loss=0.4422, pruned_loss=0.2038, over 4282053.38 frames. 
], batch size: 247, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 19:17:42,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=21000.0, ans=0.125 2023-06-17 19:17:58,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.542e+02 4.045e+02 5.374e+02 7.279e+02 2.253e+03, threshold=1.075e+03, percent-clipped=5.0 2023-06-17 19:19:09,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.29 vs. limit=10.0 2023-06-17 19:19:45,207 INFO [train.py:996] (3/4) Epoch 1, batch 3550, loss[loss=0.4079, simple_loss=0.4243, pruned_loss=0.1958, over 21209.00 frames. ], tot_loss[loss=0.4274, simple_loss=0.4448, pruned_loss=0.205, over 4278488.94 frames. ], batch size: 159, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 19:20:18,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=21300.0, ans=0.2 2023-06-17 19:20:24,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=21360.0, ans=0.5 2023-06-17 19:20:45,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=21420.0, ans=0.025 2023-06-17 19:21:06,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=21420.0, ans=0.025 2023-06-17 19:21:46,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=21540.0, ans=0.125 2023-06-17 19:21:55,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-06-17 19:21:59,191 INFO [train.py:996] (3/4) Epoch 1, batch 3600, loss[loss=0.4611, simple_loss=0.4642, pruned_loss=0.229, over 21560.00 frames. ], tot_loss[loss=0.4218, simple_loss=0.4384, pruned_loss=0.2026, over 4280994.50 frames. ], batch size: 389, lr: 4.27e-02, grad_scale: 16.0 2023-06-17 19:22:07,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21600.0, ans=0.1 2023-06-17 19:22:31,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=21600.0, ans=0.125 2023-06-17 19:22:43,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 3.884e+02 5.320e+02 7.505e+02 1.580e+03, threshold=1.064e+03, percent-clipped=11.0 2023-06-17 19:22:48,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=21660.0, ans=0.125 2023-06-17 19:23:10,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=21660.0, ans=0.0 2023-06-17 19:23:13,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=21720.0, ans=0.125 2023-06-17 19:24:16,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=21840.0, ans=0.125 2023-06-17 19:24:43,923 INFO [train.py:996] (3/4) Epoch 1, batch 3650, loss[loss=0.4828, simple_loss=0.5128, pruned_loss=0.2264, over 19800.00 frames. 
], tot_loss[loss=0.4224, simple_loss=0.4401, pruned_loss=0.2023, over 4282370.79 frames. ], batch size: 702, lr: 4.27e-02, grad_scale: 16.0 2023-06-17 19:25:13,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21960.0, ans=0.1 2023-06-17 19:25:14,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=21960.0, ans=0.2 2023-06-17 19:25:45,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.78 vs. limit=15.0 2023-06-17 19:26:14,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=22080.0, ans=0.0 2023-06-17 19:26:22,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=22080.0, ans=0.0060695652173913045 2023-06-17 19:26:27,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-17 19:26:28,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=22140.0, ans=0.125 2023-06-17 19:26:58,877 INFO [train.py:996] (3/4) Epoch 1, batch 3700, loss[loss=0.4377, simple_loss=0.4559, pruned_loss=0.2097, over 21841.00 frames. ], tot_loss[loss=0.4205, simple_loss=0.4389, pruned_loss=0.2011, over 4276496.25 frames. ], batch size: 351, lr: 4.26e-02, grad_scale: 16.0 2023-06-17 19:27:08,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=22200.0, ans=0.125 2023-06-17 19:27:15,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=22200.0, ans=0.125 2023-06-17 19:27:30,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 4.188e+02 5.889e+02 8.889e+02 2.124e+03, threshold=1.178e+03, percent-clipped=16.0 2023-06-17 19:29:19,301 INFO [train.py:996] (3/4) Epoch 1, batch 3750, loss[loss=0.3737, simple_loss=0.3721, pruned_loss=0.1877, over 20235.00 frames. ], tot_loss[loss=0.4142, simple_loss=0.4329, pruned_loss=0.1978, over 4285587.97 frames. ], batch size: 703, lr: 4.26e-02, grad_scale: 16.0 2023-06-17 19:29:52,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=22560.0, ans=0.125 2023-06-17 19:30:34,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-17 19:30:47,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=22620.0, ans=0.125 2023-06-17 19:31:54,522 INFO [train.py:996] (3/4) Epoch 1, batch 3800, loss[loss=0.4379, simple_loss=0.4456, pruned_loss=0.2151, over 19987.00 frames. ], tot_loss[loss=0.4112, simple_loss=0.4312, pruned_loss=0.1956, over 4287321.30 frames. 
], batch size: 703, lr: 4.25e-02, grad_scale: 16.0 2023-06-17 19:32:06,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=22800.0, ans=0.125 2023-06-17 19:32:08,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=22860.0, ans=0.2 2023-06-17 19:32:11,770 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.353e+02 4.457e+02 7.554e+02 3.391e+03, threshold=8.914e+02, percent-clipped=13.0 2023-06-17 19:32:25,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=22860.0, ans=0.0 2023-06-17 19:33:23,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=22980.0, ans=15.0 2023-06-17 19:33:29,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.38 vs. limit=22.5 2023-06-17 19:33:58,054 INFO [train.py:996] (3/4) Epoch 1, batch 3850, loss[loss=0.3953, simple_loss=0.3982, pruned_loss=0.1962, over 21874.00 frames. ], tot_loss[loss=0.4097, simple_loss=0.4287, pruned_loss=0.1953, over 4282522.72 frames. ], batch size: 373, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 19:34:12,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=22.5 2023-06-17 19:34:55,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=23220.0, ans=0.2 2023-06-17 19:36:07,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=23340.0, ans=0.125 2023-06-17 19:36:14,372 INFO [train.py:996] (3/4) Epoch 1, batch 3900, loss[loss=0.3609, simple_loss=0.3774, pruned_loss=0.1722, over 21233.00 frames. ], tot_loss[loss=0.4038, simple_loss=0.4229, pruned_loss=0.1924, over 4284944.90 frames. ], batch size: 548, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 19:36:23,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=23400.0, ans=0.0 2023-06-17 19:36:37,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=23460.0, ans=0.125 2023-06-17 19:36:38,862 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.460e+02 4.927e+02 7.077e+02 1.688e+03, threshold=9.853e+02, percent-clipped=16.0 2023-06-17 19:37:54,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23580.0, ans=0.1 2023-06-17 19:38:01,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=23640.0, ans=0.2 2023-06-17 19:38:03,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23640.0, ans=0.1 2023-06-17 19:38:11,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=23640.0, ans=0.125 2023-06-17 19:38:38,972 INFO [train.py:996] (3/4) Epoch 1, batch 3950, loss[loss=0.2495, simple_loss=0.304, pruned_loss=0.09747, over 21317.00 frames. 
], tot_loss[loss=0.4014, simple_loss=0.4229, pruned_loss=0.19, over 4291922.11 frames. ], batch size: 176, lr: 4.23e-02, grad_scale: 8.0 2023-06-17 19:39:22,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=23760.0, ans=0.005704347826086957 2023-06-17 19:39:24,433 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:39:27,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=23820.0, ans=0.125 2023-06-17 19:40:21,395 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-17 19:40:57,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-17 19:41:01,647 INFO [train.py:996] (3/4) Epoch 1, batch 4000, loss[loss=0.3453, simple_loss=0.3669, pruned_loss=0.1619, over 21787.00 frames. ], tot_loss[loss=0.392, simple_loss=0.4157, pruned_loss=0.1841, over 4287658.38 frames. ], batch size: 124, lr: 4.23e-02, grad_scale: 16.0 2023-06-17 19:41:35,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 3.734e+02 4.906e+02 6.607e+02 1.436e+03, threshold=9.812e+02, percent-clipped=4.0 2023-06-17 19:42:11,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=6.0 2023-06-17 19:42:11,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=24120.0, ans=0.125 2023-06-17 19:42:20,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=24180.0, ans=0.025 2023-06-17 19:43:26,360 INFO [train.py:996] (3/4) Epoch 1, batch 4050, loss[loss=0.3627, simple_loss=0.3667, pruned_loss=0.1793, over 20840.00 frames. ], tot_loss[loss=0.3879, simple_loss=0.4138, pruned_loss=0.181, over 4277846.27 frames. ], batch size: 613, lr: 4.22e-02, grad_scale: 4.0 2023-06-17 19:43:34,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.95 vs. limit=22.5 2023-06-17 19:43:44,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=24360.0, ans=0.125 2023-06-17 19:44:36,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-17 19:45:41,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=24600.0, ans=0.005521739130434783 2023-06-17 19:45:42,498 INFO [train.py:996] (3/4) Epoch 1, batch 4100, loss[loss=0.3802, simple_loss=0.4013, pruned_loss=0.1795, over 21933.00 frames. ], tot_loss[loss=0.3896, simple_loss=0.4147, pruned_loss=0.1822, over 4285798.92 frames. ], batch size: 316, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 19:46:06,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. 
limit=15.0 2023-06-17 19:46:09,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=24600.0, ans=0.125 2023-06-17 19:46:11,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=24660.0, ans=0.2 2023-06-17 19:46:34,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.802e+02 5.077e+02 7.572e+02 1.841e+03, threshold=1.015e+03, percent-clipped=11.0 2023-06-17 19:47:45,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=24780.0, ans=0.125 2023-06-17 19:48:01,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=24900.0, ans=0.125 2023-06-17 19:48:02,816 INFO [train.py:996] (3/4) Epoch 1, batch 4150, loss[loss=0.4334, simple_loss=0.4455, pruned_loss=0.2107, over 21068.00 frames. ], tot_loss[loss=0.3819, simple_loss=0.4125, pruned_loss=0.1757, over 4271578.83 frames. ], batch size: 608, lr: 4.21e-02, grad_scale: 8.0 2023-06-17 19:48:03,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=24900.0, ans=0.125 2023-06-17 19:48:07,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=24900.0, ans=0.125 2023-06-17 19:48:49,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.60 vs. limit=22.5 2023-06-17 19:49:38,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.05 vs. limit=22.5 2023-06-17 19:49:47,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=25080.0, ans=0.0 2023-06-17 19:49:51,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=25080.0, ans=0.125 2023-06-17 19:50:06,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=25140.0, ans=0.125 2023-06-17 19:50:12,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=25140.0, ans=0.5 2023-06-17 19:50:32,236 INFO [train.py:996] (3/4) Epoch 1, batch 4200, loss[loss=0.4893, simple_loss=0.4866, pruned_loss=0.246, over 21355.00 frames. ], tot_loss[loss=0.3816, simple_loss=0.4123, pruned_loss=0.1754, over 4275542.51 frames. ], batch size: 548, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:51:04,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.20 vs. limit=22.5 2023-06-17 19:51:13,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=25260.0, ans=0.0 2023-06-17 19:51:17,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.69 vs. 
limit=6.0 2023-06-17 19:51:17,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=25260.0, ans=0.005378260869565218 2023-06-17 19:51:18,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.271e+02 4.382e+02 6.312e+02 1.234e+03, threshold=8.764e+02, percent-clipped=8.0 2023-06-17 19:52:52,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=25440.0, ans=0.2 2023-06-17 19:52:55,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25440.0, ans=0.1 2023-06-17 19:53:04,745 INFO [train.py:996] (3/4) Epoch 1, batch 4250, loss[loss=0.4241, simple_loss=0.4181, pruned_loss=0.2151, over 20215.00 frames. ], tot_loss[loss=0.3903, simple_loss=0.421, pruned_loss=0.1798, over 4272573.45 frames. ], batch size: 702, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:53:14,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-17 19:53:44,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=25560.0, ans=10.0 2023-06-17 19:53:44,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=25560.0, ans=0.2 2023-06-17 19:53:55,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=25560.0, ans=0.125 2023-06-17 19:53:58,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=25560.0, ans=0.125 2023-06-17 19:54:31,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25680.0, ans=0.1 2023-06-17 19:55:33,831 INFO [train.py:996] (3/4) Epoch 1, batch 4300, loss[loss=0.3436, simple_loss=0.3868, pruned_loss=0.1502, over 21301.00 frames. ], tot_loss[loss=0.3986, simple_loss=0.4297, pruned_loss=0.1837, over 4277333.72 frames. 
], batch size: 176, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:55:34,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=25800.0, ans=0.125 2023-06-17 19:56:00,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=25800.0, ans=0.125 2023-06-17 19:56:16,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25860.0, ans=0.1 2023-06-17 19:56:30,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 4.361e+02 6.749e+02 9.023e+02 1.594e+03, threshold=1.350e+03, percent-clipped=28.0 2023-06-17 19:56:48,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=25920.0, ans=0.05 2023-06-17 19:56:54,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=25920.0, ans=0.0 2023-06-17 19:56:55,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=25920.0, ans=0.2 2023-06-17 19:57:04,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25920.0, ans=0.1 2023-06-17 19:58:04,455 INFO [train.py:996] (3/4) Epoch 1, batch 4350, loss[loss=0.3342, simple_loss=0.348, pruned_loss=0.1602, over 21220.00 frames. ], tot_loss[loss=0.395, simple_loss=0.4256, pruned_loss=0.1822, over 4266436.42 frames. ], batch size: 548, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:58:22,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26100.0, ans=0.1 2023-06-17 19:58:47,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. limit=6.0 2023-06-17 19:59:15,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=26220.0, ans=0.2 2023-06-17 19:59:27,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26280.0, ans=0.1 2023-06-17 20:00:06,847 INFO [train.py:996] (3/4) Epoch 1, batch 4400, loss[loss=0.4333, simple_loss=0.4724, pruned_loss=0.1971, over 21656.00 frames. ], tot_loss[loss=0.3914, simple_loss=0.4208, pruned_loss=0.1809, over 4268760.25 frames. ], batch size: 414, lr: 4.18e-02, grad_scale: 16.0 2023-06-17 20:00:14,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=26400.0, ans=0.07 2023-06-17 20:00:21,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26400.0, ans=0.1 2023-06-17 20:00:23,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.74 vs. 
limit=15.0 2023-06-17 20:00:53,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 3.739e+02 5.540e+02 6.939e+02 1.405e+03, threshold=1.108e+03, percent-clipped=1.0 2023-06-17 20:01:36,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=26520.0, ans=0.0 2023-06-17 20:01:52,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=26580.0, ans=0.0 2023-06-17 20:02:32,190 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:02:38,864 INFO [train.py:996] (3/4) Epoch 1, batch 4450, loss[loss=0.523, simple_loss=0.5237, pruned_loss=0.2612, over 21573.00 frames. ], tot_loss[loss=0.3947, simple_loss=0.4267, pruned_loss=0.1814, over 4272299.33 frames. ], batch size: 471, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 20:03:20,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=26760.0, ans=10.0 2023-06-17 20:03:59,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=26820.0, ans=0.0050391304347826085 2023-06-17 20:04:11,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26880.0, ans=0.1 2023-06-17 20:04:47,885 INFO [train.py:996] (3/4) Epoch 1, batch 4500, loss[loss=0.439, simple_loss=0.4283, pruned_loss=0.2249, over 20182.00 frames. ], tot_loss[loss=0.3988, simple_loss=0.4291, pruned_loss=0.1842, over 4279282.49 frames. ], batch size: 707, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 20:05:15,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=27060.0, ans=0.09899494936611666 2023-06-17 20:05:40,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.441e+02 4.481e+02 5.907e+02 7.861e+02 1.389e+03, threshold=1.181e+03, percent-clipped=9.0 2023-06-17 20:05:48,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=27060.0, ans=0.125 2023-06-17 20:05:56,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=27120.0, ans=0.125 2023-06-17 20:06:05,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=27120.0, ans=0.2 2023-06-17 20:07:07,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0 2023-06-17 20:07:14,411 INFO [train.py:996] (3/4) Epoch 1, batch 4550, loss[loss=0.4854, simple_loss=0.4973, pruned_loss=0.2368, over 21347.00 frames. ], tot_loss[loss=0.4016, simple_loss=0.4329, pruned_loss=0.1851, over 4282318.13 frames. ], batch size: 549, lr: 4.16e-02, grad_scale: 4.0 2023-06-17 20:09:37,113 INFO [train.py:996] (3/4) Epoch 1, batch 4600, loss[loss=0.3647, simple_loss=0.4039, pruned_loss=0.1628, over 21832.00 frames. ], tot_loss[loss=0.405, simple_loss=0.4348, pruned_loss=0.1876, over 4282206.01 frames. 
], batch size: 351, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 20:09:40,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=27600.0, ans=0.1 2023-06-17 20:10:27,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.841e+02 4.647e+02 5.664e+02 1.586e+03, threshold=9.294e+02, percent-clipped=2.0 2023-06-17 20:10:33,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27720.0, ans=0.1 2023-06-17 20:10:41,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=27720.0, ans=0.125 2023-06-17 20:12:02,317 INFO [train.py:996] (3/4) Epoch 1, batch 4650, loss[loss=0.3134, simple_loss=0.3581, pruned_loss=0.1344, over 21774.00 frames. ], tot_loss[loss=0.3941, simple_loss=0.4237, pruned_loss=0.1822, over 4286791.99 frames. ], batch size: 391, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 20:12:39,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=27960.0, ans=0.125 2023-06-17 20:13:53,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-17 20:14:04,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28140.0, ans=0.1 2023-06-17 20:14:05,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=28140.0, ans=0.2 2023-06-17 20:14:13,013 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:14:13,955 INFO [train.py:996] (3/4) Epoch 1, batch 4700, loss[loss=0.3253, simple_loss=0.3437, pruned_loss=0.1535, over 21238.00 frames. ], tot_loss[loss=0.3839, simple_loss=0.4127, pruned_loss=0.1775, over 4285498.00 frames. ], batch size: 159, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 20:14:14,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=28200.0, ans=0.125 2023-06-17 20:14:53,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=28260.0, ans=0.004726086956521739 2023-06-17 20:15:16,495 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 4.151e+02 4.936e+02 6.766e+02 1.742e+03, threshold=9.871e+02, percent-clipped=9.0 2023-06-17 20:15:36,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28320.0, ans=0.1 2023-06-17 20:15:47,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.56 vs. 
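limit=22.5

Each Whitening line compares a measured whiteness metric of a module's activations against a limit; the auxiliary whitening penalty only activates once the metric drifts above that limit, so most of these entries are purely informational. The sketch below computes one standard whiteness measure, the mean squared eigenvalue of the per-group covariance divided by the squared mean eigenvalue, which is 1.0 for perfectly white features and approaches num_channels when a single direction dominates. It is an assumption-labeled stand-in, not necessarily the exact formula in scaling.py.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (num_frames, num_channels). Returns a scalar >= 1 that equals 1.0
        # when each group's feature covariance is proportional to the identity.
        n, c = x.shape
        assert c % num_groups == 0
        xg = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)  # (g, n, c/g)
        cov = xg.transpose(1, 2) @ xg / n                               # (g, c/g, c/g)
        trace = cov.diagonal(dim1=-2, dim2=-1).sum(-1)                  # tr(cov) per group
        frob_sq = (cov ** 2).sum(dim=(-2, -1))                          # tr(cov @ cov); cov is symmetric
        return (cov.shape[-1] * frob_sq / trace ** 2).mean()

    x = torch.randn(512, 256)          # white by construction
    print(whitening_metric(x).item())  # ~1.5: the ideal 1.0 plus finite-sample
                                       # noise of roughly num_channels/num_frames

Read this way, an entry like metric=14.56 vs. limit=22.5 (the one just completed above) says the self_attn1 activations are noticeably anisotropic but still within tolerance.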
2023-06-17 20:16:02,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=28440.0, ans=0.0
2023-06-17 20:16:17,011 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-17 20:16:17,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=28440.0, ans=0.125
2023-06-17 20:16:24,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=28440.0, ans=0.125
2023-06-17 20:16:31,465 INFO [train.py:996] (3/4) Epoch 1, batch 4750, loss[loss=0.4033, simple_loss=0.4155, pruned_loss=0.1956, over 21323.00 frames. ], tot_loss[loss=0.382, simple_loss=0.4081, pruned_loss=0.1779, over 4285220.29 frames. ], batch size: 159, lr: 4.14e-02, grad_scale: 8.0
2023-06-17 20:16:57,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=28500.0, ans=0.125
2023-06-17 20:17:40,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.10 vs. limit=15.0
2023-06-17 20:18:07,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=28680.0, ans=0.2
2023-06-17 20:18:30,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=28680.0, ans=0.125
2023-06-17 20:19:08,694 INFO [train.py:996] (3/4) Epoch 1, batch 4800, loss[loss=0.38, simple_loss=0.3967, pruned_loss=0.1817, over 21305.00 frames. ], tot_loss[loss=0.3848, simple_loss=0.4107, pruned_loss=0.1795, over 4284047.95 frames. ], batch size: 143, lr: 4.13e-02, grad_scale: 16.0
2023-06-17 20:19:18,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=28800.0, ans=0.125
2023-06-17 20:19:24,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=28860.0, ans=0.125
2023-06-17 20:19:33,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 4.275e+02 5.086e+02 6.755e+02 1.816e+03, threshold=1.017e+03, percent-clipped=8.0
2023-06-17 20:20:20,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28980.0, ans=0.1
2023-06-17 20:20:40,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=29040.0, ans=0.0
2023-06-17 20:21:00,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=29100.0, ans=0.125
2023-06-17 20:21:04,165 INFO [train.py:996] (3/4) Epoch 1, batch 4850, loss[loss=0.4071, simple_loss=0.4253, pruned_loss=0.1945, over 21823.00 frames. ], tot_loss[loss=0.3825, simple_loss=0.409, pruned_loss=0.1781, over 4282366.71 frames.
], batch size: 332, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 20:22:13,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=29220.0, ans=0.0 2023-06-17 20:22:14,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=29220.0, ans=0.125 2023-06-17 20:22:42,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.18 vs. limit=15.0 2023-06-17 20:22:43,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=29280.0, ans=0.125 2023-06-17 20:23:05,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=29340.0, ans=0.125 2023-06-17 20:23:37,867 INFO [train.py:996] (3/4) Epoch 1, batch 4900, loss[loss=0.5012, simple_loss=0.5016, pruned_loss=0.2504, over 21516.00 frames. ], tot_loss[loss=0.3864, simple_loss=0.4129, pruned_loss=0.18, over 4279997.21 frames. ], batch size: 508, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 20:23:48,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=29400.0, ans=0.0 2023-06-17 20:23:51,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29400.0, ans=0.1 2023-06-17 20:24:08,469 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.531e+02 3.410e+02 4.347e+02 5.276e+02 1.356e+03, threshold=8.693e+02, percent-clipped=2.0 2023-06-17 20:24:10,645 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:25:57,161 INFO [train.py:996] (3/4) Epoch 1, batch 4950, loss[loss=0.3885, simple_loss=0.4449, pruned_loss=0.1661, over 21649.00 frames. ], tot_loss[loss=0.3862, simple_loss=0.4175, pruned_loss=0.1774, over 4275096.95 frames. ], batch size: 441, lr: 4.11e-02, grad_scale: 16.0 2023-06-17 20:26:04,037 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-17 20:26:09,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-17 20:26:26,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=29760.0, ans=0.0 2023-06-17 20:26:31,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=29760.0, ans=0.004399999999999999 2023-06-17 20:26:50,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.32 vs. limit=6.0 2023-06-17 20:28:07,657 INFO [train.py:996] (3/4) Epoch 1, batch 5000, loss[loss=0.2885, simple_loss=0.365, pruned_loss=0.106, over 21421.00 frames. ], tot_loss[loss=0.3784, simple_loss=0.4135, pruned_loss=0.1717, over 4273338.40 frames. 
], batch size: 194, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 20:28:42,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.507e+02 4.413e+02 5.456e+02 1.135e+03, threshold=8.826e+02, percent-clipped=2.0 2023-06-17 20:29:18,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30120.0, ans=0.1 2023-06-17 20:29:30,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=30180.0, ans=0.0 2023-06-17 20:29:42,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=30240.0, ans=0.2 2023-06-17 20:29:47,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=30240.0, ans=0.125 2023-06-17 20:29:48,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=30240.0, ans=0.125 2023-06-17 20:30:11,510 INFO [train.py:996] (3/4) Epoch 1, batch 5050, loss[loss=0.3677, simple_loss=0.3961, pruned_loss=0.1696, over 21581.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.4138, pruned_loss=0.173, over 4277644.90 frames. ], batch size: 195, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 20:30:59,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=30360.0, ans=0.125 2023-06-17 20:31:02,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=30360.0, ans=0.125 2023-06-17 20:31:23,450 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2023-06-17 20:32:32,733 INFO [train.py:996] (3/4) Epoch 1, batch 5100, loss[loss=0.3427, simple_loss=0.3747, pruned_loss=0.1553, over 21389.00 frames. ], tot_loss[loss=0.3796, simple_loss=0.4132, pruned_loss=0.173, over 4278123.85 frames. ], batch size: 159, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 20:32:34,668 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:32:36,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=30600.0, ans=0.0042173913043478265 2023-06-17 20:32:37,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=30600.0, ans=0.125 2023-06-17 20:33:03,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-17 20:33:25,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.491e+02 3.812e+02 4.644e+02 6.399e+02 1.305e+03, threshold=9.287e+02, percent-clipped=10.0 2023-06-17 20:33:52,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=30720.0, ans=0.2 2023-06-17 20:34:31,292 INFO [train.py:996] (3/4) Epoch 1, batch 5150, loss[loss=0.4001, simple_loss=0.4297, pruned_loss=0.1852, over 21802.00 frames. ], tot_loss[loss=0.3825, simple_loss=0.414, pruned_loss=0.1755, over 4281491.69 frames. 
], batch size: 332, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 20:36:11,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=31080.0, ans=0.2 2023-06-17 20:36:27,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2023-06-17 20:36:52,080 INFO [train.py:996] (3/4) Epoch 1, batch 5200, loss[loss=0.3266, simple_loss=0.381, pruned_loss=0.1361, over 21249.00 frames. ], tot_loss[loss=0.3828, simple_loss=0.4154, pruned_loss=0.1751, over 4283868.81 frames. ], batch size: 159, lr: 4.08e-02, grad_scale: 32.0 2023-06-17 20:36:55,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31200.0, ans=0.1 2023-06-17 20:37:49,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.764e+02 3.911e+02 4.920e+02 6.306e+02 1.130e+03, threshold=9.840e+02, percent-clipped=5.0 2023-06-17 20:38:00,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31320.0, ans=0.1 2023-06-17 20:38:01,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=22.5 2023-06-17 20:38:19,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-17 20:38:54,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=31440.0, ans=0.0 2023-06-17 20:39:01,690 INFO [train.py:996] (3/4) Epoch 1, batch 5250, loss[loss=0.4683, simple_loss=0.4687, pruned_loss=0.2339, over 21770.00 frames. ], tot_loss[loss=0.3788, simple_loss=0.416, pruned_loss=0.1708, over 4281964.70 frames. ], batch size: 441, lr: 4.07e-02, grad_scale: 32.0 2023-06-17 20:41:23,878 INFO [train.py:996] (3/4) Epoch 1, batch 5300, loss[loss=0.4171, simple_loss=0.4249, pruned_loss=0.2047, over 21919.00 frames. ], tot_loss[loss=0.3819, simple_loss=0.4172, pruned_loss=0.1733, over 4291294.90 frames. ], batch size: 414, lr: 4.07e-02, grad_scale: 32.0 2023-06-17 20:41:25,851 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:42:14,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=31860.0, ans=0.025 2023-06-17 20:42:26,440 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.571e+02 4.372e+02 6.573e+02 1.564e+03, threshold=8.743e+02, percent-clipped=7.0 2023-06-17 20:42:31,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=31920.0, ans=0.0 2023-06-17 20:42:34,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=31920.0, ans=0.0 2023-06-17 20:43:18,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=32040.0, ans=0.125 2023-06-17 20:43:27,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.98 vs. 
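limit=15.0

The recurring optim.py:471 lines summarize adaptive gradient clipping. The five numbers after "grad-norm quartiles" are evidently the min, 25%, median, 75% and max of recently observed batch gradient norms; threshold is Clipping_scale (2.0 throughout) times a median-based statistic of those norms, and percent-clipped is the share of recent batches whose norm exceeded the threshold. Below is a rough, hypothetical re-creation of that bookkeeping (GradNormClipper is invented for illustration; it is not the ScaledAdam implementation):

    from collections import deque

    import torch

    class GradNormClipper:
        # Hypothetical sketch: clip to clipping_scale * median of recent
        # gradient norms, and report statistics like the log lines above.
        def __init__(self, clipping_scale: float = 2.0, history: int = 200):
            self.scale = clipping_scale
            self.norms = deque(maxlen=history)
            self.was_clipped = deque(maxlen=history)

        def clip_(self, params) -> float:
            grads = [p.grad for p in params if p.grad is not None]
            norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
            self.norms.append(norm)
            threshold = self.scale * torch.tensor(list(self.norms)).median().item()
            self.was_clipped.append(norm > threshold)
            if norm > threshold:
                for g in grads:
                    g.mul_(threshold / norm)  # rescale gradients in place
            return norm

        def summary(self) -> str:
            qs = torch.tensor(list(self.norms)).quantile(
                torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])).tolist()
            pct = 100.0 * sum(self.was_clipped) / len(self.was_clipped)
            return ("grad-norm quartiles "
                    + " ".join(f"{q:.3e}" for q in qs)
                    + f", threshold={self.scale * qs[2]:.3e}"
                    + f", percent-clipped={pct:.1f}")

Occasional spikes in percent-clipped (28.0 near the top of this section) mark stretches where many batches produced gradient norms well above their recent median.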
2023-06-17 20:43:47,264 INFO [train.py:996] (3/4) Epoch 1, batch 5350, loss[loss=0.374, simple_loss=0.3952, pruned_loss=0.1764, over 21467.00 frames. ], tot_loss[loss=0.3836, simple_loss=0.4173, pruned_loss=0.175, over 4290268.78 frames. ], batch size: 159, lr: 4.06e-02, grad_scale: 32.0
2023-06-17 20:44:35,287 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-17 20:44:45,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=32220.0, ans=0.125
2023-06-17 20:44:54,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=32220.0, ans=0.125
2023-06-17 20:45:15,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=32280.0, ans=0.0
2023-06-17 20:45:32,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=32340.0, ans=0.125
2023-06-17 20:45:32,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=32340.0, ans=0.2
2023-06-17 20:45:47,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=32340.0, ans=0.05
2023-06-17 20:46:12,769 INFO [train.py:996] (3/4) Epoch 1, batch 5400, loss[loss=0.3431, simple_loss=0.3903, pruned_loss=0.148, over 21754.00 frames. ], tot_loss[loss=0.3846, simple_loss=0.4164, pruned_loss=0.1764, over 4294147.70 frames. ], batch size: 391, lr: 4.05e-02, grad_scale: 32.0
2023-06-17 20:46:50,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=32460.0, ans=0.2
2023-06-17 20:46:57,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 3.975e+02 4.705e+02 6.164e+02 1.546e+03, threshold=9.411e+02, percent-clipped=5.0
2023-06-17 20:47:01,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=32520.0, ans=0.0038000000000000004
2023-06-17 20:47:37,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=32580.0, ans=0.125
2023-06-17 20:47:41,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.58 vs. limit=10.0
2023-06-17 20:48:16,836 INFO [train.py:996] (3/4) Epoch 1, batch 5450, loss[loss=0.3614, simple_loss=0.4432, pruned_loss=0.1398, over 21688.00 frames. ], tot_loss[loss=0.3813, simple_loss=0.4158, pruned_loss=0.1734, over 4297400.08 frames.
], batch size: 247, lr: 4.05e-02, grad_scale: 32.0 2023-06-17 20:48:42,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32700.0, ans=0.1 2023-06-17 20:48:53,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=32760.0, ans=0.0 2023-06-17 20:49:02,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32760.0, ans=0.1 2023-06-17 20:49:29,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32820.0, ans=0.1 2023-06-17 20:50:17,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-17 20:50:27,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=32940.0, ans=0.2 2023-06-17 20:50:39,998 INFO [train.py:996] (3/4) Epoch 1, batch 5500, loss[loss=0.5126, simple_loss=0.5773, pruned_loss=0.2239, over 19799.00 frames. ], tot_loss[loss=0.3779, simple_loss=0.4176, pruned_loss=0.1691, over 4291966.07 frames. ], batch size: 702, lr: 4.04e-02, grad_scale: 32.0 2023-06-17 20:51:21,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=33060.0, ans=0.0 2023-06-17 20:51:30,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.468e+02 4.497e+02 5.642e+02 1.011e+03, threshold=8.995e+02, percent-clipped=2.0 2023-06-17 20:52:18,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=33180.0, ans=15.0 2023-06-17 20:52:45,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=33240.0, ans=0.125 2023-06-17 20:53:14,076 INFO [train.py:996] (3/4) Epoch 1, batch 5550, loss[loss=0.2761, simple_loss=0.3455, pruned_loss=0.1033, over 21673.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.4128, pruned_loss=0.1639, over 4287641.58 frames. ], batch size: 247, lr: 4.03e-02, grad_scale: 32.0 2023-06-17 20:54:14,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=22.5 2023-06-17 20:55:28,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33540.0, ans=0.1 2023-06-17 20:55:30,608 INFO [train.py:996] (3/4) Epoch 1, batch 5600, loss[loss=0.4252, simple_loss=0.4445, pruned_loss=0.203, over 20013.00 frames. ], tot_loss[loss=0.3636, simple_loss=0.409, pruned_loss=0.1591, over 4285120.15 frames. 
], batch size: 702, lr: 4.03e-02, grad_scale: 32.0 2023-06-17 20:55:38,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=33600.0, ans=0.125 2023-06-17 20:55:42,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=33600.0, ans=0.2 2023-06-17 20:56:05,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.459e+02 4.877e+02 6.636e+02 1.371e+03, threshold=9.753e+02, percent-clipped=8.0 2023-06-17 20:56:18,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=33720.0, ans=0.125 2023-06-17 20:56:58,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=33780.0, ans=0.125 2023-06-17 20:57:49,976 INFO [train.py:996] (3/4) Epoch 1, batch 5650, loss[loss=0.4017, simple_loss=0.4675, pruned_loss=0.1679, over 21213.00 frames. ], tot_loss[loss=0.3695, simple_loss=0.4134, pruned_loss=0.1628, over 4282189.85 frames. ], batch size: 548, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 20:57:51,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=33900.0, ans=0.125 2023-06-17 21:00:01,363 INFO [train.py:996] (3/4) Epoch 1, batch 5700, loss[loss=0.3276, simple_loss=0.3883, pruned_loss=0.1335, over 21749.00 frames. ], tot_loss[loss=0.3723, simple_loss=0.4129, pruned_loss=0.1659, over 4282173.60 frames. ], batch size: 282, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 21:00:40,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=34260.0, ans=0.125 2023-06-17 21:01:02,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.613e+02 3.406e+02 4.056e+02 5.524e+02 1.397e+03, threshold=8.113e+02, percent-clipped=5.0 2023-06-17 21:01:10,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=34320.0, ans=0.125 2023-06-17 21:01:43,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=34380.0, ans=0.0033956521739130436 2023-06-17 21:02:26,270 INFO [train.py:996] (3/4) Epoch 1, batch 5750, loss[loss=0.3528, simple_loss=0.4378, pruned_loss=0.1339, over 20808.00 frames. ], tot_loss[loss=0.365, simple_loss=0.407, pruned_loss=0.1615, over 4271743.19 frames. ], batch size: 608, lr: 4.01e-02, grad_scale: 32.0 2023-06-17 21:02:26,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34500.0, ans=0.1 2023-06-17 21:03:26,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=34620.0, ans=0.0 2023-06-17 21:03:50,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=34620.0, ans=0.0 2023-06-17 21:03:52,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=34620.0, ans=0.125 2023-06-17 21:04:53,010 INFO [train.py:996] (3/4) Epoch 1, batch 5800, loss[loss=0.2599, simple_loss=0.3027, pruned_loss=0.1086, over 21935.00 frames. ], tot_loss[loss=0.3614, simple_loss=0.4049, pruned_loss=0.159, over 4266759.72 frames. 
], batch size: 98, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 21:05:11,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=34860.0, ans=0.125 2023-06-17 21:05:11,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=34860.0, ans=0.0032913043478260866 2023-06-17 21:05:16,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=34860.0, ans=0.125 2023-06-17 21:05:33,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.587e+02 3.596e+02 4.280e+02 5.879e+02 1.064e+03, threshold=8.560e+02, percent-clipped=6.0 2023-06-17 21:06:21,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=34980.0, ans=0.125 2023-06-17 21:06:24,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=34980.0, ans=0.125 2023-06-17 21:07:15,162 INFO [train.py:996] (3/4) Epoch 1, batch 5850, loss[loss=0.3753, simple_loss=0.4566, pruned_loss=0.147, over 19845.00 frames. ], tot_loss[loss=0.3481, simple_loss=0.3982, pruned_loss=0.149, over 4271882.77 frames. ], batch size: 702, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 21:07:37,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=35160.0, ans=0.04949747468305833 2023-06-17 21:08:10,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=35220.0, ans=0.125 2023-06-17 21:08:20,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=35220.0, ans=0.125 2023-06-17 21:08:43,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=35280.0, ans=0.02 2023-06-17 21:08:50,687 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:09:29,300 INFO [train.py:996] (3/4) Epoch 1, batch 5900, loss[loss=0.3658, simple_loss=0.3899, pruned_loss=0.1708, over 21561.00 frames. ], tot_loss[loss=0.3353, simple_loss=0.3884, pruned_loss=0.1411, over 4277606.01 frames. ], batch size: 212, lr: 3.99e-02, grad_scale: 32.0 2023-06-17 21:10:06,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.990e+02 3.626e+02 5.703e+02 1.926e+03, threshold=7.252e+02, percent-clipped=11.0 2023-06-17 21:11:09,788 INFO [train.py:996] (3/4) Epoch 1, batch 5950, loss[loss=0.4135, simple_loss=0.4151, pruned_loss=0.206, over 21481.00 frames. ], tot_loss[loss=0.3458, simple_loss=0.3918, pruned_loss=0.1499, over 4275478.98 frames. ], batch size: 389, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 21:12:02,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0 2023-06-17 21:13:03,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=35940.0, ans=0.125 2023-06-17 21:13:16,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.94 vs. 
limit=15.0 2023-06-17 21:13:16,688 INFO [train.py:996] (3/4) Epoch 1, batch 6000, loss[loss=0.3898, simple_loss=0.4095, pruned_loss=0.1851, over 21500.00 frames. ], tot_loss[loss=0.3507, simple_loss=0.3914, pruned_loss=0.155, over 4275954.06 frames. ], batch size: 548, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 21:13:16,690 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-17 21:14:09,295 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3443, simple_loss=0.428, pruned_loss=0.1303, over 1796401.00 frames. 2023-06-17 21:14:09,296 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-17 21:14:40,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 3.654e+02 4.651e+02 5.653e+02 9.533e+02, threshold=9.302e+02, percent-clipped=10.0 2023-06-17 21:14:42,704 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.22 vs. limit=22.5 2023-06-17 21:15:33,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=36240.0, ans=0.125 2023-06-17 21:16:08,866 INFO [train.py:996] (3/4) Epoch 1, batch 6050, loss[loss=0.2881, simple_loss=0.3444, pruned_loss=0.1159, over 21554.00 frames. ], tot_loss[loss=0.35, simple_loss=0.388, pruned_loss=0.156, over 4281809.14 frames. ], batch size: 230, lr: 3.97e-02, grad_scale: 32.0 2023-06-17 21:16:10,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=36300.0, ans=0.125 2023-06-17 21:17:10,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.20 vs. limit=6.0 2023-06-17 21:17:34,061 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:17:51,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=36540.0, ans=0.125 2023-06-17 21:17:51,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36540.0, ans=0.1 2023-06-17 21:18:23,272 INFO [train.py:996] (3/4) Epoch 1, batch 6100, loss[loss=0.3458, simple_loss=0.3802, pruned_loss=0.1557, over 21616.00 frames. ], tot_loss[loss=0.347, simple_loss=0.3851, pruned_loss=0.1544, over 4286903.32 frames. ], batch size: 230, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 21:19:00,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.419e+02 4.285e+02 5.583e+02 1.372e+03, threshold=8.569e+02, percent-clipped=6.0 2023-06-17 21:19:13,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=36720.0, ans=0.05 2023-06-17 21:19:21,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=36720.0, ans=0.0 2023-06-17 21:19:50,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=36780.0, ans=0.5 2023-06-17 21:20:17,380 INFO [train.py:996] (3/4) Epoch 1, batch 6150, loss[loss=0.3449, simple_loss=0.3812, pruned_loss=0.1543, over 21600.00 frames. ], tot_loss[loss=0.3559, simple_loss=0.3913, pruned_loss=0.1603, over 4294346.96 frames. 
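], batch size: 263, lr: 3.96e-02, grad_scale: 16.0

The grad_scale field in these batch lines is the dynamic fp16 loss-scaling factor. Within this section it moves between 4.0 and 32.0: it is halved after a step whose gradients overflow (32.0 at batch 6100 down to 16.0 here at batch 6150) and doubled back after a long run of clean steps (32.0 again by batch 6400). That is standard dynamic loss scaling; a generic PyTorch AMP step showing the mechanism is sketched below, with illustrative settings rather than this recipe's exact wiring.

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=8.0,      # illustrative; the recipe's starting scale may differ
        growth_factor=2.0,   # double after `growth_interval` clean steps
        backoff_factor=0.5,  # halve on an inf/nan step, e.g. 32.0 -> 16.0
        growth_interval=2000,
    )

    def train_step(model, optimizer, batch, criterion):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = criterion(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()  # backprop through the scaled loss
        scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
        scaler.update()                # grows or backs off grad_scale
        return loss.detach(), scaler.get_scale()

A grad_scale that keeps falling without recovering would be a warning sign; the quick rebound seen here is the healthy pattern.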
2023-06-17 21:20:41,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=36900.0, ans=0.0
2023-06-17 21:20:56,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.42 vs. limit=15.0
2023-06-17 21:20:58,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=15.0
2023-06-17 21:21:07,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37020.0, ans=0.1
2023-06-17 21:21:09,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=37020.0, ans=0.0
2023-06-17 21:21:13,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=37020.0, ans=0.125
2023-06-17 21:21:37,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.01 vs. limit=22.5
2023-06-17 21:22:22,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=37140.0, ans=0.0
2023-06-17 21:22:41,973 INFO [train.py:996] (3/4) Epoch 1, batch 6200, loss[loss=0.3704, simple_loss=0.3832, pruned_loss=0.1788, over 21194.00 frames. ], tot_loss[loss=0.3574, simple_loss=0.3939, pruned_loss=0.1604, over 4297478.05 frames. ], batch size: 608, lr: 3.95e-02, grad_scale: 16.0
2023-06-17 21:22:52,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=37200.0, ans=0.125
2023-06-17 21:23:13,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.238e+02 4.206e+02 5.371e+02 1.012e+03, threshold=8.413e+02, percent-clipped=2.0
2023-06-17 21:24:51,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=37440.0, ans=0.0
2023-06-17 21:24:54,233 INFO [train.py:996] (3/4) Epoch 1, batch 6250, loss[loss=0.4491, simple_loss=0.4934, pruned_loss=0.2024, over 21399.00 frames. ], tot_loss[loss=0.3589, simple_loss=0.3988, pruned_loss=0.1595, over 4297527.46 frames. ], batch size: 548, lr: 3.94e-02, grad_scale: 16.0
2023-06-17 21:25:01,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=37500.0, ans=0.2
2023-06-17 21:25:31,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=37560.0, ans=0.0
2023-06-17 21:26:06,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0
2023-06-17 21:27:06,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=37740.0, ans=0.125
2023-06-17 21:27:20,507 INFO [train.py:996] (3/4) Epoch 1, batch 6300, loss[loss=0.3738, simple_loss=0.4081, pruned_loss=0.1698, over 21846.00 frames. ], tot_loss[loss=0.3604, simple_loss=0.4037, pruned_loss=0.1586, over 4291545.93 frames.
], batch size: 332, lr: 3.94e-02, grad_scale: 16.0 2023-06-17 21:28:10,416 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.757e+02 4.825e+02 6.859e+02 1.465e+03, threshold=9.649e+02, percent-clipped=15.0 2023-06-17 21:28:58,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-17 21:29:07,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=37980.0, ans=0.125 2023-06-17 21:29:27,498 INFO [train.py:996] (3/4) Epoch 1, batch 6350, loss[loss=0.4089, simple_loss=0.44, pruned_loss=0.1889, over 21801.00 frames. ], tot_loss[loss=0.3704, simple_loss=0.4103, pruned_loss=0.1653, over 4293789.90 frames. ], batch size: 282, lr: 3.93e-02, grad_scale: 16.0 2023-06-17 21:30:03,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=38160.0, ans=0.125 2023-06-17 21:30:23,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=12.0 2023-06-17 21:31:09,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.35 vs. limit=6.0 2023-06-17 21:31:46,189 INFO [train.py:996] (3/4) Epoch 1, batch 6400, loss[loss=0.3819, simple_loss=0.4141, pruned_loss=0.1749, over 21819.00 frames. ], tot_loss[loss=0.3802, simple_loss=0.4179, pruned_loss=0.1712, over 4288114.44 frames. ], batch size: 247, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 21:32:35,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.736e+02 4.493e+02 6.013e+02 1.011e+03, threshold=8.985e+02, percent-clipped=1.0 2023-06-17 21:33:23,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.73 vs. limit=10.0 2023-06-17 21:34:04,666 INFO [train.py:996] (3/4) Epoch 1, batch 6450, loss[loss=0.4544, simple_loss=0.524, pruned_loss=0.1924, over 20786.00 frames. ], tot_loss[loss=0.3767, simple_loss=0.4171, pruned_loss=0.1682, over 4287972.62 frames. ], batch size: 607, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 21:34:23,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=38760.0, ans=0.125 2023-06-17 21:34:27,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.61 vs. limit=6.0 2023-06-17 21:34:51,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=38820.0, ans=0.002430434782608696 2023-06-17 21:36:12,922 INFO [train.py:996] (3/4) Epoch 1, batch 6500, loss[loss=0.3227, simple_loss=0.3488, pruned_loss=0.1484, over 21198.00 frames. ], tot_loss[loss=0.3723, simple_loss=0.4099, pruned_loss=0.1674, over 4284516.43 frames. ], batch size: 144, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 21:36:40,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.47 vs. 
limit=12.0 2023-06-17 21:36:44,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=39060.0, ans=0.04949747468305833 2023-06-17 21:36:59,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.413e+02 4.587e+02 6.044e+02 1.414e+03, threshold=9.175e+02, percent-clipped=8.0 2023-06-17 21:38:00,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.39 vs. limit=15.0 2023-06-17 21:38:31,984 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.62 vs. limit=22.5 2023-06-17 21:38:35,543 INFO [train.py:996] (3/4) Epoch 1, batch 6550, loss[loss=0.3729, simple_loss=0.4109, pruned_loss=0.1675, over 21871.00 frames. ], tot_loss[loss=0.3715, simple_loss=0.41, pruned_loss=0.1665, over 4283509.45 frames. ], batch size: 316, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 21:38:49,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=39360.0, ans=0.125 2023-06-17 21:39:41,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=39420.0, ans=0.0 2023-06-17 21:40:34,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=39600.0, ans=0.2 2023-06-17 21:40:35,056 INFO [train.py:996] (3/4) Epoch 1, batch 6600, loss[loss=0.3284, simple_loss=0.3525, pruned_loss=0.1522, over 21277.00 frames. ], tot_loss[loss=0.368, simple_loss=0.4063, pruned_loss=0.1648, over 4267319.69 frames. ], batch size: 144, lr: 3.90e-02, grad_scale: 32.0 2023-06-17 21:41:23,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.711e+02 4.295e+02 5.281e+02 1.119e+03, threshold=8.590e+02, percent-clipped=2.0 2023-06-17 21:41:26,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=39720.0, ans=0.1 2023-06-17 21:41:32,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=39720.0, ans=0.125 2023-06-17 21:42:27,414 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:42:34,155 INFO [train.py:996] (3/4) Epoch 1, batch 6650, loss[loss=0.3112, simple_loss=0.3369, pruned_loss=0.1428, over 21305.00 frames. ], tot_loss[loss=0.3578, simple_loss=0.3961, pruned_loss=0.1597, over 4260610.78 frames. ], batch size: 159, lr: 3.89e-02, grad_scale: 32.0 2023-06-17 21:43:51,453 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-06-17 21:44:06,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=15.0 2023-06-17 21:44:49,223 INFO [train.py:996] (3/4) Epoch 1, batch 6700, loss[loss=0.2953, simple_loss=0.3433, pruned_loss=0.1236, over 21530.00 frames. ], tot_loss[loss=0.3565, simple_loss=0.3935, pruned_loss=0.1597, over 4259504.10 frames. 
], batch size: 230, lr: 3.89e-02, grad_scale: 32.0 2023-06-17 21:45:41,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 3.533e+02 4.521e+02 6.052e+02 1.154e+03, threshold=9.041e+02, percent-clipped=5.0 2023-06-17 21:45:46,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-17 21:46:09,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40380.0, ans=0.1 2023-06-17 21:47:11,741 INFO [train.py:996] (3/4) Epoch 1, batch 6750, loss[loss=0.3959, simple_loss=0.4114, pruned_loss=0.1902, over 21864.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.3896, pruned_loss=0.1597, over 4269666.82 frames. ], batch size: 351, lr: 3.88e-02, grad_scale: 32.0 2023-06-17 21:47:52,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-17 21:48:14,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.33 vs. limit=22.5 2023-06-17 21:48:29,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40680.0, ans=0.1 2023-06-17 21:48:48,992 INFO [train.py:996] (3/4) Epoch 1, batch 6800, loss[loss=0.518, simple_loss=0.5779, pruned_loss=0.229, over 19780.00 frames. ], tot_loss[loss=0.3599, simple_loss=0.3925, pruned_loss=0.1637, over 4271515.71 frames. ], batch size: 702, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 21:48:59,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.16 vs. limit=22.5 2023-06-17 21:49:38,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 3.387e+02 4.190e+02 5.566e+02 1.112e+03, threshold=8.380e+02, percent-clipped=6.0 2023-06-17 21:50:35,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41100.0, ans=0.1 2023-06-17 21:50:36,590 INFO [train.py:996] (3/4) Epoch 1, batch 6850, loss[loss=0.3286, simple_loss=0.3545, pruned_loss=0.1513, over 21566.00 frames. ], tot_loss[loss=0.3589, simple_loss=0.3886, pruned_loss=0.1646, over 4270719.81 frames. ], batch size: 263, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 21:50:46,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=41100.0, ans=0.125 2023-06-17 21:51:28,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=41220.0, ans=0.125 2023-06-17 21:51:34,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=41280.0, ans=0.125 2023-06-17 21:52:07,067 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:52:10,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=41280.0, ans=0.125 2023-06-17 21:52:33,804 INFO [train.py:996] (3/4) Epoch 1, batch 6900, loss[loss=0.3467, simple_loss=0.4087, pruned_loss=0.1424, over 21708.00 frames. 
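], tot_loss[loss=0.3604, simple_loss=0.3901, pruned_loss=0.1653, over 4275055.56 frames. ], batch size: 414, lr: 3.86e-02, grad_scale: 32.0

Each batch line reports three losses: simple_loss from the simple linear joiner that pruned RNN-T training uses to obtain pruning bounds, pruned_loss from the full joiner evaluated only inside the pruned region, and their weighted combination, loss. Throughout this section the combined value is consistent with loss = 0.5 * simple_loss + pruned_loss; for the tot_loss just above, 0.5 * 0.3901 + 0.1653 = 0.3604. A one-line sketch of that combination, with the weights read off the log rather than taken from the training code:

    def combine_transducer_losses(simple_loss: float,
                                  pruned_loss: float,
                                  simple_scale: float = 0.5,
                                  pruned_scale: float = 1.0) -> float:
        # Weighted total matching the logged relation loss ~= 0.5*simple + pruned.
        return simple_scale * simple_loss + pruned_scale * pruned_loss

    assert abs(combine_transducer_losses(0.3901, 0.1653) - 0.3604) < 5e-4

The same relation holds for the validation entries earlier in the log, a quick sanity check that train and valid losses are combined identically.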
2023-06-17 21:53:08,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0
2023-06-17 21:53:11,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=41460.0, ans=0.125
2023-06-17 21:53:31,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 3.488e+02 4.127e+02 5.234e+02 1.332e+03, threshold=8.254e+02, percent-clipped=6.0
2023-06-17 21:53:42,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0
2023-06-17 21:54:47,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=41640.0, ans=0.0
2023-06-17 21:54:58,877 INFO [train.py:996] (3/4) Epoch 1, batch 6950, loss[loss=0.4194, simple_loss=0.4539, pruned_loss=0.1924, over 21846.00 frames. ], tot_loss[loss=0.3559, simple_loss=0.3907, pruned_loss=0.1606, over 4281081.34 frames. ], batch size: 118, lr: 3.85e-02, grad_scale: 32.0
2023-06-17 21:55:29,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=41700.0, ans=0.1
2023-06-17 21:56:07,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=41820.0, ans=0.035
2023-06-17 21:56:11,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=41820.0, ans=15.0
2023-06-17 21:56:38,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=41880.0, ans=0.125
2023-06-17 21:56:49,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=15.0
2023-06-17 21:57:23,696 INFO [train.py:996] (3/4) Epoch 1, batch 7000, loss[loss=0.433, simple_loss=0.4148, pruned_loss=0.2256, over 21340.00 frames. ], tot_loss[loss=0.364, simple_loss=0.3957, pruned_loss=0.1661, over 4279999.84 frames. ], batch size: 508, lr: 3.85e-02, grad_scale: 32.0
2023-06-17 21:57:59,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=42060.0, ans=0.2
2023-06-17 21:57:59,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=42060.0, ans=0.125
2023-06-17 21:58:00,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 4.182e+02 5.502e+02 7.553e+02 1.291e+03, threshold=1.100e+03, percent-clipped=22.0
2023-06-17 21:59:17,465 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-17 21:59:25,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0
2023-06-17 21:59:34,222 INFO [train.py:996] (3/4) Epoch 1, batch 7050, loss[loss=0.3259, simple_loss=0.3818, pruned_loss=0.135, over 21586.00 frames. ], tot_loss[loss=0.3575, simple_loss=0.3907, pruned_loss=0.1621, over 4276216.41 frames.
], batch size: 263, lr: 3.84e-02, grad_scale: 32.0 2023-06-17 22:00:02,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42360.0, ans=0.1 2023-06-17 22:01:31,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=22.5 2023-06-17 22:01:52,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=42540.0, ans=0.0 2023-06-17 22:01:55,179 INFO [train.py:996] (3/4) Epoch 1, batch 7100, loss[loss=0.3244, simple_loss=0.3734, pruned_loss=0.1377, over 21725.00 frames. ], tot_loss[loss=0.3634, simple_loss=0.3975, pruned_loss=0.1647, over 4278927.59 frames. ], batch size: 332, lr: 3.83e-02, grad_scale: 32.0 2023-06-17 22:02:12,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=42600.0, ans=0.125 2023-06-17 22:02:16,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=42600.0, ans=0.125 2023-06-17 22:02:27,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-17 22:02:33,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=42660.0, ans=0.125 2023-06-17 22:02:53,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 3.347e+02 4.129e+02 5.601e+02 1.207e+03, threshold=8.258e+02, percent-clipped=3.0 2023-06-17 22:03:06,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=42720.0, ans=0.125 2023-06-17 22:03:38,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=42780.0, ans=0.05 2023-06-17 22:04:07,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42840.0, ans=0.1 2023-06-17 22:04:11,655 INFO [train.py:996] (3/4) Epoch 1, batch 7150, loss[loss=0.4192, simple_loss=0.4479, pruned_loss=0.1952, over 21611.00 frames. ], tot_loss[loss=0.3562, simple_loss=0.3926, pruned_loss=0.1599, over 4283604.10 frames. ], batch size: 389, lr: 3.83e-02, grad_scale: 32.0 2023-06-17 22:04:48,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=42960.0, ans=0.125 2023-06-17 22:05:32,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=43020.0, ans=0.125 2023-06-17 22:06:33,820 INFO [train.py:996] (3/4) Epoch 1, batch 7200, loss[loss=0.3296, simple_loss=0.3608, pruned_loss=0.1492, over 21182.00 frames. ], tot_loss[loss=0.3637, simple_loss=0.3968, pruned_loss=0.1653, over 4280189.27 frames. 
], batch size: 159, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 22:06:56,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43200.0, ans=0.1 2023-06-17 22:07:18,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 3.282e+02 4.227e+02 5.672e+02 1.166e+03, threshold=8.454e+02, percent-clipped=6.0 2023-06-17 22:07:35,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=43320.0, ans=0.0 2023-06-17 22:07:39,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=43320.0, ans=0.125 2023-06-17 22:07:59,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.14 vs. limit=15.0 2023-06-17 22:08:29,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-17 22:08:49,366 INFO [train.py:996] (3/4) Epoch 1, batch 7250, loss[loss=0.3099, simple_loss=0.3312, pruned_loss=0.1443, over 21237.00 frames. ], tot_loss[loss=0.3602, simple_loss=0.3912, pruned_loss=0.1646, over 4280276.62 frames. ], batch size: 549, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 22:08:51,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=43500.0, ans=0.125 2023-06-17 22:10:16,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=43680.0, ans=0.0 2023-06-17 22:10:30,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=43740.0, ans=0.2 2023-06-17 22:10:30,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43740.0, ans=0.1 2023-06-17 22:10:47,947 INFO [train.py:996] (3/4) Epoch 1, batch 7300, loss[loss=0.3302, simple_loss=0.365, pruned_loss=0.1477, over 21364.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3829, pruned_loss=0.1608, over 4270413.33 frames. ], batch size: 131, lr: 3.81e-02, grad_scale: 32.0 2023-06-17 22:10:59,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=43800.0, ans=0.05 2023-06-17 22:11:20,836 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:11:54,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.416e+02 3.511e+02 4.213e+02 5.491e+02 1.019e+03, threshold=8.426e+02, percent-clipped=2.0 2023-06-17 22:12:03,100 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0 2023-06-17 22:12:03,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=43920.0, ans=0.125 2023-06-17 22:12:59,891 INFO [train.py:996] (3/4) Epoch 1, batch 7350, loss[loss=0.3534, simple_loss=0.3879, pruned_loss=0.1595, over 21704.00 frames. ], tot_loss[loss=0.3521, simple_loss=0.3803, pruned_loss=0.162, over 4272386.24 frames. 
], batch size: 298, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 22:13:19,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=44160.0, ans=0.125 2023-06-17 22:13:40,997 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:14:03,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-06-17 22:14:36,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=44340.0, ans=0.07 2023-06-17 22:14:54,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44340.0, ans=0.1 2023-06-17 22:14:57,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=44340.0, ans=0.02 2023-06-17 22:14:59,901 INFO [train.py:996] (3/4) Epoch 1, batch 7400, loss[loss=0.3372, simple_loss=0.3956, pruned_loss=0.1393, over 21828.00 frames. ], tot_loss[loss=0.3611, simple_loss=0.3901, pruned_loss=0.1661, over 4276026.47 frames. ], batch size: 317, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 22:15:47,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.591e+02 3.927e+02 4.881e+02 6.398e+02 1.158e+03, threshold=9.762e+02, percent-clipped=7.0 2023-06-17 22:16:06,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-17 22:16:12,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.45 vs. limit=22.5 2023-06-17 22:16:22,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.07 vs. limit=15.0 2023-06-17 22:16:51,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2023-06-17 22:16:52,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.40 vs. limit=6.0 2023-06-17 22:17:04,109 INFO [train.py:996] (3/4) Epoch 1, batch 7450, loss[loss=0.309, simple_loss=0.3408, pruned_loss=0.1386, over 21415.00 frames. ], tot_loss[loss=0.3555, simple_loss=0.3865, pruned_loss=0.1622, over 4280580.33 frames. ], batch size: 195, lr: 3.79e-02, grad_scale: 32.0 2023-06-17 22:18:16,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=44820.0, ans=0.125 2023-06-17 22:18:21,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-17 22:18:56,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=44940.0, ans=0.125 2023-06-17 22:19:09,755 INFO [train.py:996] (3/4) Epoch 1, batch 7500, loss[loss=0.4217, simple_loss=0.4728, pruned_loss=0.1853, over 21266.00 frames. ], tot_loss[loss=0.3636, simple_loss=0.3945, pruned_loss=0.1663, over 4277500.17 frames. 
], batch size: 549, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 22:19:11,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=45000.0, ans=0.001086956521739131 2023-06-17 22:20:07,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.630e+02 4.462e+02 5.942e+02 1.492e+03, threshold=8.924e+02, percent-clipped=4.0 2023-06-17 22:20:56,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=45240.0, ans=0.0 2023-06-17 22:20:58,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=45240.0, ans=0.125 2023-06-17 22:21:06,589 INFO [train.py:996] (3/4) Epoch 1, batch 7550, loss[loss=0.3225, simple_loss=0.3827, pruned_loss=0.1311, over 21786.00 frames. ], tot_loss[loss=0.3641, simple_loss=0.4015, pruned_loss=0.1634, over 4276780.74 frames. ], batch size: 118, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 22:21:13,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=45300.0, ans=0.125 2023-06-17 22:21:51,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=45420.0, ans=0.035 2023-06-17 22:22:24,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=45540.0, ans=0.0 2023-06-17 22:22:43,402 INFO [train.py:996] (3/4) Epoch 1, batch 7600, loss[loss=0.3549, simple_loss=0.3904, pruned_loss=0.1596, over 21321.00 frames. ], tot_loss[loss=0.3604, simple_loss=0.3989, pruned_loss=0.1609, over 4276975.07 frames. ], batch size: 159, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 22:22:52,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=45600.0, ans=0.0 2023-06-17 22:23:15,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.356e+02 4.120e+02 5.350e+02 1.313e+03, threshold=8.240e+02, percent-clipped=1.0 2023-06-17 22:24:35,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=45840.0, ans=0.125 2023-06-17 22:24:44,507 INFO [train.py:996] (3/4) Epoch 1, batch 7650, loss[loss=0.3975, simple_loss=0.415, pruned_loss=0.19, over 21914.00 frames. ], tot_loss[loss=0.3634, simple_loss=0.3984, pruned_loss=0.1642, over 4283528.44 frames. ], batch size: 414, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 22:26:13,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=46080.0, ans=0.125 2023-06-17 22:26:13,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=46080.0, ans=0.0 2023-06-17 22:26:15,288 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. 
limit=15.0 2023-06-17 22:26:30,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=46140.0, ans=0.2 2023-06-17 22:26:33,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=46140.0, ans=0.0 2023-06-17 22:26:49,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46140.0, ans=0.1 2023-06-17 22:27:09,905 INFO [train.py:996] (3/4) Epoch 1, batch 7700, loss[loss=0.3877, simple_loss=0.4, pruned_loss=0.1877, over 19959.00 frames. ], tot_loss[loss=0.3708, simple_loss=0.4031, pruned_loss=0.1692, over 4290478.80 frames. ], batch size: 702, lr: 3.76e-02, grad_scale: 32.0 2023-06-17 22:28:07,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.716e+02 4.576e+02 5.497e+02 9.392e+02, threshold=9.152e+02, percent-clipped=2.0 2023-06-17 22:28:26,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=46380.0, ans=0.0 2023-06-17 22:28:44,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46380.0, ans=0.1 2023-06-17 22:29:04,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=46440.0, ans=0.2 2023-06-17 22:29:09,552 INFO [train.py:996] (3/4) Epoch 1, batch 7750, loss[loss=0.3808, simple_loss=0.4372, pruned_loss=0.1622, over 21375.00 frames. ], tot_loss[loss=0.3718, simple_loss=0.4074, pruned_loss=0.1681, over 4287009.09 frames. ], batch size: 194, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 22:29:13,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=46500.0, ans=0.0 2023-06-17 22:30:07,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=46620.0, ans=0.1 2023-06-17 22:30:19,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=46620.0, ans=0.125 2023-06-17 22:30:32,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=46680.0, ans=0.0 2023-06-17 22:31:10,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=46740.0, ans=0.125 2023-06-17 22:31:14,887 INFO [train.py:996] (3/4) Epoch 1, batch 7800, loss[loss=0.2664, simple_loss=0.2959, pruned_loss=0.1185, over 21238.00 frames. ], tot_loss[loss=0.3726, simple_loss=0.4089, pruned_loss=0.1682, over 4277646.46 frames. ], batch size: 143, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 22:31:51,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=46800.0, ans=0.125 2023-06-17 22:32:11,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.644e+02 4.604e+02 5.812e+02 1.073e+03, threshold=9.208e+02, percent-clipped=1.0 2023-06-17 22:33:10,962 INFO [train.py:996] (3/4) Epoch 1, batch 7850, loss[loss=0.328, simple_loss=0.3571, pruned_loss=0.1494, over 21691.00 frames. ], tot_loss[loss=0.3662, simple_loss=0.4007, pruned_loss=0.1658, over 4276942.63 frames. 
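
The frequent [scaling.py:182] ScheduledFloat entries trace hyperparameters (dropout probabilities, skip rates, balancer probs, bypass scale minima) whose current value, printed as "ans", is a deterministic function of batch_count. A minimal sketch of such a schedule as piecewise-linear interpolation over batch_count; the real ScheduledFloat in icefall's scaling.py carries more machinery, and the breakpoints below are made up:

    def scheduled_float(batch_count, points):
        # points: [(batch_count, value), ...] sorted by batch_count.
        # Holds the first value before the first breakpoint, interpolates
        # linearly between breakpoints, and holds the last value afterwards.
        if batch_count <= points[0][0]:
            return points[0][1]
        if batch_count >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

    # e.g. a skip rate annealed from 0.2 to 0.0 over the first 20k batches
    # would by now (batch_count=46140.0) log ans=0.0:
    print(scheduled_float(46140.0, [(0.0, 0.2), (20000.0, 0.0)]))
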
], batch size: 333, lr: 3.74e-02, grad_scale: 32.0 2023-06-17 22:33:13,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=47100.0, ans=0.125 2023-06-17 22:33:33,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=8.0 2023-06-17 22:34:11,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=47220.0, ans=0.0 2023-06-17 22:34:18,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=47280.0, ans=0.125 2023-06-17 22:34:19,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=47280.0, ans=0.125 2023-06-17 22:34:48,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=47340.0, ans=0.0 2023-06-17 22:34:50,979 INFO [train.py:996] (3/4) Epoch 1, batch 7900, loss[loss=0.334, simple_loss=0.3477, pruned_loss=0.1601, over 21879.00 frames. ], tot_loss[loss=0.3621, simple_loss=0.3953, pruned_loss=0.1644, over 4266608.69 frames. ], batch size: 98, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 22:35:02,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=47400.0, ans=0.0005652173913043481 2023-06-17 22:35:22,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=47460.0, ans=15.0 2023-06-17 22:35:28,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.540e+02 3.542e+02 4.364e+02 5.493e+02 1.086e+03, threshold=8.728e+02, percent-clipped=7.0 2023-06-17 22:35:50,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=47520.0, ans=0.125 2023-06-17 22:37:15,571 INFO [train.py:996] (3/4) Epoch 1, batch 7950, loss[loss=0.4317, simple_loss=0.4617, pruned_loss=0.2008, over 21695.00 frames. ], tot_loss[loss=0.3662, simple_loss=0.4029, pruned_loss=0.1647, over 4266010.01 frames. ], batch size: 414, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 22:37:58,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=47760.0, ans=0.0 2023-06-17 22:38:40,416 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.36 vs. limit=22.5 2023-06-17 22:39:42,795 INFO [train.py:996] (3/4) Epoch 1, batch 8000, loss[loss=0.3453, simple_loss=0.3481, pruned_loss=0.1712, over 20297.00 frames. ], tot_loss[loss=0.3718, simple_loss=0.4065, pruned_loss=0.1686, over 4264572.69 frames. ], batch size: 703, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 22:39:44,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=48000.0, ans=0.0 2023-06-17 22:40:09,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=48060.0, ans=0.0 2023-06-17 22:40:17,331 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 4.071e+02 5.066e+02 6.362e+02 1.546e+03, threshold=1.013e+03, percent-clipped=8.0 2023-06-17 22:42:17,229 INFO [train.py:996] (3/4) Epoch 1, batch 8050, loss[loss=0.2906, simple_loss=0.3242, pruned_loss=0.1285, over 21399.00 frames. 
], tot_loss[loss=0.3706, simple_loss=0.4061, pruned_loss=0.1675, over 4266066.89 frames. ], batch size: 131, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 22:42:25,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=48300.0, ans=0.125 2023-06-17 22:42:46,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=48360.0, ans=0.2 2023-06-17 22:43:18,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=22.5 2023-06-17 22:43:21,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=48480.0, ans=0.125 2023-06-17 22:44:07,430 INFO [train.py:996] (3/4) Epoch 1, batch 8100, loss[loss=0.3262, simple_loss=0.3648, pruned_loss=0.1439, over 21796.00 frames. ], tot_loss[loss=0.3728, simple_loss=0.4081, pruned_loss=0.1687, over 4261303.39 frames. ], batch size: 247, lr: 3.71e-02, grad_scale: 32.0 2023-06-17 22:44:26,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.49 vs. limit=15.0 2023-06-17 22:45:13,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 3.716e+02 4.690e+02 6.352e+02 1.621e+03, threshold=9.381e+02, percent-clipped=6.0 2023-06-17 22:45:36,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=48780.0, ans=0.1 2023-06-17 22:46:10,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=48840.0, ans=0.125 2023-06-17 22:46:42,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=48840.0, ans=0.2 2023-06-17 22:46:45,262 INFO [train.py:996] (3/4) Epoch 1, batch 8150, loss[loss=0.3451, simple_loss=0.4226, pruned_loss=0.1338, over 21818.00 frames. ], tot_loss[loss=0.3791, simple_loss=0.4167, pruned_loss=0.1707, over 4260615.14 frames. ], batch size: 372, lr: 3.70e-02, grad_scale: 32.0 2023-06-17 22:46:47,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=48900.0, ans=0.07 2023-06-17 22:46:55,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48900.0, ans=0.1 2023-06-17 22:47:09,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=48900.0, ans=0.0 2023-06-17 22:47:26,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=48960.0, ans=0.0002260869565217389 2023-06-17 22:48:01,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=49080.0, ans=0.0 2023-06-17 22:48:34,258 INFO [train.py:996] (3/4) Epoch 1, batch 8200, loss[loss=0.3781, simple_loss=0.3945, pruned_loss=0.1809, over 21619.00 frames. ], tot_loss[loss=0.372, simple_loss=0.4095, pruned_loss=0.1672, over 4256834.91 frames. 
], batch size: 415, lr: 3.70e-02, grad_scale: 32.0 2023-06-17 22:48:52,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=49200.0, ans=0.125 2023-06-17 22:49:19,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.379e+02 3.766e+02 4.530e+02 5.768e+02 1.043e+03, threshold=9.060e+02, percent-clipped=2.0 2023-06-17 22:49:26,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.14 vs. limit=15.0 2023-06-17 22:49:42,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=49380.0, ans=0.0 2023-06-17 22:50:06,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.47 vs. limit=10.0 2023-06-17 22:50:11,990 INFO [train.py:996] (3/4) Epoch 1, batch 8250, loss[loss=0.3538, simple_loss=0.3724, pruned_loss=0.1676, over 21298.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.407, pruned_loss=0.1668, over 4255752.29 frames. ], batch size: 608, lr: 3.69e-02, grad_scale: 32.0 2023-06-17 22:50:41,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=49560.0, ans=0.125 2023-06-17 22:51:02,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=49620.0, ans=0.04949747468305833 2023-06-17 22:51:30,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=49680.0, ans=0.0 2023-06-17 22:51:55,150 INFO [train.py:996] (3/4) Epoch 1, batch 8300, loss[loss=0.333, simple_loss=0.3905, pruned_loss=0.1377, over 21699.00 frames. ], tot_loss[loss=0.3615, simple_loss=0.4012, pruned_loss=0.1609, over 4259524.75 frames. ], batch size: 351, lr: 3.68e-02, grad_scale: 32.0 2023-06-17 22:52:16,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=49860.0, ans=0.125 2023-06-17 22:52:27,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=49860.0, ans=0.0 2023-06-17 22:52:34,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 3.435e+02 4.224e+02 5.428e+02 8.537e+02, threshold=8.449e+02, percent-clipped=0.0 2023-06-17 22:52:50,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=49980.0, ans=0.125 2023-06-17 22:53:10,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=50040.0, ans=0.0 2023-06-17 22:53:32,139 INFO [train.py:996] (3/4) Epoch 1, batch 8350, loss[loss=0.3059, simple_loss=0.3559, pruned_loss=0.1279, over 21799.00 frames. ], tot_loss[loss=0.3542, simple_loss=0.3967, pruned_loss=0.1558, over 4261756.98 frames. 
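
The [scaling.py:962] Whitening entries report, for a named activation, a measured "metric" against a configured "limit". The metric is a whiteness measure of the per-group channel covariance: it is 1.0 when the covariance is a multiple of the identity and grows as the eigenvalue spectrum becomes more uneven, and the module only pushes activations toward whiteness when the limit is exceeded. One standard measure with exactly these properties, as a sketch (the precise formula in scaling.py may differ):

    import torch

    def whitening_metric(x, num_groups=1):
        # x: (num_frames, num_channels). Returns a scalar >= 1 that equals
        # 1.0 exactly when each group's channel covariance C is proportional
        # to the identity: metric = d * trace(C @ C) / trace(C) ** 2.
        n, c = x.shape
        d = c // num_groups
        xg = x.reshape(n, num_groups, d).transpose(0, 1)   # (groups, n, d)
        cov = torch.matmul(xg.transpose(1, 2), xg) / n     # (groups, d, d)
        num = d * (cov * cov).sum(dim=(1, 2))              # d * trace(C @ C)
        den = torch.diagonal(cov, dim1=1, dim2=2).sum(dim=1) ** 2
        return (num / den).mean()

    # Near 1 for white noise; large for strongly correlated channels.
    print(whitening_metric(torch.randn(1000, 256)))
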
], batch size: 118, lr: 3.68e-02, grad_scale: 32.0 2023-06-17 22:53:32,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=50100.0, ans=0.125 2023-06-17 22:53:35,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=50100.0, ans=0.0 2023-06-17 22:54:06,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=22.5 2023-06-17 22:55:15,330 INFO [train.py:996] (3/4) Epoch 1, batch 8400, loss[loss=0.1974, simple_loss=0.2458, pruned_loss=0.07445, over 16674.00 frames. ], tot_loss[loss=0.3476, simple_loss=0.3922, pruned_loss=0.1515, over 4253150.06 frames. ], batch size: 62, lr: 3.67e-02, grad_scale: 32.0 2023-06-17 22:55:50,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=50460.0, ans=0.0 2023-06-17 22:55:58,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-17 22:56:05,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 4.068e+02 5.029e+02 6.288e+02 1.067e+03, threshold=1.006e+03, percent-clipped=6.0 2023-06-17 22:56:57,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=50640.0, ans=0.125 2023-06-17 22:57:03,216 INFO [train.py:996] (3/4) Epoch 1, batch 8450, loss[loss=0.4361, simple_loss=0.4821, pruned_loss=0.1951, over 20798.00 frames. ], tot_loss[loss=0.3496, simple_loss=0.393, pruned_loss=0.1531, over 4262878.67 frames. ], batch size: 607, lr: 3.67e-02, grad_scale: 32.0 2023-06-17 22:57:36,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=50760.0, ans=0.125 2023-06-17 22:58:01,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=50820.0, ans=0.2 2023-06-17 22:58:15,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.98 vs. limit=10.0 2023-06-17 22:58:47,896 INFO [train.py:996] (3/4) Epoch 1, batch 8500, loss[loss=0.3947, simple_loss=0.3825, pruned_loss=0.2034, over 21505.00 frames. ], tot_loss[loss=0.3503, simple_loss=0.3897, pruned_loss=0.1554, over 4268048.08 frames. ], batch size: 511, lr: 3.66e-02, grad_scale: 32.0 2023-06-17 22:58:49,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=51000.0, ans=0.2 2023-06-17 22:59:03,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=51000.0, ans=22.5 2023-06-17 22:59:26,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=51060.0, ans=0.125 2023-06-17 22:59:27,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.628e+02 4.473e+02 5.603e+02 9.273e+02, threshold=8.945e+02, percent-clipped=0.0 2023-06-17 22:59:35,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.61 vs. 
limit=15.0 2023-06-17 22:59:38,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=51120.0, ans=0.125 2023-06-17 23:00:31,210 INFO [train.py:996] (3/4) Epoch 1, batch 8550, loss[loss=0.4075, simple_loss=0.4527, pruned_loss=0.1811, over 21843.00 frames. ], tot_loss[loss=0.3561, simple_loss=0.3944, pruned_loss=0.1589, over 4271380.11 frames. ], batch size: 371, lr: 3.65e-02, grad_scale: 32.0 2023-06-17 23:00:39,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=51300.0, ans=0.5 2023-06-17 23:00:48,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=51360.0, ans=0.0 2023-06-17 23:01:05,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=51420.0, ans=0.125 2023-06-17 23:01:23,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51420.0, ans=0.1 2023-06-17 23:01:30,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-17 23:01:56,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=51480.0, ans=0.125 2023-06-17 23:02:29,161 INFO [train.py:996] (3/4) Epoch 1, batch 8600, loss[loss=0.4175, simple_loss=0.4389, pruned_loss=0.1981, over 21343.00 frames. ], tot_loss[loss=0.3625, simple_loss=0.4025, pruned_loss=0.1613, over 4275197.26 frames. ], batch size: 548, lr: 3.65e-02, grad_scale: 32.0 2023-06-17 23:03:10,040 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.952e+02 4.835e+02 6.888e+02 1.478e+03, threshold=9.670e+02, percent-clipped=13.0 2023-06-17 23:03:13,563 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:04:08,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=51840.0, ans=0.0 2023-06-17 23:04:29,714 INFO [train.py:996] (3/4) Epoch 1, batch 8650, loss[loss=0.2938, simple_loss=0.3694, pruned_loss=0.1092, over 21823.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.4121, pruned_loss=0.1643, over 4276204.42 frames. ], batch size: 316, lr: 3.64e-02, grad_scale: 32.0 2023-06-17 23:04:47,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=51900.0, ans=0.0 2023-06-17 23:05:05,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=51960.0, ans=0.1 2023-06-17 23:06:19,698 INFO [train.py:996] (3/4) Epoch 1, batch 8700, loss[loss=0.3012, simple_loss=0.3421, pruned_loss=0.1301, over 21389.00 frames. ], tot_loss[loss=0.3619, simple_loss=0.4061, pruned_loss=0.1589, over 4267715.68 frames. 
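
The per-batch loss lines decompose the pruned-transducer objective into simple_loss and pruned_loss, and throughout this section the totals satisfy loss = 0.5 * simple_loss + pruned_loss, i.e. the pruned term is at full weight now that training is well past warm-up. A worked check against batch 8600 above (the scales are read off the logged numbers, not the code):

    def combined_transducer_loss(simple_loss, pruned_loss,
                                 simple_scale=0.5, pruned_scale=1.0):
        # After warm-up the pruned-loss scale reaches 1.0, so the logged
        # totals are 0.5 * simple_loss + pruned_loss.
        return simple_scale * simple_loss + pruned_scale * pruned_loss

    # Batch 8600: 0.5 * 0.4025 + 0.1613 = 0.3626, matching the logged 0.3625
    assert abs(combined_transducer_loss(0.4025, 0.1613) - 0.3625) < 1e-3
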
], batch size: 131, lr: 3.64e-02, grad_scale: 32.0 2023-06-17 23:06:48,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=52260.0, ans=0.125 2023-06-17 23:06:59,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 3.304e+02 4.220e+02 5.103e+02 8.221e+02, threshold=8.441e+02, percent-clipped=0.0 2023-06-17 23:07:04,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=52320.0, ans=0.125 2023-06-17 23:07:24,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=52380.0, ans=0.125 2023-06-17 23:07:58,798 INFO [train.py:996] (3/4) Epoch 1, batch 8750, loss[loss=0.3722, simple_loss=0.3947, pruned_loss=0.1748, over 21832.00 frames. ], tot_loss[loss=0.3619, simple_loss=0.4012, pruned_loss=0.1614, over 4279009.07 frames. ], batch size: 282, lr: 3.63e-02, grad_scale: 32.0 2023-06-17 23:08:23,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=52560.0, ans=0.125 2023-06-17 23:08:28,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-06-17 23:08:36,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=52560.0, ans=0.125 2023-06-17 23:08:42,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=52620.0, ans=0.0 2023-06-17 23:08:43,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=52620.0, ans=0.0 2023-06-17 23:08:58,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=52680.0, ans=0.2 2023-06-17 23:09:54,980 INFO [train.py:996] (3/4) Epoch 1, batch 8800, loss[loss=0.4173, simple_loss=0.4462, pruned_loss=0.1942, over 21191.00 frames. ], tot_loss[loss=0.3718, simple_loss=0.4101, pruned_loss=0.1667, over 4277661.70 frames. ], batch size: 143, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 23:10:45,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 4.248e+02 5.624e+02 7.492e+02 1.328e+03, threshold=1.125e+03, percent-clipped=14.0 2023-06-17 23:11:52,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53040.0, ans=0.1 2023-06-17 23:11:59,497 INFO [train.py:996] (3/4) Epoch 1, batch 8850, loss[loss=0.3184, simple_loss=0.3689, pruned_loss=0.1339, over 21179.00 frames. ], tot_loss[loss=0.3782, simple_loss=0.418, pruned_loss=0.1692, over 4268320.62 frames. ], batch size: 143, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 23:12:02,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=53100.0, ans=0.5 2023-06-17 23:12:41,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.41 vs. limit=5.0 2023-06-17 23:13:10,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.81 vs. 
limit=22.5 2023-06-17 23:13:22,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=53340.0, ans=0.0 2023-06-17 23:13:30,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=53340.0, ans=0.125 2023-06-17 23:13:38,033 INFO [train.py:996] (3/4) Epoch 1, batch 8900, loss[loss=0.3219, simple_loss=0.3658, pruned_loss=0.139, over 21783.00 frames. ], tot_loss[loss=0.3749, simple_loss=0.413, pruned_loss=0.1684, over 4265016.97 frames. ], batch size: 102, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 23:14:01,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=22.5 2023-06-17 23:14:12,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 3.765e+02 4.627e+02 5.813e+02 9.173e+02, threshold=9.253e+02, percent-clipped=0.0 2023-06-17 23:15:24,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=53580.0, ans=0.0 2023-06-17 23:15:46,982 INFO [train.py:996] (3/4) Epoch 1, batch 8950, loss[loss=0.3916, simple_loss=0.4264, pruned_loss=0.1784, over 21635.00 frames. ], tot_loss[loss=0.3725, simple_loss=0.4138, pruned_loss=0.1656, over 4253402.12 frames. ], batch size: 414, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 23:15:51,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=53700.0, ans=0.125 2023-06-17 23:16:12,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=53700.0, ans=0.0 2023-06-17 23:16:15,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=53760.0, ans=0.125 2023-06-17 23:17:16,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=53880.0, ans=0.04949747468305833 2023-06-17 23:17:50,561 INFO [train.py:996] (3/4) Epoch 1, batch 9000, loss[loss=0.3508, simple_loss=0.3775, pruned_loss=0.1621, over 21905.00 frames. ], tot_loss[loss=0.3677, simple_loss=0.406, pruned_loss=0.1647, over 4255930.00 frames. ], batch size: 107, lr: 3.60e-02, grad_scale: 32.0 2023-06-17 23:17:50,561 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-17 23:18:41,567 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3222, simple_loss=0.4116, pruned_loss=0.1164, over 1796401.00 frames. 2023-06-17 23:18:41,568 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-17 23:19:27,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.702e+02 3.716e+02 4.512e+02 5.914e+02 1.006e+03, threshold=9.023e+02, percent-clipped=2.0 2023-06-17 23:19:36,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=54120.0, ans=0.125 2023-06-17 23:19:44,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. 
limit=6.0 2023-06-17 23:20:03,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=54180.0, ans=0.125 2023-06-17 23:20:43,178 INFO [train.py:996] (3/4) Epoch 1, batch 9050, loss[loss=0.4726, simple_loss=0.4927, pruned_loss=0.2263, over 21807.00 frames. ], tot_loss[loss=0.359, simple_loss=0.4002, pruned_loss=0.1589, over 4249458.54 frames. ], batch size: 118, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 23:21:07,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54300.0, ans=0.1 2023-06-17 23:22:13,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-17 23:22:27,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54480.0, ans=0.1 2023-06-17 23:22:54,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-17 23:22:56,138 INFO [train.py:996] (3/4) Epoch 1, batch 9100, loss[loss=0.3369, simple_loss=0.402, pruned_loss=0.1359, over 21696.00 frames. ], tot_loss[loss=0.3664, simple_loss=0.4069, pruned_loss=0.163, over 4250326.76 frames. ], batch size: 298, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 23:23:14,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=54600.0, ans=0.125 2023-06-17 23:23:30,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.462e+02 4.314e+02 5.762e+02 1.601e+03, threshold=8.627e+02, percent-clipped=9.0 2023-06-17 23:23:50,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=54720.0, ans=0.125 2023-06-17 23:23:51,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=54780.0, ans=0.0 2023-06-17 23:23:59,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=54780.0, ans=0.2 2023-06-17 23:23:59,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=54780.0, ans=0.0 2023-06-17 23:24:06,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=22.5 2023-06-17 23:24:33,673 INFO [train.py:996] (3/4) Epoch 1, batch 9150, loss[loss=0.3564, simple_loss=0.4185, pruned_loss=0.1471, over 21799.00 frames. ], tot_loss[loss=0.3593, simple_loss=0.4051, pruned_loss=0.1567, over 4259945.87 frames. 
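
Every batch line also reports grad_scale, the fp16 loss-scaling factor: it sat near 1.0 at the start of training and has grown to 32.0 here, the usual dynamic loss-scaling behavior (grow while steps stay finite, shrink on overflow). A sketch using the stock PyTorch scaler, which follows the same pattern as this trainer's scaling loop (illustrative, not the trainer's actual code):

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=1.0)

    def fp16_train_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():          # fp16 forward on CUDA
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()            # backward on the scaled loss
        scaler.step(optimizer)                   # unscales; skips step on inf/nan
        scaler.update()                          # grows or shrinks grad_scale
        return loss.detach(), scaler.get_scale()
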
], batch size: 351, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 23:24:45,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=54900.0, ans=0.125 2023-06-17 23:24:49,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=54900.0, ans=0.05 2023-06-17 23:25:07,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=55020.0, ans=0.125 2023-06-17 23:26:02,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.34 vs. limit=15.0 2023-06-17 23:26:08,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55080.0, ans=0.1 2023-06-17 23:26:19,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-17 23:26:21,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=55140.0, ans=0.125 2023-06-17 23:26:28,794 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:26:45,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=55140.0, ans=0.2 2023-06-17 23:26:47,972 INFO [train.py:996] (3/4) Epoch 1, batch 9200, loss[loss=0.3685, simple_loss=0.4117, pruned_loss=0.1627, over 21294.00 frames. ], tot_loss[loss=0.3603, simple_loss=0.4079, pruned_loss=0.1563, over 4266355.44 frames. ], batch size: 176, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 23:27:24,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.410e+02 3.827e+02 4.968e+02 6.967e+02 1.252e+03, threshold=9.935e+02, percent-clipped=13.0 2023-06-17 23:27:29,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=55320.0, ans=0.125 2023-06-17 23:28:27,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=55440.0, ans=0.125 2023-06-17 23:28:47,260 INFO [train.py:996] (3/4) Epoch 1, batch 9250, loss[loss=0.3492, simple_loss=0.3838, pruned_loss=0.1573, over 21853.00 frames. ], tot_loss[loss=0.369, simple_loss=0.4127, pruned_loss=0.1627, over 4267389.77 frames. ], batch size: 118, lr: 3.57e-02, grad_scale: 16.0 2023-06-17 23:29:22,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=55620.0, ans=0.0 2023-06-17 23:29:58,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55680.0, ans=0.1 2023-06-17 23:30:13,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=55740.0, ans=0.2 2023-06-17 23:30:26,449 INFO [train.py:996] (3/4) Epoch 1, batch 9300, loss[loss=0.4222, simple_loss=0.4499, pruned_loss=0.1973, over 20619.00 frames. ], tot_loss[loss=0.3657, simple_loss=0.4056, pruned_loss=0.1629, over 4259622.37 frames. 
], batch size: 607, lr: 3.57e-02, grad_scale: 16.0 2023-06-17 23:30:49,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=55860.0, ans=0.04949747468305833 2023-06-17 23:31:10,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.819e+02 4.444e+02 5.472e+02 6.937e+02 1.249e+03, threshold=1.094e+03, percent-clipped=7.0 2023-06-17 23:32:11,930 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-17 23:32:35,226 INFO [train.py:996] (3/4) Epoch 1, batch 9350, loss[loss=0.3935, simple_loss=0.4324, pruned_loss=0.1773, over 21629.00 frames. ], tot_loss[loss=0.3737, simple_loss=0.4142, pruned_loss=0.1666, over 4262771.00 frames. ], batch size: 230, lr: 3.56e-02, grad_scale: 16.0 2023-06-17 23:34:24,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=56340.0, ans=0.04949747468305833 2023-06-17 23:34:32,810 INFO [train.py:996] (3/4) Epoch 1, batch 9400, loss[loss=0.3696, simple_loss=0.4171, pruned_loss=0.161, over 21716.00 frames. ], tot_loss[loss=0.3748, simple_loss=0.4155, pruned_loss=0.167, over 4262657.16 frames. ], batch size: 332, lr: 3.55e-02, grad_scale: 16.0 2023-06-17 23:34:54,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.86 vs. limit=22.5 2023-06-17 23:34:58,488 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:35:03,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.70 vs. limit=10.0 2023-06-17 23:35:08,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=56460.0, ans=0.1 2023-06-17 23:35:08,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=56460.0, ans=0.0 2023-06-17 23:35:10,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=56460.0, ans=0.09899494936611666 2023-06-17 23:35:13,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-06-17 23:35:26,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.466e+02 4.550e+02 6.027e+02 9.606e+02, threshold=9.099e+02, percent-clipped=0.0 2023-06-17 23:35:57,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=56580.0, ans=0.0 2023-06-17 23:36:18,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=56640.0, ans=0.125 2023-06-17 23:36:22,976 INFO [train.py:996] (3/4) Epoch 1, batch 9450, loss[loss=0.3302, simple_loss=0.3579, pruned_loss=0.1513, over 21213.00 frames. ], tot_loss[loss=0.3672, simple_loss=0.4055, pruned_loss=0.1644, over 4257202.63 frames. 
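
The occasional [scaling.py:1052] WithLoss entries attach an auxiliary penalty to a watched tensor (here various self_attn_weights) and report its current sum; loss-sum=0.000e+00 means the watched values are inside their allowed range, so the penalty contributes nothing, not that the hook is inactive. A sketch of such a hinge-style penalty, with the constants and helper name illustrative rather than read from scaling.py:

    import torch

    def abs_value_penalty(x, limit=25.0, penalty_scale=1.0e-04):
        # Zero while |x| stays under `limit`, growing linearly beyond it.
        return penalty_scale * (x.abs() - limit).clamp(min=0.0).sum()

    scores = 2.0 * torch.randn(4, 8, 100, 100)
    print(abs_value_penalty(scores))   # tensor(0.) for in-range values
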
], batch size: 159, lr: 3.55e-02, grad_scale: 16.0 2023-06-17 23:36:35,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=56700.0, ans=0.125 2023-06-17 23:36:50,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=56760.0, ans=0.1 2023-06-17 23:37:17,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.71 vs. limit=22.5 2023-06-17 23:38:00,025 INFO [train.py:996] (3/4) Epoch 1, batch 9500, loss[loss=0.3278, simple_loss=0.3638, pruned_loss=0.1459, over 21763.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.396, pruned_loss=0.1596, over 4262985.35 frames. ], batch size: 112, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 23:38:09,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=57000.0, ans=0.125 2023-06-17 23:38:10,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=57000.0, ans=0.125 2023-06-17 23:38:50,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 3.365e+02 4.382e+02 5.694e+02 1.167e+03, threshold=8.764e+02, percent-clipped=2.0 2023-06-17 23:39:11,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-06-17 23:40:02,009 INFO [train.py:996] (3/4) Epoch 1, batch 9550, loss[loss=0.4265, simple_loss=0.4481, pruned_loss=0.2025, over 21584.00 frames. ], tot_loss[loss=0.3641, simple_loss=0.401, pruned_loss=0.1637, over 4269246.44 frames. ], batch size: 389, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 23:40:13,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=57300.0, ans=0.0 2023-06-17 23:40:20,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57360.0, ans=0.1 2023-06-17 23:40:57,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=57420.0, ans=0.125 2023-06-17 23:41:12,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-17 23:41:59,560 INFO [train.py:996] (3/4) Epoch 1, batch 9600, loss[loss=0.3713, simple_loss=0.402, pruned_loss=0.1703, over 17595.00 frames. ], tot_loss[loss=0.3697, simple_loss=0.4058, pruned_loss=0.1668, over 4272120.33 frames. ], batch size: 60, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 23:42:08,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57600.0, ans=0.1 2023-06-17 23:42:54,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.734e+02 4.517e+02 6.077e+02 1.128e+03, threshold=9.035e+02, percent-clipped=4.0 2023-06-17 23:43:32,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=57840.0, ans=0.125 2023-06-17 23:43:54,540 INFO [train.py:996] (3/4) Epoch 1, batch 9650, loss[loss=0.3777, simple_loss=0.4164, pruned_loss=0.1695, over 21333.00 frames. 
], tot_loss[loss=0.3697, simple_loss=0.4062, pruned_loss=0.1667, over 4277670.73 frames. ], batch size: 176, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 23:43:55,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=57900.0, ans=0.2 2023-06-17 23:45:13,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=58020.0, ans=0.125 2023-06-17 23:45:40,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=58140.0, ans=0.0 2023-06-17 23:45:52,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=58140.0, ans=0.0 2023-06-17 23:45:54,821 INFO [train.py:996] (3/4) Epoch 1, batch 9700, loss[loss=0.3203, simple_loss=0.3595, pruned_loss=0.1406, over 21273.00 frames. ], tot_loss[loss=0.3686, simple_loss=0.4075, pruned_loss=0.1649, over 4279146.96 frames. ], batch size: 159, lr: 3.52e-02, grad_scale: 32.0 2023-06-17 23:45:59,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=58200.0, ans=0.125 2023-06-17 23:46:40,098 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:46:40,980 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.561e+02 4.162e+02 5.499e+02 1.221e+03, threshold=8.324e+02, percent-clipped=3.0 2023-06-17 23:46:42,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=58320.0, ans=0.125 2023-06-17 23:47:19,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.31 vs. limit=10.0 2023-06-17 23:47:37,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=58440.0, ans=0.0 2023-06-17 23:47:54,050 INFO [train.py:996] (3/4) Epoch 1, batch 9750, loss[loss=0.3704, simple_loss=0.3765, pruned_loss=0.1821, over 21206.00 frames. ], tot_loss[loss=0.3614, simple_loss=0.3979, pruned_loss=0.1624, over 4275349.88 frames. ], batch size: 471, lr: 3.51e-02, grad_scale: 32.0 2023-06-17 23:48:37,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=58560.0, ans=0.125 2023-06-17 23:49:25,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.15 vs. limit=15.0 2023-06-17 23:49:43,173 INFO [train.py:996] (3/4) Epoch 1, batch 9800, loss[loss=0.3377, simple_loss=0.3774, pruned_loss=0.149, over 21657.00 frames. ], tot_loss[loss=0.3582, simple_loss=0.3961, pruned_loss=0.1601, over 4269120.49 frames. ], batch size: 263, lr: 3.51e-02, grad_scale: 32.0 2023-06-17 23:50:48,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 3.805e+02 4.466e+02 5.878e+02 9.239e+02, threshold=8.932e+02, percent-clipped=2.0 2023-06-17 23:51:12,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=15.0 2023-06-17 23:51:44,813 INFO [train.py:996] (3/4) Epoch 1, batch 9850, loss[loss=0.4108, simple_loss=0.4398, pruned_loss=0.1909, over 20742.00 frames. 
], tot_loss[loss=0.3569, simple_loss=0.3933, pruned_loss=0.1603, over 4271729.20 frames. ], batch size: 607, lr: 3.50e-02, grad_scale: 32.0 2023-06-17 23:51:59,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=59100.0, ans=0.125 2023-06-17 23:52:13,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=59160.0, ans=0.125 2023-06-17 23:52:17,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=59160.0, ans=0.2 2023-06-17 23:52:54,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=59280.0, ans=0.0 2023-06-17 23:53:25,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=59340.0, ans=0.0 2023-06-17 23:53:25,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=59340.0, ans=0.0 2023-06-17 23:53:42,085 INFO [train.py:996] (3/4) Epoch 1, batch 9900, loss[loss=0.3647, simple_loss=0.4011, pruned_loss=0.1642, over 21393.00 frames. ], tot_loss[loss=0.3547, simple_loss=0.3896, pruned_loss=0.1599, over 4268045.61 frames. ], batch size: 211, lr: 3.50e-02, grad_scale: 32.0 2023-06-17 23:53:47,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-17 23:54:37,609 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.479e+02 4.439e+02 5.469e+02 9.869e+02, threshold=8.878e+02, percent-clipped=4.0 2023-06-17 23:54:40,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=59520.0, ans=0.0 2023-06-17 23:54:42,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=59520.0, ans=0.04949747468305833 2023-06-17 23:55:07,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=59580.0, ans=0.05 2023-06-17 23:55:17,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=59640.0, ans=0.0 2023-06-17 23:55:27,583 INFO [train.py:996] (3/4) Epoch 1, batch 9950, loss[loss=0.3562, simple_loss=0.3756, pruned_loss=0.1684, over 22017.00 frames. ], tot_loss[loss=0.357, simple_loss=0.3905, pruned_loss=0.1618, over 4277290.47 frames. ], batch size: 375, lr: 3.49e-02, grad_scale: 32.0 2023-06-17 23:56:13,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.35 vs. limit=15.0 2023-06-17 23:56:40,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=59880.0, ans=0.0 2023-06-17 23:56:50,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.96 vs. 
limit=15.0 2023-06-17 23:57:16,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=59940.0, ans=0.125 2023-06-17 23:57:26,286 INFO [train.py:996] (3/4) Epoch 1, batch 10000, loss[loss=0.3498, simple_loss=0.3914, pruned_loss=0.1541, over 21636.00 frames. ], tot_loss[loss=0.3522, simple_loss=0.3869, pruned_loss=0.1588, over 4275606.22 frames. ], batch size: 351, lr: 3.49e-02, grad_scale: 32.0 2023-06-17 23:58:22,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.612e+02 3.683e+02 4.306e+02 5.455e+02 9.094e+02, threshold=8.612e+02, percent-clipped=1.0 2023-06-17 23:59:28,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=60240.0, ans=0.125 2023-06-17 23:59:40,477 INFO [train.py:996] (3/4) Epoch 1, batch 10050, loss[loss=0.3172, simple_loss=0.3567, pruned_loss=0.1389, over 21250.00 frames. ], tot_loss[loss=0.356, simple_loss=0.39, pruned_loss=0.161, over 4271336.70 frames. ], batch size: 159, lr: 3.48e-02, grad_scale: 32.0 2023-06-17 23:59:41,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2023-06-18 00:01:08,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=60480.0, ans=0.0 2023-06-18 00:02:00,513 INFO [train.py:996] (3/4) Epoch 1, batch 10100, loss[loss=0.3742, simple_loss=0.4155, pruned_loss=0.1664, over 21707.00 frames. ], tot_loss[loss=0.3509, simple_loss=0.3871, pruned_loss=0.1573, over 4274758.34 frames. ], batch size: 351, lr: 3.47e-02, grad_scale: 32.0 2023-06-18 00:02:09,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60600.0, ans=0.1 2023-06-18 00:02:45,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 3.604e+02 4.387e+02 5.436e+02 8.331e+02, threshold=8.774e+02, percent-clipped=0.0 2023-06-18 00:03:11,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-18 00:03:25,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=60780.0, ans=0.125 2023-06-18 00:03:25,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=60780.0, ans=0.125 2023-06-18 00:04:07,994 INFO [train.py:996] (3/4) Epoch 1, batch 10150, loss[loss=0.3223, simple_loss=0.3674, pruned_loss=0.1386, over 21682.00 frames. ], tot_loss[loss=0.3567, simple_loss=0.3923, pruned_loss=0.1606, over 4263606.00 frames. ], batch size: 247, lr: 3.47e-02, grad_scale: 32.0 2023-06-18 00:04:43,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=61020.0, ans=10.0 2023-06-18 00:04:49,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.55 vs. 
limit=22.5 2023-06-18 00:04:55,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=61020.0, ans=0.125 2023-06-18 00:05:31,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61140.0, ans=0.1 2023-06-18 00:05:46,100 INFO [train.py:996] (3/4) Epoch 1, batch 10200, loss[loss=0.2935, simple_loss=0.3608, pruned_loss=0.1131, over 21725.00 frames. ], tot_loss[loss=0.3485, simple_loss=0.3876, pruned_loss=0.1547, over 4270859.14 frames. ], batch size: 332, lr: 3.46e-02, grad_scale: 32.0 2023-06-18 00:06:21,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 3.332e+02 4.219e+02 5.598e+02 1.332e+03, threshold=8.438e+02, percent-clipped=6.0 2023-06-18 00:06:51,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-18 00:07:23,919 INFO [train.py:996] (3/4) Epoch 1, batch 10250, loss[loss=0.3951, simple_loss=0.433, pruned_loss=0.1786, over 21615.00 frames. ], tot_loss[loss=0.3393, simple_loss=0.383, pruned_loss=0.1478, over 4262234.33 frames. ], batch size: 389, lr: 3.46e-02, grad_scale: 16.0 2023-06-18 00:08:47,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61740.0, ans=0.1 2023-06-18 00:08:52,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=61740.0, ans=0.125 2023-06-18 00:09:09,888 INFO [train.py:996] (3/4) Epoch 1, batch 10300, loss[loss=0.4113, simple_loss=0.4709, pruned_loss=0.1758, over 21691.00 frames. ], tot_loss[loss=0.3472, simple_loss=0.3907, pruned_loss=0.1518, over 4270556.31 frames. ], batch size: 441, lr: 3.45e-02, grad_scale: 16.0 2023-06-18 00:09:10,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.04 vs. limit=22.5 2023-06-18 00:09:19,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=61800.0, ans=0.0 2023-06-18 00:09:50,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.49 vs. 
limit=6.0 2023-06-18 00:09:52,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61860.0, ans=0.1 2023-06-18 00:10:24,897 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 3.585e+02 4.803e+02 7.199e+02 1.796e+03, threshold=9.605e+02, percent-clipped=14.0 2023-06-18 00:10:46,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=61980.0, ans=0.125 2023-06-18 00:11:05,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=62040.0, ans=0.125 2023-06-18 00:11:19,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=62040.0, ans=0.5 2023-06-18 00:11:26,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=62100.0, ans=0.0 2023-06-18 00:11:26,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-18 00:11:27,155 INFO [train.py:996] (3/4) Epoch 1, batch 10350, loss[loss=0.2825, simple_loss=0.3277, pruned_loss=0.1186, over 21393.00 frames. ], tot_loss[loss=0.3462, simple_loss=0.3902, pruned_loss=0.151, over 4267559.50 frames. ], batch size: 211, lr: 3.45e-02, grad_scale: 16.0 2023-06-18 00:11:42,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=62160.0, ans=0.125 2023-06-18 00:12:12,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=62220.0, ans=0.125 2023-06-18 00:12:27,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62220.0, ans=0.1 2023-06-18 00:13:08,634 INFO [train.py:996] (3/4) Epoch 1, batch 10400, loss[loss=0.3258, simple_loss=0.3872, pruned_loss=0.1322, over 21279.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3792, pruned_loss=0.1458, over 4265004.19 frames. ], batch size: 551, lr: 3.44e-02, grad_scale: 32.0 2023-06-18 00:13:12,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=62400.0, ans=0.2 2023-06-18 00:13:51,649 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.668e+02 4.396e+02 5.299e+02 1.049e+03, threshold=8.792e+02, percent-clipped=2.0 2023-06-18 00:13:54,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.93 vs. limit=22.5 2023-06-18 00:14:04,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=62520.0, ans=0.0 2023-06-18 00:14:08,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=62520.0, ans=0.2 2023-06-18 00:14:44,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=62640.0, ans=0.125 2023-06-18 00:14:48,065 INFO [train.py:996] (3/4) Epoch 1, batch 10450, loss[loss=0.3656, simple_loss=0.4097, pruned_loss=0.1608, over 21707.00 frames. ], tot_loss[loss=0.3449, simple_loss=0.3861, pruned_loss=0.1518, over 4262900.37 frames. 
2023-06-18 00:16:20,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=62880.0, ans=0.125
2023-06-18 00:16:56,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=62940.0, ans=0.0
2023-06-18 00:16:59,889 INFO [train.py:996] (3/4) Epoch 1, batch 10500, loss[loss=0.3598, simple_loss=0.3862, pruned_loss=0.1667, over 21436.00 frames. ], tot_loss[loss=0.3416, simple_loss=0.3835, pruned_loss=0.1498, over 4261066.20 frames. ], batch size: 389, lr: 3.43e-02, grad_scale: 16.0
2023-06-18 00:17:54,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.435e+02 3.581e+02 4.532e+02 5.617e+02 1.542e+03, threshold=9.064e+02, percent-clipped=4.0
2023-06-18 00:17:59,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=63120.0, ans=0.125
2023-06-18 00:18:22,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.45 vs. limit=15.0
2023-06-18 00:18:25,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=63180.0, ans=0.125
2023-06-18 00:18:42,286 INFO [train.py:996] (3/4) Epoch 1, batch 10550, loss[loss=0.3284, simple_loss=0.3556, pruned_loss=0.1505, over 21921.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3782, pruned_loss=0.1498, over 4257595.05 frames. ], batch size: 373, lr: 3.43e-02, grad_scale: 16.0
2023-06-18 00:18:57,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=63300.0, ans=0.125
2023-06-18 00:19:50,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=63480.0, ans=0.125
2023-06-18 00:20:24,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=63540.0, ans=0.125
2023-06-18 00:20:41,287 INFO [train.py:996] (3/4) Epoch 1, batch 10600, loss[loss=0.4049, simple_loss=0.4541, pruned_loss=0.1778, over 21408.00 frames. ], tot_loss[loss=0.3339, simple_loss=0.3738, pruned_loss=0.147, over 4254328.87 frames. ], batch size: 507, lr: 3.42e-02, grad_scale: 16.0
2023-06-18 00:21:20,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=63660.0, ans=0.0
2023-06-18 00:21:33,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0
2023-06-18 00:21:42,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0
2023-06-18 00:21:42,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.306e+02 3.242e+02 3.917e+02 4.950e+02 7.459e+02, threshold=7.834e+02, percent-clipped=0.0
2023-06-18 00:21:50,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=63720.0, ans=0.125
2023-06-18 00:22:21,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=63780.0, ans=0.125
2023-06-18 00:23:04,592 INFO [train.py:996] (3/4) Epoch 1, batch 10650, loss[loss=0.328, simple_loss=0.3985, pruned_loss=0.1288, over 21191.00 frames. ], tot_loss[loss=0.3313, simple_loss=0.3747, pruned_loss=0.1439, over 4248758.10 frames. ], batch size: 549, lr: 3.41e-02, grad_scale: 16.0
2023-06-18 00:23:42,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=15.0
2023-06-18 00:23:44,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=64020.0, ans=0.125
2023-06-18 00:24:00,706 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 00:24:54,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=64140.0, ans=0.125
2023-06-18 00:25:14,492 INFO [train.py:996] (3/4) Epoch 1, batch 10700, loss[loss=0.4023, simple_loss=0.4441, pruned_loss=0.1802, over 21460.00 frames. ], tot_loss[loss=0.3351, simple_loss=0.377, pruned_loss=0.1466, over 4250686.94 frames. ], batch size: 131, lr: 3.41e-02, grad_scale: 16.0
2023-06-18 00:25:14,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=64200.0, ans=0.07
2023-06-18 00:25:24,089 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 00:25:53,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.366e+02 3.609e+02 4.403e+02 5.419e+02 8.654e+02, threshold=8.805e+02, percent-clipped=2.0
2023-06-18 00:26:37,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.31 vs. limit=6.0
2023-06-18 00:27:17,065 INFO [train.py:996] (3/4) Epoch 1, batch 10750, loss[loss=0.3353, simple_loss=0.3457, pruned_loss=0.1625, over 20025.00 frames. ], tot_loss[loss=0.3478, simple_loss=0.389, pruned_loss=0.1533, over 4253583.62 frames. ], batch size: 702, lr: 3.40e-02, grad_scale: 16.0
2023-06-18 00:27:47,761 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 00:27:55,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=64620.0, ans=0.05
2023-06-18 00:29:06,563 INFO [train.py:996] (3/4) Epoch 1, batch 10800, loss[loss=0.3789, simple_loss=0.411, pruned_loss=0.1734, over 20653.00 frames. ], tot_loss[loss=0.3504, simple_loss=0.3925, pruned_loss=0.1542, over 4253681.14 frames. ], batch size: 609, lr: 3.40e-02, grad_scale: 32.0
2023-06-18 00:29:40,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=64860.0, ans=0.0
2023-06-18 00:30:01,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.483e+02 3.969e+02 4.979e+02 7.720e+02, threshold=7.938e+02, percent-clipped=0.0
2023-06-18 00:30:13,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=64920.0, ans=0.125
2023-06-18 00:30:31,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=64980.0, ans=0.125
2023-06-18 00:31:13,357 INFO [train.py:996] (3/4) Epoch 1, batch 10850, loss[loss=0.3648, simple_loss=0.4289, pruned_loss=0.1504, over 20795.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3948, pruned_loss=0.155, over 4251986.99 frames. ], batch size: 607, lr: 3.39e-02, grad_scale: 32.0
2023-06-18 00:31:39,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=65160.0, ans=0.0
2023-06-18 00:31:46,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=65160.0, ans=0.0
2023-06-18 00:33:12,047 INFO [train.py:996] (3/4) Epoch 1, batch 10900, loss[loss=0.3043, simple_loss=0.3746, pruned_loss=0.117, over 21745.00 frames. ], tot_loss[loss=0.3474, simple_loss=0.3892, pruned_loss=0.1528, over 4248243.78 frames. ], batch size: 282, lr: 3.39e-02, grad_scale: 32.0
2023-06-18 00:33:17,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=65400.0, ans=0.0
2023-06-18 00:33:35,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=65460.0, ans=0.0
2023-06-18 00:33:51,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.204e+02 4.085e+02 4.812e+02 7.317e+02, threshold=8.170e+02, percent-clipped=0.0
2023-06-18 00:34:36,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=65640.0, ans=0.07
2023-06-18 00:34:50,799 INFO [train.py:996] (3/4) Epoch 1, batch 10950, loss[loss=0.3468, simple_loss=0.3701, pruned_loss=0.1617, over 21534.00 frames. ], tot_loss[loss=0.3409, simple_loss=0.3835, pruned_loss=0.1492, over 4243693.62 frames. ], batch size: 414, lr: 3.38e-02, grad_scale: 32.0
2023-06-18 00:34:58,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=65700.0, ans=0.05
2023-06-18 00:35:05,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0
2023-06-18 00:36:45,679 INFO [train.py:996] (3/4) Epoch 1, batch 11000, loss[loss=0.3796, simple_loss=0.4125, pruned_loss=0.1733, over 21368.00 frames. ], tot_loss[loss=0.3425, simple_loss=0.3834, pruned_loss=0.1508, over 4258110.92 frames. ], batch size: 159, lr: 3.38e-02, grad_scale: 32.0
2023-06-18 00:37:35,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.527e+02 3.687e+02 4.482e+02 5.428e+02 9.093e+02, threshold=8.964e+02, percent-clipped=3.0
2023-06-18 00:38:13,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=66240.0, ans=0.05
2023-06-18 00:38:49,265 INFO [train.py:996] (3/4) Epoch 1, batch 11050, loss[loss=0.2966, simple_loss=0.3256, pruned_loss=0.1338, over 20763.00 frames. ], tot_loss[loss=0.3424, simple_loss=0.3808, pruned_loss=0.152, over 4261995.49 frames. ], batch size: 608, lr: 3.37e-02, grad_scale: 32.0
2023-06-18 00:39:01,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=66300.0, ans=0.5
2023-06-18 00:40:36,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=66540.0, ans=0.2
2023-06-18 00:40:44,262 INFO [train.py:996] (3/4) Epoch 1, batch 11100, loss[loss=0.3043, simple_loss=0.3431, pruned_loss=0.1327, over 21378.00 frames. ], tot_loss[loss=0.34, simple_loss=0.3775, pruned_loss=0.1512, over 4258095.83 frames. ], batch size: 131, lr: 3.37e-02, grad_scale: 32.0
2023-06-18 00:40:56,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66600.0, ans=0.1
2023-06-18 00:41:15,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=66660.0, ans=0.0
2023-06-18 00:41:17,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=66660.0, ans=0.125
2023-06-18 00:41:39,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.585e+02 3.202e+02 3.919e+02 4.833e+02 8.145e+02, threshold=7.838e+02, percent-clipped=0.0
2023-06-18 00:41:50,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=66720.0, ans=0.09899494936611666
2023-06-18 00:42:13,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0
2023-06-18 00:42:39,303 INFO [train.py:996] (3/4) Epoch 1, batch 11150, loss[loss=0.3078, simple_loss=0.3489, pruned_loss=0.1333, over 15355.00 frames. ], tot_loss[loss=0.3379, simple_loss=0.3748, pruned_loss=0.1505, over 4258244.19 frames. ], batch size: 61, lr: 3.36e-02, grad_scale: 32.0
2023-06-18 00:43:13,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=66960.0, ans=0.125
2023-06-18 00:44:09,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=67140.0, ans=0.0
2023-06-18 00:44:13,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0
2023-06-18 00:44:19,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=67140.0, ans=0.2
2023-06-18 00:44:27,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=67200.0, ans=0.125
2023-06-18 00:44:28,575 INFO [train.py:996] (3/4) Epoch 1, batch 11200, loss[loss=0.3763, simple_loss=0.3976, pruned_loss=0.1774, over 21491.00 frames. ], tot_loss[loss=0.3353, simple_loss=0.3735, pruned_loss=0.1486, over 4254591.80 frames. ], batch size: 441, lr: 3.36e-02, grad_scale: 32.0
2023-06-18 00:44:42,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=67260.0, ans=0.125
2023-06-18 00:44:52,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67260.0, ans=0.1
2023-06-18 00:45:07,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.477e+02 4.166e+02 5.140e+02 9.115e+02, threshold=8.331e+02, percent-clipped=3.0
2023-06-18 00:45:28,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=15.0
2023-06-18 00:45:43,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0
2023-06-18 00:46:05,261 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0
2023-06-18 00:46:05,994 INFO [train.py:996] (3/4) Epoch 1, batch 11250, loss[loss=0.3442, simple_loss=0.392, pruned_loss=0.1482, over 21332.00 frames. ], tot_loss[loss=0.3356, simple_loss=0.3732, pruned_loss=0.149, over 4253388.03 frames. ], batch size: 131, lr: 3.35e-02, grad_scale: 32.0
2023-06-18 00:46:08,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=12.0
2023-06-18 00:46:09,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67500.0, ans=0.1
2023-06-18 00:46:12,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0
2023-06-18 00:46:13,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=67500.0, ans=0.2
2023-06-18 00:46:14,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0
2023-06-18 00:46:32,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=12.0
2023-06-18 00:47:16,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=67680.0, ans=0.0
2023-06-18 00:47:23,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0
2023-06-18 00:47:36,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=67740.0, ans=0.125
2023-06-18 00:47:49,278 INFO [train.py:996] (3/4) Epoch 1, batch 11300, loss[loss=0.3673, simple_loss=0.4049, pruned_loss=0.1649, over 21505.00 frames. ], tot_loss[loss=0.3375, simple_loss=0.3747, pruned_loss=0.1502, over 4263897.48 frames. ], batch size: 471, lr: 3.35e-02, grad_scale: 32.0
2023-06-18 00:48:39,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67860.0, ans=0.1
2023-06-18 00:49:00,213 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.206e+02 3.828e+02 4.812e+02 8.998e+02, threshold=7.656e+02, percent-clipped=1.0
2023-06-18 00:49:11,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=67980.0, ans=0.0
2023-06-18 00:49:48,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=68040.0, ans=0.125
2023-06-18 00:50:09,180 INFO [train.py:996] (3/4) Epoch 1, batch 11350, loss[loss=0.4171, simple_loss=0.487, pruned_loss=0.1736, over 20886.00 frames. ], tot_loss[loss=0.3405, simple_loss=0.3791, pruned_loss=0.151, over 4267169.58 frames. ], batch size: 607, lr: 3.34e-02, grad_scale: 16.0
2023-06-18 00:50:31,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=68160.0, ans=0.125
2023-06-18 00:51:12,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68220.0, ans=0.1
2023-06-18 00:52:07,604 INFO [train.py:996] (3/4) Epoch 1, batch 11400, loss[loss=0.2928, simple_loss=0.349, pruned_loss=0.1183, over 21271.00 frames. ], tot_loss[loss=0.3494, simple_loss=0.3887, pruned_loss=0.155, over 4274086.10 frames. ], batch size: 176, lr: 3.34e-02, grad_scale: 16.0
2023-06-18 00:52:38,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=68460.0, ans=0.0
2023-06-18 00:53:16,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.896e+02 4.865e+02 6.013e+02 1.206e+03, threshold=9.731e+02, percent-clipped=12.0
2023-06-18 00:53:44,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=68580.0, ans=0.025
2023-06-18 00:54:10,461 INFO [train.py:996] (3/4) Epoch 1, batch 11450, loss[loss=0.4034, simple_loss=0.4336, pruned_loss=0.1866, over 21452.00 frames. ], tot_loss[loss=0.3487, simple_loss=0.3901, pruned_loss=0.1537, over 4272928.16 frames. ], batch size: 131, lr: 3.33e-02, grad_scale: 16.0
2023-06-18 00:54:45,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=68760.0, ans=0.0
2023-06-18 00:54:54,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=68760.0, ans=0.05
2023-06-18 00:55:11,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=68820.0, ans=0.125
2023-06-18 00:55:12,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=68820.0, ans=0.125
2023-06-18 00:55:38,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=68880.0, ans=0.125
2023-06-18 00:55:38,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=68880.0, ans=0.125
2023-06-18 00:56:02,445 INFO [train.py:996] (3/4) Epoch 1, batch 11500, loss[loss=0.3517, simple_loss=0.3932, pruned_loss=0.1551, over 21595.00 frames. ], tot_loss[loss=0.3508, simple_loss=0.3927, pruned_loss=0.1544, over 4271342.23 frames. ], batch size: 230, lr: 3.33e-02, grad_scale: 16.0
2023-06-18 00:56:58,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69060.0, ans=0.1
2023-06-18 00:57:13,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0
2023-06-18 00:57:20,751 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.435e+02 4.128e+02 5.314e+02 9.134e+02, threshold=8.255e+02, percent-clipped=0.0
2023-06-18 00:58:13,017 INFO [train.py:996] (3/4) Epoch 1, batch 11550, loss[loss=0.3952, simple_loss=0.4553, pruned_loss=0.1675, over 21635.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3978, pruned_loss=0.1534, over 4274502.95 frames. ], batch size: 389, lr: 3.32e-02, grad_scale: 16.0
2023-06-18 00:58:47,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=69360.0, ans=0.0
2023-06-18 00:59:07,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=69420.0, ans=0.2
2023-06-18 00:59:09,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=69480.0, ans=0.125
2023-06-18 00:59:18,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=69480.0, ans=0.0
2023-06-18 00:59:40,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=12.0
2023-06-18 01:00:05,179 INFO [train.py:996] (3/4) Epoch 1, batch 11600, loss[loss=0.3358, simple_loss=0.4044, pruned_loss=0.1337, over 21438.00 frames. ], tot_loss[loss=0.3559, simple_loss=0.4073, pruned_loss=0.1523, over 4269761.83 frames. ], batch size: 194, lr: 3.32e-02, grad_scale: 32.0
2023-06-18 01:00:44,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=69660.0, ans=0.125
2023-06-18 01:00:50,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.524e+02 3.816e+02 4.671e+02 6.267e+02 1.056e+03, threshold=9.343e+02, percent-clipped=9.0
2023-06-18 01:00:53,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=69720.0, ans=0.125
2023-06-18 01:01:10,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69780.0, ans=0.1
2023-06-18 01:01:57,958 INFO [train.py:996] (3/4) Epoch 1, batch 11650, loss[loss=0.314, simple_loss=0.3895, pruned_loss=0.1192, over 21277.00 frames. ], tot_loss[loss=0.3584, simple_loss=0.4125, pruned_loss=0.1522, over 4260887.21 frames. ], batch size: 143, lr: 3.31e-02, grad_scale: 32.0
2023-06-18 01:01:59,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0
2023-06-18 01:02:47,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=70020.0, ans=0.0
2023-06-18 01:02:50,905 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0
2023-06-18 01:03:06,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0
2023-06-18 01:03:15,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0
2023-06-18 01:03:37,388 INFO [train.py:996] (3/4) Epoch 1, batch 11700, loss[loss=0.3177, simple_loss=0.3474, pruned_loss=0.1439, over 21571.00 frames. ], tot_loss[loss=0.3547, simple_loss=0.4038, pruned_loss=0.1527, over 4253539.67 frames. ], batch size: 263, lr: 3.31e-02, grad_scale: 32.0
2023-06-18 01:04:02,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=70260.0, ans=0.0
2023-06-18 01:04:23,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.615e+02 4.235e+02 5.124e+02 6.443e+02 9.190e+02, threshold=1.025e+03, percent-clipped=0.0
2023-06-18 01:04:29,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=70320.0, ans=0.2
2023-06-18 01:04:34,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70380.0, ans=0.1
2023-06-18 01:04:34,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=70380.0, ans=0.125
2023-06-18 01:04:43,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=70380.0, ans=0.0
2023-06-18 01:05:13,827 INFO [train.py:996] (3/4) Epoch 1, batch 11750, loss[loss=0.2961, simple_loss=0.3312, pruned_loss=0.1305, over 21410.00 frames. ], tot_loss[loss=0.3502, simple_loss=0.3944, pruned_loss=0.1531, over 4260288.08 frames. ], batch size: 131, lr: 3.30e-02, grad_scale: 32.0
2023-06-18 01:05:44,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70560.0, ans=0.125
2023-06-18 01:05:54,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=70620.0, ans=0.5
2023-06-18 01:05:58,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=70620.0, ans=0.0
2023-06-18 01:07:13,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=70740.0, ans=0.125
2023-06-18 01:07:21,214 INFO [train.py:996] (3/4) Epoch 1, batch 11800, loss[loss=0.3972, simple_loss=0.4537, pruned_loss=0.1703, over 19839.00 frames. ], tot_loss[loss=0.3553, simple_loss=0.3974, pruned_loss=0.1566, over 4264055.80 frames. ], batch size: 704, lr: 3.30e-02, grad_scale: 32.0
2023-06-18 01:08:00,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=70860.0, ans=0.125
2023-06-18 01:08:10,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=70860.0, ans=0.125
2023-06-18 01:08:23,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 3.432e+02 4.026e+02 5.092e+02 8.722e+02, threshold=8.051e+02, percent-clipped=0.0
2023-06-18 01:09:07,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=70980.0, ans=0.125
2023-06-18 01:09:09,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.67 vs. limit=10.0
2023-06-18 01:09:32,037 INFO [train.py:996] (3/4) Epoch 1, batch 11850, loss[loss=0.3142, simple_loss=0.3718, pruned_loss=0.1282, over 21761.00 frames. ], tot_loss[loss=0.3531, simple_loss=0.3975, pruned_loss=0.1544, over 4272422.14 frames. ], batch size: 247, lr: 3.29e-02, grad_scale: 32.0
2023-06-18 01:11:20,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=71400.0, ans=0.2
2023-06-18 01:11:21,899 INFO [train.py:996] (3/4) Epoch 1, batch 11900, loss[loss=0.2996, simple_loss=0.3765, pruned_loss=0.1114, over 21784.00 frames. ], tot_loss[loss=0.3483, simple_loss=0.3958, pruned_loss=0.1504, over 4273384.19 frames. ], batch size: 282, lr: 3.29e-02, grad_scale: 16.0
2023-06-18 01:11:41,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0
2023-06-18 01:12:27,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.389e+02 4.230e+02 4.928e+02 7.939e+02, threshold=8.459e+02, percent-clipped=0.0
2023-06-18 01:12:52,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71520.0, ans=0.1
2023-06-18 01:13:28,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=71640.0, ans=0.125
2023-06-18 01:13:35,573 INFO [train.py:996] (3/4) Epoch 1, batch 11950, loss[loss=0.3122, simple_loss=0.3895, pruned_loss=0.1174, over 21623.00 frames. ], tot_loss[loss=0.3437, simple_loss=0.3953, pruned_loss=0.1461, over 4264973.48 frames. ], batch size: 247, lr: 3.28e-02, grad_scale: 16.0
2023-06-18 01:14:05,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=71760.0, ans=0.2
2023-06-18 01:14:41,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=71820.0, ans=0.125
2023-06-18 01:15:43,909 INFO [train.py:996] (3/4) Epoch 1, batch 12000, loss[loss=0.3307, simple_loss=0.3722, pruned_loss=0.1446, over 15346.00 frames. ], tot_loss[loss=0.3374, simple_loss=0.3869, pruned_loss=0.1439, over 4257837.15 frames. ], batch size: 60, lr: 3.28e-02, grad_scale: 32.0
2023-06-18 01:15:43,909 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-18 01:16:37,939 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.5565, 3.4429, 3.3468, 3.2695], device='cuda:3')
2023-06-18 01:16:39,562 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3214, simple_loss=0.4077, pruned_loss=0.1176, over 1796401.00 frames.
2023-06-18 01:16:39,563 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-18 01:17:26,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.117e+02 3.794e+02 4.594e+02 6.987e+02, threshold=7.589e+02, percent-clipped=0.0
2023-06-18 01:17:35,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=72180.0, ans=0.2
2023-06-18 01:18:08,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0
2023-06-18 01:18:09,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=72240.0, ans=0.04949747468305833
2023-06-18 01:18:09,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72240.0, ans=0.1
2023-06-18 01:18:16,518 INFO [train.py:996] (3/4) Epoch 1, batch 12050, loss[loss=0.3488, simple_loss=0.3903, pruned_loss=0.1537, over 21867.00 frames. ], tot_loss[loss=0.3396, simple_loss=0.385, pruned_loss=0.1472, over 4269322.96 frames. ], batch size: 298, lr: 3.27e-02, grad_scale: 32.0
2023-06-18 01:19:14,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=72420.0, ans=0.125
2023-06-18 01:19:17,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0
2023-06-18 01:19:35,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=72480.0, ans=0.0
2023-06-18 01:20:29,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0
2023-06-18 01:20:39,878 INFO [train.py:996] (3/4) Epoch 1, batch 12100, loss[loss=0.3924, simple_loss=0.4219, pruned_loss=0.1814, over 21379.00 frames. ], tot_loss[loss=0.3538, simple_loss=0.398, pruned_loss=0.1548, over 4272472.63 frames. ], batch size: 548, lr: 3.27e-02, grad_scale: 32.0
2023-06-18 01:20:46,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72600.0, ans=0.1
2023-06-18 01:21:32,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.780e+02 3.918e+02 4.969e+02 6.272e+02 1.033e+03, threshold=9.938e+02, percent-clipped=11.0
2023-06-18 01:22:52,401 INFO [train.py:996] (3/4) Epoch 1, batch 12150, loss[loss=0.4053, simple_loss=0.4714, pruned_loss=0.1696, over 21673.00 frames. ], tot_loss[loss=0.3557, simple_loss=0.4021, pruned_loss=0.1547, over 4275055.12 frames. ], batch size: 441, lr: 3.26e-02, grad_scale: 32.0
2023-06-18 01:23:22,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=72900.0, ans=0.05
2023-06-18 01:23:42,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=72960.0, ans=0.125
2023-06-18 01:23:46,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0
2023-06-18 01:25:22,165 INFO [train.py:996] (3/4) Epoch 1, batch 12200, loss[loss=0.3521, simple_loss=0.384, pruned_loss=0.1601, over 21757.00 frames. ], tot_loss[loss=0.3521, simple_loss=0.3972, pruned_loss=0.1535, over 4274876.17 frames. ], batch size: 351, lr: 3.26e-02, grad_scale: 32.0
2023-06-18 01:25:28,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=73200.0, ans=0.125
2023-06-18 01:25:47,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=73260.0, ans=0.0
2023-06-18 01:25:51,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0
2023-06-18 01:26:10,495 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.716e+02 3.638e+02 4.548e+02 5.792e+02 8.635e+02, threshold=9.096e+02, percent-clipped=0.0
2023-06-18 01:26:42,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=73440.0, ans=0.05
2023-06-18 01:26:57,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.72 vs. limit=22.5
2023-06-18 01:27:17,666 INFO [train.py:996] (3/4) Epoch 1, batch 12250, loss[loss=0.3, simple_loss=0.3711, pruned_loss=0.1144, over 21212.00 frames. ], tot_loss[loss=0.341, simple_loss=0.3873, pruned_loss=0.1474, over 4271784.26 frames. ], batch size: 548, lr: 3.25e-02, grad_scale: 32.0
2023-06-18 01:27:40,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=73560.0, ans=0.0
2023-06-18 01:28:03,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=73620.0, ans=0.2
2023-06-18 01:28:14,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=73680.0, ans=0.0
2023-06-18 01:29:07,455 INFO [train.py:996] (3/4) Epoch 1, batch 12300, loss[loss=0.2432, simple_loss=0.3083, pruned_loss=0.08907, over 21375.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3746, pruned_loss=0.137, over 4272941.25 frames. ], batch size: 131, lr: 3.25e-02, grad_scale: 32.0
2023-06-18 01:29:08,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0
2023-06-18 01:29:37,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=73860.0, ans=0.125
2023-06-18 01:29:37,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=73860.0, ans=0.2
2023-06-18 01:30:09,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.199e+02 4.044e+02 5.045e+02 8.506e+02, threshold=8.089e+02, percent-clipped=0.0
2023-06-18 01:30:12,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0
2023-06-18 01:30:16,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73920.0, ans=0.1
2023-06-18 01:31:15,746 INFO [train.py:996] (3/4) Epoch 1, batch 12350, loss[loss=0.4186, simple_loss=0.4444, pruned_loss=0.1963, over 21852.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3775, pruned_loss=0.1365, over 4271059.30 frames. ], batch size: 414, lr: 3.24e-02, grad_scale: 32.0
2023-06-18 01:31:26,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=74100.0, ans=0.125
2023-06-18 01:31:28,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.34 vs. limit=6.0
2023-06-18 01:32:18,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=74220.0, ans=0.125
2023-06-18 01:32:27,357 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 01:32:28,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=74280.0, ans=0.0
2023-06-18 01:33:09,311 INFO [train.py:996] (3/4) Epoch 1, batch 12400, loss[loss=0.4458, simple_loss=0.4466, pruned_loss=0.2225, over 21758.00 frames. ], tot_loss[loss=0.334, simple_loss=0.3819, pruned_loss=0.143, over 4283320.71 frames. ], batch size: 508, lr: 3.24e-02, grad_scale: 32.0
2023-06-18 01:33:27,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=74400.0, ans=0.125
2023-06-18 01:33:29,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=74400.0, ans=0.125
2023-06-18 01:34:05,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=74520.0, ans=0.2
2023-06-18 01:34:07,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.455e+02 3.656e+02 4.405e+02 5.426e+02 8.475e+02, threshold=8.810e+02, percent-clipped=2.0
2023-06-18 01:35:09,980 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 01:35:23,700 INFO [train.py:996] (3/4) Epoch 1, batch 12450, loss[loss=0.3172, simple_loss=0.3422, pruned_loss=0.1461, over 20050.00 frames. ], tot_loss[loss=0.343, simple_loss=0.3879, pruned_loss=0.1491, over 4287923.02 frames. ], batch size: 702, lr: 3.23e-02, grad_scale: 32.0
2023-06-18 01:35:47,733 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.06 vs. limit=6.0
2023-06-18 01:35:56,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.04 vs. limit=15.0
2023-06-18 01:36:01,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0
2023-06-18 01:36:28,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=74820.0, ans=0.04949747468305833
2023-06-18 01:36:30,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=74820.0, ans=0.2
2023-06-18 01:37:01,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=74880.0, ans=0.125
2023-06-18 01:37:44,975 INFO [train.py:996] (3/4) Epoch 1, batch 12500, loss[loss=0.4285, simple_loss=0.478, pruned_loss=0.1895, over 21616.00 frames. ], tot_loss[loss=0.3558, simple_loss=0.4009, pruned_loss=0.1553, over 4288426.46 frames. ], batch size: 389, lr: 3.23e-02, grad_scale: 32.0
2023-06-18 01:37:48,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=75000.0, ans=0.125
2023-06-18 01:38:51,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=75120.0, ans=0.125
2023-06-18 01:38:52,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.484e+02 4.470e+02 5.562e+02 9.789e+02, threshold=8.941e+02, percent-clipped=2.0
2023-06-18 01:38:54,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=75120.0, ans=0.125
2023-06-18 01:39:16,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0
2023-06-18 01:40:06,038 INFO [train.py:996] (3/4) Epoch 1, batch 12550, loss[loss=0.3277, simple_loss=0.386, pruned_loss=0.1347, over 21655.00 frames. ], tot_loss[loss=0.3627, simple_loss=0.4081, pruned_loss=0.1587, over 4285923.23 frames. ], batch size: 263, lr: 3.22e-02, grad_scale: 16.0
2023-06-18 01:40:42,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=75360.0, ans=0.04949747468305833
2023-06-18 01:41:03,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=75420.0, ans=0.125
2023-06-18 01:41:40,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=75480.0, ans=0.125
2023-06-18 01:41:46,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75540.0, ans=0.1
2023-06-18 01:42:05,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=75600.0, ans=0.0
2023-06-18 01:42:06,789 INFO [train.py:996] (3/4) Epoch 1, batch 12600, loss[loss=0.2495, simple_loss=0.3031, pruned_loss=0.09793, over 21870.00 frames. ], tot_loss[loss=0.356, simple_loss=0.404, pruned_loss=0.154, over 4282470.91 frames. ], batch size: 98, lr: 3.22e-02, grad_scale: 16.0
2023-06-18 01:42:12,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=75600.0, ans=0.04949747468305833
2023-06-18 01:42:25,814 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 01:42:27,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=75600.0, ans=0.125
2023-06-18 01:43:02,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=75660.0, ans=0.125
2023-06-18 01:43:24,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 3.361e+02 4.155e+02 4.807e+02 7.710e+02, threshold=8.311e+02, percent-clipped=0.0
2023-06-18 01:43:26,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=75720.0, ans=0.0
2023-06-18 01:43:39,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=75780.0, ans=0.125
2023-06-18 01:43:45,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=75780.0, ans=0.0
2023-06-18 01:43:53,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=75840.0, ans=0.5
2023-06-18 01:44:08,039 INFO [train.py:996] (3/4) Epoch 1, batch 12650, loss[loss=0.3163, simple_loss=0.3685, pruned_loss=0.1321, over 21386.00 frames. ], tot_loss[loss=0.3426, simple_loss=0.3923, pruned_loss=0.1465, over 4273494.26 frames. ], batch size: 548, lr: 3.21e-02, grad_scale: 16.0
2023-06-18 01:45:37,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0
2023-06-18 01:45:38,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=76140.0, ans=10.0
2023-06-18 01:45:38,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=76140.0, ans=0.125
2023-06-18 01:45:45,871 INFO [train.py:996] (3/4) Epoch 1, batch 12700, loss[loss=0.3643, simple_loss=0.3983, pruned_loss=0.1651, over 21339.00 frames. ], tot_loss[loss=0.3472, simple_loss=0.3928, pruned_loss=0.1508, over 4283179.13 frames. ], batch size: 176, lr: 3.21e-02, grad_scale: 16.0
2023-06-18 01:46:12,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0
2023-06-18 01:46:39,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 4.163e+02 4.705e+02 5.622e+02 1.035e+03, threshold=9.411e+02, percent-clipped=4.0
2023-06-18 01:46:44,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=76320.0, ans=0.125
2023-06-18 01:46:49,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=76380.0, ans=0.025
2023-06-18 01:46:58,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=76380.0, ans=0.2
2023-06-18 01:47:12,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=76440.0, ans=0.2
2023-06-18 01:47:16,592 INFO [train.py:996] (3/4) Epoch 1, batch 12750, loss[loss=0.3569, simple_loss=0.3967, pruned_loss=0.1586, over 21934.00 frames. ], tot_loss[loss=0.3486, simple_loss=0.394, pruned_loss=0.1515, over 4283603.62 frames. ], batch size: 113, lr: 3.20e-02, grad_scale: 16.0
2023-06-18 01:48:03,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=76560.0, ans=0.2
2023-06-18 01:48:51,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=76680.0, ans=0.125
2023-06-18 01:48:53,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=76680.0, ans=0.125
2023-06-18 01:49:32,645 INFO [train.py:996] (3/4) Epoch 1, batch 12800, loss[loss=0.3716, simple_loss=0.4021, pruned_loss=0.1706, over 21824.00 frames. ], tot_loss[loss=0.3518, simple_loss=0.3949, pruned_loss=0.1543, over 4288153.95 frames. ], batch size: 107, lr: 3.20e-02, grad_scale: 32.0
2023-06-18 01:50:33,552 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.808e+02 4.446e+02 5.808e+02 1.607e+03, threshold=8.892e+02, percent-clipped=7.0
2023-06-18 01:50:47,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=76920.0, ans=0.0
2023-06-18 01:50:52,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=76980.0, ans=0.0
2023-06-18 01:51:33,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=77040.0, ans=0.025
2023-06-18 01:51:59,030 INFO [train.py:996] (3/4) Epoch 1, batch 12850, loss[loss=0.3196, simple_loss=0.3913, pruned_loss=0.1239, over 21755.00 frames. ], tot_loss[loss=0.3564, simple_loss=0.3987, pruned_loss=0.157, over 4284469.90 frames. ], batch size: 351, lr: 3.19e-02, grad_scale: 32.0
2023-06-18 01:52:02,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=77100.0, ans=0.2
2023-06-18 01:52:05,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=77100.0, ans=0.125
2023-06-18 01:52:39,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=77160.0, ans=0.125
2023-06-18 01:52:49,186 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 01:53:51,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=77340.0, ans=0.0
2023-06-18 01:54:12,623 INFO [train.py:996] (3/4) Epoch 1, batch 12900, loss[loss=0.2884, simple_loss=0.3573, pruned_loss=0.1098, over 21681.00 frames. ], tot_loss[loss=0.3479, simple_loss=0.3943, pruned_loss=0.1507, over 4275695.26 frames. ], batch size: 247, lr: 3.19e-02, grad_scale: 32.0
2023-06-18 01:54:50,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.899e+02 3.578e+02 4.051e+02 7.858e+02, threshold=7.156e+02, percent-clipped=0.0
2023-06-18 01:55:19,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=77640.0, ans=0.125
2023-06-18 01:55:57,445 INFO [train.py:996] (3/4) Epoch 1, batch 12950, loss[loss=0.319, simple_loss=0.3678, pruned_loss=0.1351, over 21736.00 frames. ], tot_loss[loss=0.3423, simple_loss=0.3906, pruned_loss=0.147, over 4279487.06 frames. ], batch size: 298, lr: 3.19e-02, grad_scale: 32.0
2023-06-18 01:57:34,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=77880.0, ans=0.125
2023-06-18 01:57:34,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=77880.0, ans=0.125
2023-06-18 01:58:03,850 INFO [train.py:996] (3/4) Epoch 1, batch 13000, loss[loss=0.2698, simple_loss=0.3434, pruned_loss=0.09808, over 21739.00 frames. ], tot_loss[loss=0.3423, simple_loss=0.3906, pruned_loss=0.147, over 4279790.02 frames. ], batch size: 332, lr: 3.18e-02, grad_scale: 32.0
2023-06-18 01:58:11,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=78000.0, ans=0.0
2023-06-18 01:58:13,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0
2023-06-18 01:58:54,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.081e+02 4.107e+02 5.952e+02 9.573e+02, threshold=8.214e+02, percent-clipped=12.0
2023-06-18 01:58:54,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=78120.0, ans=0.0
2023-06-18 01:59:06,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=78120.0, ans=0.125
2023-06-18 01:59:38,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=78180.0, ans=0.0
2023-06-18 01:59:41,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=78240.0, ans=0.125
2023-06-18 02:00:05,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=78240.0, ans=0.125
2023-06-18 02:00:08,202 INFO [train.py:996] (3/4) Epoch 1, batch 13050, loss[loss=0.3321, simple_loss=0.3775, pruned_loss=0.1433, over 21766.00 frames. ], tot_loss[loss=0.3338, simple_loss=0.3831, pruned_loss=0.1423, over 4276031.41 frames. ], batch size: 247, lr: 3.18e-02, grad_scale: 32.0
2023-06-18 02:00:17,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=78300.0, ans=0.125
2023-06-18 02:01:31,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=78540.0, ans=0.0
2023-06-18 02:01:56,470 INFO [train.py:996] (3/4) Epoch 1, batch 13100, loss[loss=0.3513, simple_loss=0.4004, pruned_loss=0.1511, over 21358.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3869, pruned_loss=0.1442, over 4283178.53 frames. ], batch size: 159, lr: 3.17e-02, grad_scale: 32.0
2023-06-18 02:02:56,736 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5
2023-06-18 02:02:57,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=78660.0, ans=0.125
2023-06-18 02:03:04,513 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.92 vs. limit=8.0
2023-06-18 02:03:15,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0
2023-06-18 02:03:16,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.573e+02 4.115e+02 5.192e+02 9.461e+02, threshold=8.229e+02, percent-clipped=5.0
2023-06-18 02:03:18,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=78720.0, ans=0.04949747468305833
2023-06-18 02:03:50,853 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 02:04:04,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=78900.0, ans=0.125
2023-06-18 02:04:05,261 INFO [train.py:996] (3/4) Epoch 1, batch 13150, loss[loss=0.4164, simple_loss=0.437, pruned_loss=0.1978, over 21374.00 frames. ], tot_loss[loss=0.3469, simple_loss=0.3932, pruned_loss=0.1503, over 4280562.13 frames. ], batch size: 507, lr: 3.17e-02, grad_scale: 32.0
2023-06-18 02:05:27,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=79020.0, ans=0.07
2023-06-18 02:05:38,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79080.0, ans=0.1
2023-06-18 02:05:52,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79140.0, ans=0.1
2023-06-18 02:06:18,759 INFO [train.py:996] (3/4) Epoch 1, batch 13200, loss[loss=0.3611, simple_loss=0.394, pruned_loss=0.1641, over 22015.00 frames. ], tot_loss[loss=0.3453, simple_loss=0.391, pruned_loss=0.1498, over 4281437.54 frames. ], batch size: 317, lr: 3.16e-02, grad_scale: 32.0
2023-06-18 02:07:21,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=79260.0, ans=0.2
2023-06-18 02:07:22,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79260.0, ans=0.1
2023-06-18 02:07:23,388 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0
2023-06-18 02:07:41,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79320.0, ans=0.1
2023-06-18 02:07:43,689 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 3.259e+02 3.924e+02 5.244e+02 8.674e+02, threshold=7.848e+02, percent-clipped=1.0
2023-06-18 02:07:58,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79380.0, ans=0.1
2023-06-18 02:08:12,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=79440.0, ans=0.0
2023-06-18 02:08:22,173 INFO [train.py:996] (3/4) Epoch 1, batch 13250, loss[loss=0.3219, simple_loss=0.3835, pruned_loss=0.1302, over 21785.00 frames. ], tot_loss[loss=0.3474, simple_loss=0.3913, pruned_loss=0.1517, over 4287008.71 frames. ], batch size: 247, lr: 3.16e-02, grad_scale: 32.0
2023-06-18 02:09:54,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=79680.0, ans=0.125
2023-06-18 02:10:28,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=79740.0, ans=0.0
2023-06-18 02:10:54,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=79800.0, ans=0.125
2023-06-18 02:10:55,429 INFO [train.py:996] (3/4) Epoch 1, batch 13300, loss[loss=0.3703, simple_loss=0.4153, pruned_loss=0.1627, over 21516.00 frames. ], tot_loss[loss=0.3476, simple_loss=0.3935, pruned_loss=0.1508, over 4293012.20 frames. ], batch size: 131, lr: 3.15e-02, grad_scale: 32.0
], batch size: 131, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 02:11:00,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=79800.0, ans=0.2 2023-06-18 02:11:03,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=79800.0, ans=0.125 2023-06-18 02:11:29,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=79860.0, ans=0.125 2023-06-18 02:11:30,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=79860.0, ans=0.125 2023-06-18 02:11:42,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=79920.0, ans=0.125 2023-06-18 02:11:49,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.863e+02 4.396e+02 5.432e+02 9.138e+02, threshold=8.792e+02, percent-clipped=4.0 2023-06-18 02:12:23,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-18 02:13:14,162 INFO [train.py:996] (3/4) Epoch 1, batch 13350, loss[loss=0.299, simple_loss=0.3418, pruned_loss=0.1281, over 16349.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3976, pruned_loss=0.1535, over 4287289.74 frames. ], batch size: 60, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 02:14:16,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=80220.0, ans=0.2 2023-06-18 02:15:05,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=80340.0, ans=0.2 2023-06-18 02:15:43,160 INFO [train.py:996] (3/4) Epoch 1, batch 13400, loss[loss=0.3347, simple_loss=0.3799, pruned_loss=0.1448, over 21804.00 frames. ], tot_loss[loss=0.355, simple_loss=0.3993, pruned_loss=0.1554, over 4289159.30 frames. ], batch size: 247, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 02:16:33,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.555e+02 3.828e+02 4.349e+02 5.476e+02 1.150e+03, threshold=8.698e+02, percent-clipped=4.0 2023-06-18 02:16:46,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=80520.0, ans=0.05 2023-06-18 02:16:58,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=80580.0, ans=0.0 2023-06-18 02:17:09,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=80580.0, ans=0.0 2023-06-18 02:17:22,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.93 vs. limit=15.0 2023-06-18 02:17:31,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=80640.0, ans=0.125 2023-06-18 02:17:55,764 INFO [train.py:996] (3/4) Epoch 1, batch 13450, loss[loss=0.3298, simple_loss=0.3749, pruned_loss=0.1424, over 21696.00 frames. ], tot_loss[loss=0.3592, simple_loss=0.4006, pruned_loss=0.1589, over 4280692.00 frames. 
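The ScheduledFloat entries print a hyperparameter's current value (ans) as a function of batch_count; in scaling.py these are piecewise-linear schedules over batch count, and by this point in training most have flattened to their final value (out_combiner.scale_min is pinned at 0.2 above). A minimal sketch of the interpolation; the breakpoints below are inferred from this run's logged values, not quoted from the recipe:

```python
def scheduled_float(batch_count: float, *points: tuple) -> float:
    """Piecewise-linear schedule over batch_count, clamped outside the
    given breakpoints, e.g. points = ((0.0, 0.9), (20000.0, 0.2))."""
    pts = sorted(points)
    if batch_count <= pts[0][0]:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return pts[-1][1]

# Consistent with the scale_min values logged in this run (ans=0.2 here):
assert scheduled_float(79800.0, (0.0, 0.9), (20000.0, 0.2)) == 0.2
```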
], batch size: 298, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 02:18:15,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=80760.0, ans=0.125 2023-06-18 02:19:07,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=80940.0, ans=0.2 2023-06-18 02:19:24,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=80940.0, ans=0.125 2023-06-18 02:19:33,605 INFO [train.py:996] (3/4) Epoch 1, batch 13500, loss[loss=0.4424, simple_loss=0.4579, pruned_loss=0.2134, over 21448.00 frames. ], tot_loss[loss=0.3449, simple_loss=0.3867, pruned_loss=0.1515, over 4279602.36 frames. ], batch size: 509, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 02:20:48,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=81120.0, ans=0.025 2023-06-18 02:20:49,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 3.441e+02 4.151e+02 5.577e+02 1.126e+03, threshold=8.302e+02, percent-clipped=5.0 2023-06-18 02:21:30,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=81180.0, ans=0.0 2023-06-18 02:21:50,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=81240.0, ans=0.125 2023-06-18 02:22:09,583 INFO [train.py:996] (3/4) Epoch 1, batch 13550, loss[loss=0.3191, simple_loss=0.3757, pruned_loss=0.1312, over 21785.00 frames. ], tot_loss[loss=0.3489, simple_loss=0.3937, pruned_loss=0.152, over 4284473.39 frames. ], batch size: 124, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 02:23:08,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=81420.0, ans=0.2 2023-06-18 02:23:11,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=81420.0, ans=0.2 2023-06-18 02:24:25,308 INFO [train.py:996] (3/4) Epoch 1, batch 13600, loss[loss=0.3355, simple_loss=0.3742, pruned_loss=0.1485, over 21341.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.397, pruned_loss=0.1537, over 4287180.48 frames. ], batch size: 176, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 02:24:25,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81600.0, ans=0.1 2023-06-18 02:24:43,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=81660.0, ans=0.0 2023-06-18 02:24:54,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81660.0, ans=0.1 2023-06-18 02:25:03,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=81660.0, ans=0.015 2023-06-18 02:25:25,450 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.459e+02 4.202e+02 5.582e+02 9.308e+02, threshold=8.405e+02, percent-clipped=4.0 2023-06-18 02:26:10,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=81840.0, ans=0.125 2023-06-18 02:26:25,970 INFO [train.py:996] (3/4) Epoch 1, batch 13650, loss[loss=0.3031, simple_loss=0.3464, pruned_loss=0.1299, over 21775.00 frames. 
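The lr field decays slowly across these headers (3.14e-02 down to 3.13e-02 over a hundred batches), driven by base_lr=0.045, lr_batches=7500 and lr_epochs=1.5 from the config. icefall's scheduler for this is called Eden; the formula below is my understanding of it and should be treated as an assumption, not a quote of optim.py:

```python
def eden_lr(base_lr: float, step: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    # lr decays smoothly in both the batch index and the (fractional) epoch;
    # at step 0, epoch 0 it equals base_lr.
    batch_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
```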
], tot_loss[loss=0.3444, simple_loss=0.3901, pruned_loss=0.1493, over 4292406.78 frames. ], batch size: 371, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 02:27:08,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81960.0, ans=0.1 2023-06-18 02:27:35,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=82020.0, ans=0.125 2023-06-18 02:27:53,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-18 02:28:02,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=82080.0, ans=0.0 2023-06-18 02:28:04,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=82080.0, ans=0.0 2023-06-18 02:28:26,806 INFO [train.py:996] (3/4) Epoch 1, batch 13700, loss[loss=0.2806, simple_loss=0.3122, pruned_loss=0.1245, over 16449.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.3815, pruned_loss=0.148, over 4278557.21 frames. ], batch size: 64, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 02:29:09,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82260.0, ans=0.1 2023-06-18 02:29:19,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=82260.0, ans=0.125 2023-06-18 02:29:46,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.386e+02 3.308e+02 4.161e+02 5.184e+02 1.059e+03, threshold=8.322e+02, percent-clipped=1.0 2023-06-18 02:30:16,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-18 02:30:28,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=82440.0, ans=0.0 2023-06-18 02:30:39,907 INFO [train.py:996] (3/4) Epoch 1, batch 13750, loss[loss=0.3756, simple_loss=0.4178, pruned_loss=0.1667, over 21593.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3813, pruned_loss=0.1465, over 4271008.10 frames. ], batch size: 442, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 02:31:26,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=82560.0, ans=0.125 2023-06-18 02:32:21,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=82680.0, ans=0.0 2023-06-18 02:32:27,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=82680.0, ans=0.2 2023-06-18 02:33:17,446 INFO [train.py:996] (3/4) Epoch 1, batch 13800, loss[loss=0.3551, simple_loss=0.4321, pruned_loss=0.139, over 21802.00 frames. ], tot_loss[loss=0.339, simple_loss=0.3875, pruned_loss=0.1452, over 4264966.54 frames. ], batch size: 316, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 02:33:21,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.47 vs. 
limit=10.0 2023-06-18 02:33:23,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=82800.0, ans=0.0 2023-06-18 02:33:57,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=82860.0, ans=0.015 2023-06-18 02:34:28,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.529e+02 4.559e+02 5.712e+02 9.830e+02, threshold=9.119e+02, percent-clipped=1.0 2023-06-18 02:34:33,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=82920.0, ans=0.125 2023-06-18 02:35:49,008 INFO [train.py:996] (3/4) Epoch 1, batch 13850, loss[loss=0.3735, simple_loss=0.4177, pruned_loss=0.1647, over 21432.00 frames. ], tot_loss[loss=0.3429, simple_loss=0.3932, pruned_loss=0.1463, over 4271347.45 frames. ], batch size: 211, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 02:35:51,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=83100.0, ans=0.0 2023-06-18 02:36:07,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-18 02:36:11,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=83160.0, ans=0.2 2023-06-18 02:36:58,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=83280.0, ans=0.125 2023-06-18 02:37:05,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-18 02:37:55,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=83340.0, ans=0.0 2023-06-18 02:37:59,149 INFO [train.py:996] (3/4) Epoch 1, batch 13900, loss[loss=0.3668, simple_loss=0.3981, pruned_loss=0.1677, over 21855.00 frames. ], tot_loss[loss=0.3504, simple_loss=0.3976, pruned_loss=0.1516, over 4277698.54 frames. ], batch size: 332, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 02:37:59,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=83400.0, ans=0.2 2023-06-18 02:38:00,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.46 vs. limit=6.0 2023-06-18 02:38:32,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=83460.0, ans=0.04949747468305833 2023-06-18 02:38:46,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.741e+02 3.529e+02 4.116e+02 5.133e+02 1.052e+03, threshold=8.231e+02, percent-clipped=3.0 2023-06-18 02:39:43,492 INFO [train.py:996] (3/4) Epoch 1, batch 13950, loss[loss=0.3464, simple_loss=0.3961, pruned_loss=0.1483, over 21896.00 frames. ], tot_loss[loss=0.3555, simple_loss=0.4009, pruned_loss=0.1551, over 4279213.14 frames. 
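The Whitening lines fire only when a module's whiteness metric exceeds its limit. As I read scaling.py, the metric is the ratio of the mean squared eigenvalue of the activation covariance to the squared mean eigenvalue, computed per group: it equals 1.0 for a perfectly white (isotropic) covariance and grows as energy concentrates in a few directions, and crossing the limit triggers a gradient penalty that pushes it back. A hedged sketch of the metric:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """x: (num_frames, num_channels). Returns the mean over groups of
    E[lambda^2] / E[lambda]^2 for the eigenvalues lambda of each group's
    feature covariance; 1.0 means perfectly white."""
    num_frames, num_channels = x.shape
    g = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, g).transpose(0, 1)   # (groups, N, g)
    cov = torch.matmul(x.transpose(1, 2), x) / num_frames      # (groups, g, g)
    # trace(C) = sum of eigenvalues; trace(C @ C) = sum of squared eigenvalues.
    tr_c = cov.diagonal(dim1=1, dim2=2).sum(dim=1)
    tr_c2 = (cov * cov.transpose(1, 2)).sum(dim=(1, 2))
    return ((tr_c2 / g) / (tr_c / g) ** 2).mean()
```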
], batch size: 118, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 02:40:31,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=83760.0, ans=0.0 2023-06-18 02:40:37,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=83820.0, ans=0.125 2023-06-18 02:40:41,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=83820.0, ans=0.125 2023-06-18 02:40:49,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=83820.0, ans=0.1 2023-06-18 02:41:04,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=83880.0, ans=0.125 2023-06-18 02:41:13,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=83880.0, ans=0.125 2023-06-18 02:41:58,060 INFO [train.py:996] (3/4) Epoch 1, batch 14000, loss[loss=0.311, simple_loss=0.3631, pruned_loss=0.1295, over 21412.00 frames. ], tot_loss[loss=0.3495, simple_loss=0.396, pruned_loss=0.1515, over 4277954.89 frames. ], batch size: 548, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 02:42:41,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 3.528e+02 4.428e+02 5.328e+02 9.334e+02, threshold=8.857e+02, percent-clipped=2.0 2023-06-18 02:43:49,529 INFO [train.py:996] (3/4) Epoch 1, batch 14050, loss[loss=0.2733, simple_loss=0.3355, pruned_loss=0.1056, over 21758.00 frames. ], tot_loss[loss=0.3404, simple_loss=0.391, pruned_loss=0.1449, over 4263257.24 frames. ], batch size: 282, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 02:43:55,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=84300.0, ans=0.125 2023-06-18 02:44:34,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84420.0, ans=0.1 2023-06-18 02:45:21,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84540.0, ans=0.1 2023-06-18 02:45:44,406 INFO [train.py:996] (3/4) Epoch 1, batch 14100, loss[loss=0.2891, simple_loss=0.321, pruned_loss=0.1285, over 15008.00 frames. ], tot_loss[loss=0.3366, simple_loss=0.3837, pruned_loss=0.1447, over 4263118.36 frames. ], batch size: 61, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 02:46:07,840 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:46:47,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.207e+02 3.554e+02 4.205e+02 5.494e+02 8.066e+02, threshold=8.411e+02, percent-clipped=0.0 2023-06-18 02:46:49,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=84720.0, ans=0.0 2023-06-18 02:47:00,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=84780.0, ans=0.0 2023-06-18 02:47:39,389 INFO [train.py:996] (3/4) Epoch 1, batch 14150, loss[loss=0.3997, simple_loss=0.4372, pruned_loss=0.1811, over 21495.00 frames. ], tot_loss[loss=0.3401, simple_loss=0.3873, pruned_loss=0.1465, over 4263786.54 frames. 
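The batch size field swings from 61 to 548 within this stretch because batches are assembled by total audio duration, not utterance count: max_duration=900 with lhotse's DynamicBucketingSampler, both from the startup dump, so batches of short utterances pack in many more cuts. A usage sketch under those config values; train_cuts stands in for the CutSet loaded by the data module:

```python
from lhotse.dataset import DynamicBucketingSampler

def make_train_sampler(train_cuts):
    # train_cuts: a lhotse CutSet (loaded elsewhere by asr_datamodule.py).
    return DynamicBucketingSampler(
        train_cuts,
        max_duration=900,   # cap on total seconds of audio per batch
        shuffle=True,
        num_buckets=30,     # duration buckets keep batches homogeneous
    )
```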
], batch size: 509, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 02:47:51,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84900.0, ans=0.1 2023-06-18 02:48:05,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-06-18 02:48:25,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-18 02:49:32,056 INFO [train.py:996] (3/4) Epoch 1, batch 14200, loss[loss=0.2819, simple_loss=0.3363, pruned_loss=0.1137, over 21664.00 frames. ], tot_loss[loss=0.336, simple_loss=0.3839, pruned_loss=0.144, over 4269501.02 frames. ], batch size: 230, lr: 3.08e-02, grad_scale: 16.0 2023-06-18 02:49:46,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=85200.0, ans=0.0 2023-06-18 02:50:21,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=85320.0, ans=0.0 2023-06-18 02:50:28,369 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 3.158e+02 3.617e+02 4.610e+02 1.061e+03, threshold=7.235e+02, percent-clipped=3.0 2023-06-18 02:50:31,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=85320.0, ans=0.125 2023-06-18 02:50:33,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=85380.0, ans=0.2 2023-06-18 02:51:31,297 INFO [train.py:996] (3/4) Epoch 1, batch 14250, loss[loss=0.2902, simple_loss=0.3628, pruned_loss=0.1088, over 20929.00 frames. ], tot_loss[loss=0.3313, simple_loss=0.3769, pruned_loss=0.1428, over 4264619.33 frames. ], batch size: 607, lr: 3.07e-02, grad_scale: 16.0 2023-06-18 02:51:43,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=85500.0, ans=0.2 2023-06-18 02:51:45,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=85500.0, ans=0.125 2023-06-18 02:51:51,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=85560.0, ans=0.125 2023-06-18 02:53:19,409 INFO [train.py:996] (3/4) Epoch 1, batch 14300, loss[loss=0.3215, simple_loss=0.3908, pruned_loss=0.1261, over 21587.00 frames. ], tot_loss[loss=0.3322, simple_loss=0.3797, pruned_loss=0.1423, over 4256054.98 frames. ], batch size: 230, lr: 3.07e-02, grad_scale: 16.0 2023-06-18 02:54:23,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.210e+02 4.117e+02 5.928e+02 1.578e+03, threshold=8.234e+02, percent-clipped=19.0 2023-06-18 02:54:52,190 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=22.5 2023-06-18 02:55:30,411 INFO [train.py:996] (3/4) Epoch 1, batch 14350, loss[loss=0.3939, simple_loss=0.415, pruned_loss=0.1864, over 21643.00 frames. ], tot_loss[loss=0.3352, simple_loss=0.3838, pruned_loss=0.1433, over 4245410.09 frames. 
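grad_scale in the batch headers is the fp16 loss-scale (use_fp16=True in the config); the drop from 32.0 to 16.0 at batch 14200 above is the scaler backing off after an overflowing gradient. A generic torch.cuda.amp sketch of the mechanism, not icefall's exact wiring:

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=1.0)

def training_step(model, optimizer, batch, criterion):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(batch))
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # skips the step on inf/nan gradients
    scaler.update()                # halves the scale after overflow,
    return loss                    # grows it again after stable steps
```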
], batch size: 471, lr: 3.06e-02, grad_scale: 16.0 2023-06-18 02:55:55,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=86160.0, ans=0.05 2023-06-18 02:56:54,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=86340.0, ans=0.1 2023-06-18 02:57:06,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=15.0 2023-06-18 02:57:12,504 INFO [train.py:996] (3/4) Epoch 1, batch 14400, loss[loss=0.3562, simple_loss=0.3793, pruned_loss=0.1666, over 21782.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3833, pruned_loss=0.1456, over 4256633.13 frames. ], batch size: 351, lr: 3.06e-02, grad_scale: 32.0 2023-06-18 02:57:17,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=86400.0, ans=0.125 2023-06-18 02:57:56,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.630e+02 4.233e+02 4.966e+02 8.953e+02, threshold=8.465e+02, percent-clipped=1.0 2023-06-18 02:59:04,506 INFO [train.py:996] (3/4) Epoch 1, batch 14450, loss[loss=0.2885, simple_loss=0.3271, pruned_loss=0.1249, over 21637.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.3776, pruned_loss=0.1455, over 4265274.81 frames. ], batch size: 247, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 02:59:14,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=86700.0, ans=0.0 2023-06-18 02:59:17,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=86760.0, ans=0.0 2023-06-18 02:59:40,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=86820.0, ans=0.2 2023-06-18 03:00:34,012 INFO [train.py:996] (3/4) Epoch 1, batch 14500, loss[loss=0.3089, simple_loss=0.3658, pruned_loss=0.126, over 21882.00 frames. ], tot_loss[loss=0.3298, simple_loss=0.3727, pruned_loss=0.1435, over 4259582.24 frames. ], batch size: 107, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 03:01:20,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=87120.0, ans=0.0 2023-06-18 03:01:23,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 3.289e+02 4.039e+02 4.928e+02 1.223e+03, threshold=8.079e+02, percent-clipped=4.0 2023-06-18 03:01:55,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=87180.0, ans=0.04949747468305833 2023-06-18 03:02:40,189 INFO [train.py:996] (3/4) Epoch 1, batch 14550, loss[loss=0.3764, simple_loss=0.4145, pruned_loss=0.1691, over 21202.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.3785, pruned_loss=0.145, over 4258752.87 frames. ], batch size: 143, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 03:03:17,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-18 03:03:25,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.98 vs. 
limit=15.0 2023-06-18 03:04:33,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.05 vs. limit=15.0 2023-06-18 03:04:37,687 INFO [train.py:996] (3/4) Epoch 1, batch 14600, loss[loss=0.3679, simple_loss=0.4083, pruned_loss=0.1637, over 21370.00 frames. ], tot_loss[loss=0.3465, simple_loss=0.3897, pruned_loss=0.1516, over 4260996.04 frames. ], batch size: 159, lr: 3.04e-02, grad_scale: 32.0 2023-06-18 03:04:56,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.43 vs. limit=6.0 2023-06-18 03:05:08,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=87660.0, ans=0.125 2023-06-18 03:05:23,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-18 03:05:37,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.259e+02 3.686e+02 4.597e+02 5.767e+02 8.268e+02, threshold=9.195e+02, percent-clipped=3.0 2023-06-18 03:05:57,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=87780.0, ans=0.125 2023-06-18 03:05:58,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=87780.0, ans=0.125 2023-06-18 03:06:11,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=87840.0, ans=0.2 2023-06-18 03:06:28,364 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:06:31,183 INFO [train.py:996] (3/4) Epoch 1, batch 14650, loss[loss=0.3491, simple_loss=0.4062, pruned_loss=0.146, over 19914.00 frames. ], tot_loss[loss=0.3453, simple_loss=0.3906, pruned_loss=0.15, over 4254393.10 frames. ], batch size: 702, lr: 3.04e-02, grad_scale: 32.0 2023-06-18 03:06:43,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=87900.0, ans=0.07 2023-06-18 03:06:43,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=87900.0, ans=0.125 2023-06-18 03:07:06,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=88020.0, ans=0.05 2023-06-18 03:07:12,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-18 03:08:23,523 INFO [train.py:996] (3/4) Epoch 1, batch 14700, loss[loss=0.2871, simple_loss=0.3537, pruned_loss=0.1103, over 21473.00 frames. ], tot_loss[loss=0.3282, simple_loss=0.3785, pruned_loss=0.1389, over 4254698.28 frames. 
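ScheduledFloat's batch_count runs about six times ahead of the train.py batch index (batch_count=87660 alongside batch 14600 above). The factor matches world_size * max_duration / ref_duration = 4 * 900 / 600 = 6 from the config, suggesting the schedule clock is normalized to a reference batch duration; the sketch below is that reading, not a quote of the recipe:

```python
def adjusted_batch_count(batch_idx_train: int, world_size: int = 4,
                         max_duration: float = 900.0,
                         ref_duration: float = 600.0) -> float:
    # Scale the raw batch index so schedules advance per ref_duration
    # seconds of audio seen, independent of batch size and world size.
    return batch_idx_train * (world_size * max_duration) / ref_duration

assert adjusted_batch_count(14600) == 87600.0  # ~ the logged batch_count
```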
], batch size: 211, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 03:09:37,809 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 3.037e+02 3.906e+02 4.878e+02 1.107e+03, threshold=7.811e+02, percent-clipped=1.0 2023-06-18 03:09:44,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=88380.0, ans=0.125 2023-06-18 03:10:10,997 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-18 03:10:24,997 INFO [train.py:996] (3/4) Epoch 1, batch 14750, loss[loss=0.4481, simple_loss=0.4895, pruned_loss=0.2034, over 21900.00 frames. ], tot_loss[loss=0.339, simple_loss=0.3871, pruned_loss=0.1455, over 4259932.87 frames. ], batch size: 372, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 03:12:47,661 INFO [train.py:996] (3/4) Epoch 1, batch 14800, loss[loss=0.3403, simple_loss=0.3819, pruned_loss=0.1494, over 21199.00 frames. ], tot_loss[loss=0.3518, simple_loss=0.3987, pruned_loss=0.1524, over 4261657.07 frames. ], batch size: 176, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 03:12:52,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=88800.0, ans=0.07 2023-06-18 03:13:00,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=88800.0, ans=0.04949747468305833 2023-06-18 03:13:20,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=88860.0, ans=0.125 2023-06-18 03:13:23,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88920.0, ans=0.1 2023-06-18 03:13:46,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.441e+02 3.653e+02 4.322e+02 5.427e+02 8.202e+02, threshold=8.644e+02, percent-clipped=2.0 2023-06-18 03:13:54,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=88980.0, ans=0.125 2023-06-18 03:14:47,072 INFO [train.py:996] (3/4) Epoch 1, batch 14850, loss[loss=0.324, simple_loss=0.3684, pruned_loss=0.1398, over 21544.00 frames. ], tot_loss[loss=0.3485, simple_loss=0.3937, pruned_loss=0.1516, over 4257600.43 frames. 
], batch size: 263, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 03:14:50,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=89100.0, ans=0.125 2023-06-18 03:15:20,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89160.0, ans=0.1 2023-06-18 03:15:27,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=89160.0, ans=0.2 2023-06-18 03:15:27,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=89160.0, ans=0.125 2023-06-18 03:15:27,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=89160.0, ans=0.125 2023-06-18 03:15:57,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=89160.0, ans=0.125 2023-06-18 03:16:12,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.91 vs. limit=6.0 2023-06-18 03:16:56,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.55 vs. limit=15.0 2023-06-18 03:16:56,802 INFO [train.py:996] (3/4) Epoch 1, batch 14900, loss[loss=0.3589, simple_loss=0.3965, pruned_loss=0.1607, over 22027.00 frames. ], tot_loss[loss=0.3536, simple_loss=0.3977, pruned_loss=0.1547, over 4254548.61 frames. ], batch size: 317, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 03:17:52,388 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.509e+02 4.418e+02 5.707e+02 9.536e+02, threshold=8.836e+02, percent-clipped=5.0 2023-06-18 03:18:40,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=89640.0, ans=10.0 2023-06-18 03:18:47,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=89640.0, ans=0.05 2023-06-18 03:19:03,618 INFO [train.py:996] (3/4) Epoch 1, batch 14950, loss[loss=0.3201, simple_loss=0.3819, pruned_loss=0.1291, over 21299.00 frames. ], tot_loss[loss=0.3524, simple_loss=0.3976, pruned_loss=0.1536, over 4260516.64 frames. ], batch size: 176, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 03:19:23,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=89700.0, ans=0.0 2023-06-18 03:19:53,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=89820.0, ans=0.0 2023-06-18 03:20:01,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-18 03:20:03,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=89820.0, ans=0.125 2023-06-18 03:20:52,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.25 vs. 
limit=22.5 2023-06-18 03:20:55,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=89940.0, ans=0.0 2023-06-18 03:21:01,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=89940.0, ans=0.0 2023-06-18 03:21:07,201 INFO [train.py:996] (3/4) Epoch 1, batch 15000, loss[loss=0.3519, simple_loss=0.3907, pruned_loss=0.1565, over 21558.00 frames. ], tot_loss[loss=0.3568, simple_loss=0.401, pruned_loss=0.1563, over 4265320.43 frames. ], batch size: 194, lr: 3.01e-02, grad_scale: 16.0 2023-06-18 03:21:07,202 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 03:21:55,986 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3047, simple_loss=0.3953, pruned_loss=0.107, over 1796401.00 frames. 2023-06-18 03:21:55,988 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-18 03:22:19,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=90060.0, ans=0.0 2023-06-18 03:22:25,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=90060.0, ans=0.125 2023-06-18 03:22:26,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=90060.0, ans=0.0 2023-06-18 03:22:48,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.680e+02 4.733e+02 5.350e+02 8.092e+02, threshold=9.466e+02, percent-clipped=0.0 2023-06-18 03:23:16,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=90180.0, ans=0.2 2023-06-18 03:23:35,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=90240.0, ans=0.0 2023-06-18 03:23:48,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90240.0, ans=0.1 2023-06-18 03:23:51,335 INFO [train.py:996] (3/4) Epoch 1, batch 15050, loss[loss=0.4221, simple_loss=0.4761, pruned_loss=0.1841, over 21519.00 frames. ], tot_loss[loss=0.3568, simple_loss=0.4006, pruned_loss=0.1565, over 4275495.67 frames. ], batch size: 471, lr: 3.01e-02, grad_scale: 16.0 2023-06-18 03:24:51,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=15.0 2023-06-18 03:25:32,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.90 vs. limit=10.0 2023-06-18 03:25:33,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-18 03:25:49,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=90540.0, ans=0.0 2023-06-18 03:26:03,915 INFO [train.py:996] (3/4) Epoch 1, batch 15100, loss[loss=0.3008, simple_loss=0.3441, pruned_loss=0.1288, over 16967.00 frames. ], tot_loss[loss=0.3549, simple_loss=0.4003, pruned_loss=0.1548, over 4264529.76 frames. 
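The validation pass at batch 15000 runs over the same fixed 1796401 frames as the epoch-start validation, so its loss=0.3047 is directly comparable across checkpoints, and the 23918MB line reports the CUDA allocation high-water mark. A hedged sketch of the loop shape; compute_loss is a stand-in for the recipe's loss function, assumed to return a per-frame average and the frame count:

```python
import torch

def compute_validation_loss(model, dev_loader, compute_loss):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item() * num_frames   # frame-weighted sum
            tot_frames += num_frames
    model.train()
    mem_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
    return tot_loss / tot_frames, mem_mb
```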
], batch size: 60, lr: 3.00e-02, grad_scale: 16.0 2023-06-18 03:26:05,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=90600.0, ans=0.125 2023-06-18 03:26:49,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-18 03:26:54,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.766e+02 4.918e+02 6.265e+02 9.726e+02, threshold=9.837e+02, percent-clipped=1.0 2023-06-18 03:27:45,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=90840.0, ans=0.125 2023-06-18 03:27:56,496 INFO [train.py:996] (3/4) Epoch 1, batch 15150, loss[loss=0.3158, simple_loss=0.3447, pruned_loss=0.1434, over 21624.00 frames. ], tot_loss[loss=0.355, simple_loss=0.3981, pruned_loss=0.156, over 4252083.99 frames. ], batch size: 231, lr: 3.00e-02, grad_scale: 16.0 2023-06-18 03:28:38,615 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-18 03:29:38,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=91140.0, ans=0.2 2023-06-18 03:30:02,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91200.0, ans=0.1 2023-06-18 03:30:03,110 INFO [train.py:996] (3/4) Epoch 1, batch 15200, loss[loss=0.2996, simple_loss=0.3465, pruned_loss=0.1263, over 21779.00 frames. ], tot_loss[loss=0.3442, simple_loss=0.3889, pruned_loss=0.1497, over 4247926.51 frames. ], batch size: 112, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 03:30:57,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 3.451e+02 4.223e+02 5.136e+02 8.420e+02, threshold=8.446e+02, percent-clipped=0.0 2023-06-18 03:30:59,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=91320.0, ans=0.05 2023-06-18 03:31:15,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=91380.0, ans=0.0 2023-06-18 03:31:35,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-18 03:31:52,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=91440.0, ans=0.1 2023-06-18 03:31:59,050 INFO [train.py:996] (3/4) Epoch 1, batch 15250, loss[loss=0.3087, simple_loss=0.3475, pruned_loss=0.1349, over 21693.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3811, pruned_loss=0.147, over 4245603.50 frames. ], batch size: 333, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 03:33:05,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=91620.0, ans=0.0 2023-06-18 03:33:25,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=91680.0, ans=0.125 2023-06-18 03:34:02,108 INFO [train.py:996] (3/4) Epoch 1, batch 15300, loss[loss=0.2913, simple_loss=0.3129, pruned_loss=0.1349, over 20843.00 frames. ], tot_loss[loss=0.344, simple_loss=0.3849, pruned_loss=0.1515, over 4252623.68 frames. 
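tot_loss in these headers is a running, frame-weighted average rather than a per-batch number: its weight mass sits near 4.25M frames, roughly reset_interval=200 batches of ~21.5k frames each, which is consistent with the running sums being decayed by (1 - 1/reset_interval) each batch. The tracker below is a sketch of that reading, with the decay mechanism an assumption:

```python
class RunningFrameStats:
    """Decayed, frame-weighted loss tracker: each batch the running sums
    are multiplied by (1 - 1/reset_interval) before the new batch's sums
    are added, so the effective window settles near reset_interval batches
    (~21.5k frames/batch * 200 = ~4.3M frames, matching the log)."""

    def __init__(self, reset_interval: int = 200):
        self.decay = 1.0 - 1.0 / reset_interval
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_sum: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss_sum
        self.frames = self.frames * self.decay + batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / self.frames
```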
], batch size: 609, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 03:34:16,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=91800.0, ans=0.125 2023-06-18 03:35:03,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91920.0, ans=0.1 2023-06-18 03:35:10,025 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.694e+02 4.588e+02 5.460e+02 1.157e+03, threshold=9.176e+02, percent-clipped=1.0 2023-06-18 03:36:05,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=92040.0, ans=0.125 2023-06-18 03:36:16,123 INFO [train.py:996] (3/4) Epoch 1, batch 15350, loss[loss=0.3684, simple_loss=0.4206, pruned_loss=0.1581, over 21891.00 frames. ], tot_loss[loss=0.3497, simple_loss=0.3906, pruned_loss=0.1544, over 4258364.15 frames. ], batch size: 371, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 03:36:20,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=92100.0, ans=0.125 2023-06-18 03:38:05,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=92340.0, ans=0.125 2023-06-18 03:38:13,657 INFO [train.py:996] (3/4) Epoch 1, batch 15400, loss[loss=0.3373, simple_loss=0.3817, pruned_loss=0.1465, over 21877.00 frames. ], tot_loss[loss=0.3463, simple_loss=0.3896, pruned_loss=0.1515, over 4270782.33 frames. ], batch size: 351, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 03:38:33,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=92460.0, ans=0.125 2023-06-18 03:38:35,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-18 03:38:44,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92460.0, ans=0.1 2023-06-18 03:39:06,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.578e+02 4.472e+02 5.706e+02 1.204e+03, threshold=8.945e+02, percent-clipped=6.0 2023-06-18 03:39:28,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-18 03:40:03,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=92700.0, ans=0.2 2023-06-18 03:40:04,075 INFO [train.py:996] (3/4) Epoch 1, batch 15450, loss[loss=0.3114, simple_loss=0.3815, pruned_loss=0.1206, over 21859.00 frames. ], tot_loss[loss=0.3431, simple_loss=0.3871, pruned_loss=0.1496, over 4267315.04 frames. ], batch size: 316, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 03:40:07,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=92700.0, ans=0.0 2023-06-18 03:41:11,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.29 vs. 
limit=10.0 2023-06-18 03:41:21,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=92940.0, ans=10.0 2023-06-18 03:41:26,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=92940.0, ans=0.0 2023-06-18 03:41:33,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=93000.0, ans=0.2 2023-06-18 03:41:34,863 INFO [train.py:996] (3/4) Epoch 1, batch 15500, loss[loss=0.418, simple_loss=0.4535, pruned_loss=0.1913, over 21746.00 frames. ], tot_loss[loss=0.3446, simple_loss=0.3904, pruned_loss=0.1494, over 4254687.43 frames. ], batch size: 441, lr: 2.97e-02, grad_scale: 16.0 2023-06-18 03:42:09,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=93120.0, ans=0.125 2023-06-18 03:42:38,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.182e+02 4.008e+02 5.609e+02 1.059e+03, threshold=8.016e+02, percent-clipped=3.0 2023-06-18 03:43:11,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.40 vs. limit=22.5 2023-06-18 03:43:41,840 INFO [train.py:996] (3/4) Epoch 1, batch 15550, loss[loss=0.3041, simple_loss=0.3677, pruned_loss=0.1203, over 21739.00 frames. ], tot_loss[loss=0.3397, simple_loss=0.3878, pruned_loss=0.1458, over 4256949.63 frames. ], batch size: 298, lr: 2.97e-02, grad_scale: 16.0 2023-06-18 03:44:21,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=93420.0, ans=0.0 2023-06-18 03:44:43,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.47 vs. limit=22.5 2023-06-18 03:45:13,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=93540.0, ans=0.0 2023-06-18 03:45:25,076 INFO [train.py:996] (3/4) Epoch 1, batch 15600, loss[loss=0.2957, simple_loss=0.3331, pruned_loss=0.1292, over 21433.00 frames. ], tot_loss[loss=0.336, simple_loss=0.3831, pruned_loss=0.1444, over 4257728.10 frames. ], batch size: 212, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 03:45:53,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=93660.0, ans=0.0 2023-06-18 03:46:02,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=93660.0, ans=0.125 2023-06-18 03:46:40,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 3.054e+02 4.153e+02 4.865e+02 7.806e+02, threshold=8.306e+02, percent-clipped=0.0 2023-06-18 03:47:09,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=93840.0, ans=0.5 2023-06-18 03:47:35,558 INFO [train.py:996] (3/4) Epoch 1, batch 15650, loss[loss=0.3018, simple_loss=0.3487, pruned_loss=0.1275, over 21653.00 frames. ], tot_loss[loss=0.3341, simple_loss=0.3815, pruned_loss=0.1433, over 4252071.66 frames. 
], batch size: 282, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 03:47:39,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=93900.0, ans=0.0 2023-06-18 03:47:39,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=93900.0, ans=0.2 2023-06-18 03:47:46,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93900.0, ans=0.1 2023-06-18 03:48:36,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=94080.0, ans=0.0 2023-06-18 03:49:29,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=94140.0, ans=0.2 2023-06-18 03:49:33,341 INFO [train.py:996] (3/4) Epoch 1, batch 15700, loss[loss=0.2935, simple_loss=0.335, pruned_loss=0.126, over 21409.00 frames. ], tot_loss[loss=0.3316, simple_loss=0.3781, pruned_loss=0.1425, over 4256935.34 frames. ], batch size: 131, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 03:50:06,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=94260.0, ans=0.2 2023-06-18 03:50:47,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.466e+02 4.238e+02 5.232e+02 9.594e+02, threshold=8.475e+02, percent-clipped=2.0 2023-06-18 03:50:48,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=94380.0, ans=0.0 2023-06-18 03:50:53,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=94380.0, ans=0.0 2023-06-18 03:51:17,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=94440.0, ans=0.125 2023-06-18 03:51:22,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94440.0, ans=0.1 2023-06-18 03:51:24,783 INFO [train.py:996] (3/4) Epoch 1, batch 15750, loss[loss=0.3232, simple_loss=0.3609, pruned_loss=0.1427, over 21712.00 frames. ], tot_loss[loss=0.3278, simple_loss=0.3724, pruned_loss=0.1416, over 4240696.80 frames. ], batch size: 282, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 03:52:09,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94620.0, ans=0.125 2023-06-18 03:52:17,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94620.0, ans=0.1 2023-06-18 03:52:42,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=94680.0, ans=0.125 2023-06-18 03:53:13,094 INFO [train.py:996] (3/4) Epoch 1, batch 15800, loss[loss=0.3221, simple_loss=0.3655, pruned_loss=0.1394, over 21668.00 frames. ], tot_loss[loss=0.3249, simple_loss=0.367, pruned_loss=0.1414, over 4235204.54 frames. 
], batch size: 332, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 03:54:21,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94920.0, ans=0.1 2023-06-18 03:54:24,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 3.502e+02 4.028e+02 5.074e+02 9.979e+02, threshold=8.057e+02, percent-clipped=5.0 2023-06-18 03:54:43,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=94980.0, ans=0.125 2023-06-18 03:55:22,919 INFO [train.py:996] (3/4) Epoch 1, batch 15850, loss[loss=0.2937, simple_loss=0.3424, pruned_loss=0.1225, over 21773.00 frames. ], tot_loss[loss=0.3274, simple_loss=0.3684, pruned_loss=0.1432, over 4236425.92 frames. ], batch size: 124, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 03:55:39,597 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.90 vs. limit=10.0 2023-06-18 03:55:50,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.66 vs. limit=6.0 2023-06-18 03:56:19,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=95220.0, ans=0.0 2023-06-18 03:56:19,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=95220.0, ans=0.2 2023-06-18 03:57:41,512 INFO [train.py:996] (3/4) Epoch 1, batch 15900, loss[loss=0.2949, simple_loss=0.3326, pruned_loss=0.1286, over 21083.00 frames. ], tot_loss[loss=0.3262, simple_loss=0.3661, pruned_loss=0.1431, over 4244802.41 frames. ], batch size: 143, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 03:58:27,153 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=22.5 2023-06-18 03:58:39,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 3.625e+02 4.216e+02 5.103e+02 7.976e+02, threshold=8.431e+02, percent-clipped=0.0 2023-06-18 03:59:18,960 INFO [train.py:996] (3/4) Epoch 1, batch 15950, loss[loss=0.3139, simple_loss=0.3696, pruned_loss=0.1291, over 21672.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3669, pruned_loss=0.1398, over 4250030.26 frames. ], batch size: 298, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 03:59:39,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=95760.0, ans=0.0 2023-06-18 04:00:19,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=95880.0, ans=0.0 2023-06-18 04:00:20,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=95880.0, ans=0.125 2023-06-18 04:00:30,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.64 vs. limit=22.5 2023-06-18 04:00:40,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=95940.0, ans=0.125 2023-06-18 04:00:55,804 INFO [train.py:996] (3/4) Epoch 1, batch 16000, loss[loss=0.3232, simple_loss=0.377, pruned_loss=0.1347, over 21829.00 frames. 
], tot_loss[loss=0.3176, simple_loss=0.3653, pruned_loss=0.1349, over 4246217.98 frames. ], batch size: 351, lr: 2.93e-02, grad_scale: 32.0
2023-06-18 04:00:56,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.19 vs. limit=6.0
2023-06-18 04:01:49,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.964e+02 3.608e+02 4.425e+02 8.344e+02, threshold=7.217e+02, percent-clipped=0.0
2023-06-18 04:02:14,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.77 vs. limit=6.0
2023-06-18 04:02:22,050 INFO [train.py:996] (3/4) Epoch 1, batch 16050, loss[loss=0.2635, simple_loss=0.3241, pruned_loss=0.1014, over 21432.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3646, pruned_loss=0.13, over 4254809.73 frames. ], batch size: 131, lr: 2.93e-02, grad_scale: 32.0
2023-06-18 04:03:36,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96420.0, ans=0.1
2023-06-18 04:03:53,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=96480.0, ans=0.2
2023-06-18 04:03:54,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=96480.0, ans=0.2
2023-06-18 04:04:15,095 INFO [train.py:996] (3/4) Epoch 1, batch 16100, loss[loss=0.3687, simple_loss=0.3947, pruned_loss=0.1714, over 21520.00 frames. ], tot_loss[loss=0.3191, simple_loss=0.371, pruned_loss=0.1336, over 4269018.05 frames. ], batch size: 548, lr: 2.92e-02, grad_scale: 32.0
2023-06-18 04:04:50,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96660.0, ans=0.1
2023-06-18 04:05:38,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 3.663e+02 4.522e+02 5.171e+02 8.163e+02, threshold=9.044e+02, percent-clipped=3.0
2023-06-18 04:05:47,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=96780.0, ans=0.0
2023-06-18 04:06:10,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=96900.0, ans=0.125
2023-06-18 04:06:11,405 INFO [train.py:996] (3/4) Epoch 1, batch 16150, loss[loss=0.3317, simple_loss=0.382, pruned_loss=0.1406, over 21598.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3737, pruned_loss=0.1368, over 4278173.47 frames. ], batch size: 195, lr: 2.92e-02, grad_scale: 32.0
2023-06-18 04:07:45,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=97080.0, ans=0.2
2023-06-18 04:08:05,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=97140.0, ans=0.125
2023-06-18 04:08:24,767 INFO [train.py:996] (3/4) Epoch 1, batch 16200, loss[loss=0.4746, simple_loss=0.4762, pruned_loss=0.2364, over 21353.00 frames. ], tot_loss[loss=0.3307, simple_loss=0.3794, pruned_loss=0.141, over 4279216.53 frames. ], batch size: 507, lr: 2.92e-02, grad_scale: 32.0
2023-06-18 04:08:40,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=97200.0, ans=0.125
2023-06-18 04:09:14,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=97320.0, ans=0.0
2023-06-18 04:09:41,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 3.302e+02 3.926e+02 4.944e+02 8.800e+02, threshold=7.851e+02, percent-clipped=0.0
2023-06-18 04:10:00,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0
2023-06-18 04:10:19,004 INFO [train.py:996] (3/4) Epoch 1, batch 16250, loss[loss=0.2679, simple_loss=0.3173, pruned_loss=0.1092, over 21457.00 frames. ], tot_loss[loss=0.3327, simple_loss=0.381, pruned_loss=0.1422, over 4277866.21 frames. ], batch size: 212, lr: 2.91e-02, grad_scale: 32.0
2023-06-18 04:10:20,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=97500.0, ans=0.0
2023-06-18 04:11:00,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=97560.0, ans=0.125
2023-06-18 04:11:32,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=97620.0, ans=0.0
2023-06-18 04:11:55,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=97740.0, ans=0.125
2023-06-18 04:12:18,163 INFO [train.py:996] (3/4) Epoch 1, batch 16300, loss[loss=0.2826, simple_loss=0.3545, pruned_loss=0.1054, over 21217.00 frames. ], tot_loss[loss=0.324, simple_loss=0.3747, pruned_loss=0.1366, over 4274660.26 frames. ], batch size: 548, lr: 2.91e-02, grad_scale: 32.0
2023-06-18 04:13:19,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.969e+02 3.585e+02 4.834e+02 8.506e+02, threshold=7.169e+02, percent-clipped=1.0
2023-06-18 04:13:50,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=97980.0, ans=0.125
2023-06-18 04:14:26,805 INFO [train.py:996] (3/4) Epoch 1, batch 16350, loss[loss=0.3506, simple_loss=0.3757, pruned_loss=0.1628, over 20193.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3736, pruned_loss=0.1372, over 4275956.63 frames. ], batch size: 707, lr: 2.91e-02, grad_scale: 32.0
2023-06-18 04:15:57,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=98280.0, ans=0.125
2023-06-18 04:16:37,186 INFO [train.py:996] (3/4) Epoch 1, batch 16400, loss[loss=0.3803, simple_loss=0.4184, pruned_loss=0.1711, over 21696.00 frames. ], tot_loss[loss=0.3289, simple_loss=0.3782, pruned_loss=0.1398, over 4273353.24 frames. ], batch size: 389, lr: 2.90e-02, grad_scale: 32.0
2023-06-18 04:16:51,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=98400.0, ans=0.0
2023-06-18 04:17:05,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0
2023-06-18 04:17:44,959 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.149e+02 3.641e+02 4.935e+02 8.709e+02, threshold=7.281e+02, percent-clipped=2.0
2023-06-18 04:18:58,265 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 04:18:59,110 INFO [train.py:996] (3/4) Epoch 1, batch 16450, loss[loss=0.3574, simple_loss=0.3965, pruned_loss=0.1591, over 20674.00 frames. ], tot_loss[loss=0.3294, simple_loss=0.3777, pruned_loss=0.1406, over 4277000.81 frames. ], batch size: 607, lr: 2.90e-02, grad_scale: 32.0
2023-06-18 04:19:34,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=98760.0, ans=0.0
2023-06-18 04:19:35,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=98760.0, ans=0.125
2023-06-18 04:19:56,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=98820.0, ans=10.0
2023-06-18 04:19:56,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=98820.0, ans=0.2
2023-06-18 04:19:59,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=98820.0, ans=0.0
2023-06-18 04:20:54,863 INFO [train.py:996] (3/4) Epoch 1, batch 16500, loss[loss=0.2703, simple_loss=0.3183, pruned_loss=0.1112, over 21746.00 frames. ], tot_loss[loss=0.328, simple_loss=0.3753, pruned_loss=0.1403, over 4279366.85 frames. ], batch size: 247, lr: 2.89e-02, grad_scale: 32.0
2023-06-18 04:21:13,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=99000.0, ans=0.125
2023-06-18 04:21:14,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=99000.0, ans=0.2
2023-06-18 04:21:18,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0
2023-06-18 04:21:33,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.07 vs. limit=15.0
2023-06-18 04:22:03,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=99120.0, ans=0.0
2023-06-18 04:22:25,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.246e+02 4.122e+02 4.972e+02 8.514e+02, threshold=8.244e+02, percent-clipped=2.0
2023-06-18 04:22:55,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=99240.0, ans=0.5
2023-06-18 04:23:26,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=99240.0, ans=0.125
2023-06-18 04:23:43,058 INFO [train.py:996] (3/4) Epoch 1, batch 16550, loss[loss=0.2971, simple_loss=0.3678, pruned_loss=0.1132, over 21879.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.3698, pruned_loss=0.1355, over 4279492.54 frames. ], batch size: 316, lr: 2.89e-02, grad_scale: 32.0
2023-06-18 04:25:35,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=99540.0, ans=0.0
2023-06-18 04:25:39,586 INFO [train.py:996] (3/4) Epoch 1, batch 16600, loss[loss=0.3518, simple_loss=0.4222, pruned_loss=0.1407, over 21617.00 frames. ], tot_loss[loss=0.3317, simple_loss=0.381, pruned_loss=0.1413, over 4265312.35 frames. ], batch size: 263, lr: 2.89e-02, grad_scale: 32.0
2023-06-18 04:25:47,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=99600.0, ans=0.125
2023-06-18 04:25:52,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=99600.0, ans=0.125
2023-06-18 04:26:41,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=99720.0, ans=0.125
2023-06-18 04:26:45,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.572e+02 4.785e+02 5.858e+02 1.029e+03, threshold=9.570e+02, percent-clipped=5.0
2023-06-18 04:27:21,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99840.0, ans=0.1
2023-06-18 04:27:29,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=99840.0, ans=0.125
2023-06-18 04:27:34,747 INFO [train.py:996] (3/4) Epoch 1, batch 16650, loss[loss=0.4416, simple_loss=0.464, pruned_loss=0.2095, over 21455.00 frames. ], tot_loss[loss=0.3452, simple_loss=0.3954, pruned_loss=0.1475, over 4266625.11 frames. ], batch size: 471, lr: 2.88e-02, grad_scale: 32.0
2023-06-18 04:27:35,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.31 vs. limit=15.0
2023-06-18 04:28:22,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=99960.0, ans=0.125
2023-06-18 04:28:23,291 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0
2023-06-18 04:28:57,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=26.27 vs. limit=22.5
2023-06-18 04:28:57,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=100020.0, ans=0.125
2023-06-18 04:28:57,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=100020.0, ans=0.0
2023-06-18 04:29:38,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=100140.0, ans=0.125
2023-06-18 04:29:51,754 INFO [train.py:996] (3/4) Epoch 1, batch 16700, loss[loss=0.3285, simple_loss=0.3892, pruned_loss=0.1339, over 21886.00 frames. ], tot_loss[loss=0.3441, simple_loss=0.3947, pruned_loss=0.1468, over 4265046.89 frames. ], batch size: 372, lr: 2.88e-02, grad_scale: 32.0
2023-06-18 04:31:04,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100320.0, ans=0.1
2023-06-18 04:31:12,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.672e+02 3.595e+02 4.250e+02 5.162e+02 1.058e+03, threshold=8.499e+02, percent-clipped=1.0
2023-06-18 04:32:38,956 INFO [train.py:996] (3/4) Epoch 1, batch 16750, loss[loss=0.3727, simple_loss=0.4014, pruned_loss=0.172, over 21376.00 frames. ], tot_loss[loss=0.3491, simple_loss=0.3975, pruned_loss=0.1504, over 4266798.89 frames. ], batch size: 549, lr: 2.88e-02, grad_scale: 32.0
2023-06-18 04:33:05,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=100560.0, ans=0.0
2023-06-18 04:33:08,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=100560.0, ans=0.125
2023-06-18 04:33:42,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0
2023-06-18 04:34:21,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=100680.0, ans=0.0
2023-06-18 04:35:07,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=100740.0, ans=0.0
2023-06-18 04:35:18,803 INFO [train.py:996] (3/4) Epoch 1, batch 16800, loss[loss=0.3619, simple_loss=0.3965, pruned_loss=0.1637, over 21864.00 frames. ], tot_loss[loss=0.3561, simple_loss=0.4072, pruned_loss=0.1524, over 4263044.33 frames. ], batch size: 351, lr: 2.87e-02, grad_scale: 32.0
2023-06-18 04:35:43,121 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 04:36:05,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.666e+02 3.547e+02 4.070e+02 4.993e+02 8.656e+02, threshold=8.140e+02, percent-clipped=1.0
2023-06-18 04:36:13,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=100980.0, ans=0.125
2023-06-18 04:36:27,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=100980.0, ans=0.5
2023-06-18 04:37:02,095 INFO [train.py:996] (3/4) Epoch 1, batch 16850, loss[loss=0.3828, simple_loss=0.4043, pruned_loss=0.1807, over 21658.00 frames. ], tot_loss[loss=0.3541, simple_loss=0.4033, pruned_loss=0.1524, over 4271271.68 frames. ], batch size: 471, lr: 2.87e-02, grad_scale: 32.0
2023-06-18 04:37:16,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0
2023-06-18 04:37:23,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0
2023-06-18 04:38:32,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.05 vs. limit=6.0
2023-06-18 04:39:12,671 INFO [train.py:996] (3/4) Epoch 1, batch 16900, loss[loss=0.2863, simple_loss=0.3323, pruned_loss=0.1202, over 21677.00 frames. ], tot_loss[loss=0.3473, simple_loss=0.3954, pruned_loss=0.1496, over 4276770.49 frames. ], batch size: 298, lr: 2.87e-02, grad_scale: 32.0
2023-06-18 04:39:18,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=101400.0, ans=0.07
2023-06-18 04:39:18,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=101400.0, ans=0.125
2023-06-18 04:39:26,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=101460.0, ans=0.0
2023-06-18 04:39:26,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0
2023-06-18 04:40:16,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.290e+02 3.976e+02 5.128e+02 6.971e+02, threshold=7.952e+02, percent-clipped=0.0
2023-06-18 04:40:34,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.77 vs. limit=22.5
2023-06-18 04:40:34,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0
2023-06-18 04:40:53,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=101640.0, ans=0.125
2023-06-18 04:41:08,118 INFO [train.py:996] (3/4) Epoch 1, batch 16950, loss[loss=0.3941, simple_loss=0.403, pruned_loss=0.1926, over 21752.00 frames. ], tot_loss[loss=0.3408, simple_loss=0.3871, pruned_loss=0.1472, over 4278836.75 frames. ], batch size: 508, lr: 2.86e-02, grad_scale: 32.0
2023-06-18 04:43:28,696 INFO [train.py:996] (3/4) Epoch 1, batch 17000, loss[loss=0.3889, simple_loss=0.4085, pruned_loss=0.1846, over 21798.00 frames. ], tot_loss[loss=0.3399, simple_loss=0.3836, pruned_loss=0.1481, over 4286280.16 frames. ], batch size: 441, lr: 2.86e-02, grad_scale: 32.0
2023-06-18 04:43:36,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=102000.0, ans=0.125
2023-06-18 04:44:21,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=102120.0, ans=0.025
2023-06-18 04:44:39,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 3.298e+02 3.924e+02 4.876e+02 1.271e+03, threshold=7.848e+02, percent-clipped=1.0
2023-06-18 04:45:36,796 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.39 vs. limit=15.0
2023-06-18 04:45:43,376 INFO [train.py:996] (3/4) Epoch 1, batch 17050, loss[loss=0.3574, simple_loss=0.4177, pruned_loss=0.1486, over 21875.00 frames. ], tot_loss[loss=0.3457, simple_loss=0.3889, pruned_loss=0.1512, over 4281989.37 frames. ], batch size: 316, lr: 2.86e-02, grad_scale: 32.0
2023-06-18 04:46:01,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=102360.0, ans=0.2
2023-06-18 04:46:16,971 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0
2023-06-18 04:47:03,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=102480.0, ans=0.125
2023-06-18 04:47:05,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.13 vs. limit=5.0
2023-06-18 04:47:22,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0
2023-06-18 04:47:23,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=102540.0, ans=0.0
2023-06-18 04:47:27,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=102540.0, ans=0.125
2023-06-18 04:47:28,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=102540.0, ans=0.0
2023-06-18 04:47:43,114 INFO [train.py:996] (3/4) Epoch 1, batch 17100, loss[loss=0.3151, simple_loss=0.3535, pruned_loss=0.1383, over 21563.00 frames. ], tot_loss[loss=0.3456, simple_loss=0.3889, pruned_loss=0.1511, over 4281408.81 frames. ], batch size: 195, lr: 2.85e-02, grad_scale: 32.0
2023-06-18 04:49:04,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=102720.0, ans=0.125
2023-06-18 04:49:06,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.244e+02 3.266e+02 3.852e+02 4.929e+02 1.111e+03, threshold=7.703e+02, percent-clipped=4.0
2023-06-18 04:49:30,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102780.0, ans=0.1
2023-06-18 04:49:35,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.54 vs. limit=8.0
2023-06-18 04:49:55,672 INFO [train.py:996] (3/4) Epoch 1, batch 17150, loss[loss=0.2779, simple_loss=0.3427, pruned_loss=0.1065, over 21872.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3841, pruned_loss=0.1497, over 4277459.89 frames. ], batch size: 371, lr: 2.85e-02, grad_scale: 32.0
2023-06-18 04:50:03,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102900.0, ans=0.1
2023-06-18 04:51:13,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103080.0, ans=0.1
2023-06-18 04:51:36,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0
2023-06-18 04:51:44,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=103140.0, ans=0.125
2023-06-18 04:51:49,929 INFO [train.py:996] (3/4) Epoch 1, batch 17200, loss[loss=0.4516, simple_loss=0.4595, pruned_loss=0.2218, over 21333.00 frames. ], tot_loss[loss=0.3408, simple_loss=0.3839, pruned_loss=0.1489, over 4278221.37 frames. ], batch size: 507, lr: 2.84e-02, grad_scale: 32.0
2023-06-18 04:53:06,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 3.334e+02 4.026e+02 5.110e+02 1.056e+03, threshold=8.051e+02, percent-clipped=6.0
2023-06-18 04:53:25,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=103380.0, ans=0.125
2023-06-18 04:54:01,780 INFO [train.py:996] (3/4) Epoch 1, batch 17250, loss[loss=0.3502, simple_loss=0.4202, pruned_loss=0.1401, over 16946.00 frames. ], tot_loss[loss=0.3455, simple_loss=0.3887, pruned_loss=0.1512, over 4265094.16 frames. ], batch size: 60, lr: 2.84e-02, grad_scale: 32.0
2023-06-18 04:54:02,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=103500.0, ans=0.0
2023-06-18 04:54:12,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=103500.0, ans=0.125
2023-06-18 04:54:59,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=103560.0, ans=0.125
2023-06-18 04:55:58,634 INFO [train.py:996] (3/4) Epoch 1, batch 17300, loss[loss=0.3757, simple_loss=0.4174, pruned_loss=0.167, over 21493.00 frames. ], tot_loss[loss=0.3541, simple_loss=0.3972, pruned_loss=0.1555, over 4266047.32 frames. ], batch size: 131, lr: 2.84e-02, grad_scale: 32.0
2023-06-18 04:56:21,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.70 vs. limit=15.0
2023-06-18 04:57:28,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.957e+02 4.761e+02 5.817e+02 1.132e+03, threshold=9.521e+02, percent-clipped=5.0
2023-06-18 04:58:01,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=104040.0, ans=0.04949747468305833
2023-06-18 04:58:17,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=104040.0, ans=0.0
2023-06-18 04:58:23,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=104040.0, ans=0.125
2023-06-18 04:58:35,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=104040.0, ans=0.125
2023-06-18 04:58:35,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=104040.0, ans=0.125
2023-06-18 04:58:39,401 INFO [train.py:996] (3/4) Epoch 1, batch 17350, loss[loss=0.3524, simple_loss=0.4245, pruned_loss=0.1402, over 21239.00 frames. ], tot_loss[loss=0.3531, simple_loss=0.3968, pruned_loss=0.1547, over 4260308.65 frames. ], batch size: 548, lr: 2.83e-02, grad_scale: 32.0
2023-06-18 04:58:44,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=104100.0, ans=0.125
2023-06-18 04:59:08,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=104160.0, ans=0.2
2023-06-18 04:59:30,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0
2023-06-18 05:00:36,713 INFO [train.py:996] (3/4) Epoch 1, batch 17400, loss[loss=0.254, simple_loss=0.2925, pruned_loss=0.1077, over 21228.00 frames. ], tot_loss[loss=0.3456, simple_loss=0.392, pruned_loss=0.1496, over 4262627.75 frames. ], batch size: 143, lr: 2.83e-02, grad_scale: 32.0
2023-06-18 05:02:13,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.697e+02 4.904e+02 6.158e+02 8.783e+02, threshold=9.807e+02, percent-clipped=0.0
2023-06-18 05:02:32,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=104580.0, ans=0.0
2023-06-18 05:02:59,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=104640.0, ans=0.2
2023-06-18 05:03:10,193 INFO [train.py:996] (3/4) Epoch 1, batch 17450, loss[loss=0.2604, simple_loss=0.3399, pruned_loss=0.09041, over 21737.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3848, pruned_loss=0.1431, over 4268920.69 frames. ], batch size: 351, lr: 2.83e-02, grad_scale: 32.0
2023-06-18 05:03:21,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=104700.0, ans=0.125
2023-06-18 05:03:47,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=104760.0, ans=0.0
2023-06-18 05:04:03,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104760.0, ans=0.1
2023-06-18 05:04:53,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=104880.0, ans=0.07
2023-06-18 05:05:15,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=104940.0, ans=0.07
2023-06-18 05:05:29,161 INFO [train.py:996] (3/4) Epoch 1, batch 17500, loss[loss=0.3499, simple_loss=0.3863, pruned_loss=0.1568, over 21465.00 frames. ], tot_loss[loss=0.3267, simple_loss=0.378, pruned_loss=0.1377, over 4269405.15 frames. ], batch size: 548, lr: 2.82e-02, grad_scale: 64.0
2023-06-18 05:05:55,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=105060.0, ans=0.2
2023-06-18 05:06:12,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0
2023-06-18 05:06:14,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105120.0, ans=0.1
2023-06-18 05:06:23,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.676e+02 3.184e+02 3.972e+02 6.733e+02, threshold=6.368e+02, percent-clipped=0.0
2023-06-18 05:06:29,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=105180.0, ans=0.0
2023-06-18 05:07:10,588 INFO [train.py:996] (3/4) Epoch 1, batch 17550, loss[loss=0.2883, simple_loss=0.3606, pruned_loss=0.108, over 21272.00 frames. ], tot_loss[loss=0.3256, simple_loss=0.3782, pruned_loss=0.1364, over 4271387.65 frames. ], batch size: 143, lr: 2.82e-02, grad_scale: 32.0
2023-06-18 05:07:31,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=15.0
2023-06-18 05:07:39,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=105360.0, ans=0.0
2023-06-18 05:08:08,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=105420.0, ans=0.0
2023-06-18 05:09:08,739 INFO [train.py:996] (3/4) Epoch 1, batch 17600, loss[loss=0.3149, simple_loss=0.3785, pruned_loss=0.1256, over 21281.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3793, pruned_loss=0.1361, over 4272513.77 frames. ], batch size: 143, lr: 2.82e-02, grad_scale: 32.0
2023-06-18 05:10:18,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=105720.0, ans=0.0
2023-06-18 05:10:23,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 3.173e+02 4.298e+02 5.550e+02 1.174e+03, threshold=8.596e+02, percent-clipped=15.0
2023-06-18 05:10:47,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105840.0, ans=0.1
2023-06-18 05:10:59,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=105840.0, ans=0.2
2023-06-18 05:11:05,584 INFO [train.py:996] (3/4) Epoch 1, batch 17650, loss[loss=0.2413, simple_loss=0.2901, pruned_loss=0.09624, over 21578.00 frames. ], tot_loss[loss=0.3249, simple_loss=0.3767, pruned_loss=0.1365, over 4263658.63 frames. ], batch size: 230, lr: 2.81e-02, grad_scale: 32.0
2023-06-18 05:11:12,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=105900.0, ans=0.125
2023-06-18 05:11:25,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=105960.0, ans=0.0
2023-06-18 05:11:47,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=106020.0, ans=0.09899494936611666
2023-06-18 05:12:07,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=106080.0, ans=0.0
2023-06-18 05:12:19,728 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 05:12:32,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=106140.0, ans=0.125
2023-06-18 05:12:42,037 INFO [train.py:996] (3/4) Epoch 1, batch 17700, loss[loss=0.3516, simple_loss=0.4118, pruned_loss=0.1457, over 20788.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.3673, pruned_loss=0.1309, over 4269181.70 frames. ], batch size: 609, lr: 2.81e-02, grad_scale: 32.0
2023-06-18 05:13:30,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=106260.0, ans=0.125
2023-06-18 05:13:49,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=106320.0, ans=0.2
2023-06-18 05:14:06,836 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 3.144e+02 3.611e+02 4.728e+02 8.023e+02, threshold=7.222e+02, percent-clipped=0.0
2023-06-18 05:14:44,162 INFO [train.py:996] (3/4) Epoch 1, batch 17750, loss[loss=0.3623, simple_loss=0.4112, pruned_loss=0.1567, over 21722.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.379, pruned_loss=0.138, over 4272510.67 frames. ], batch size: 298, lr: 2.81e-02, grad_scale: 32.0
2023-06-18 05:15:36,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0
2023-06-18 05:15:40,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=106560.0, ans=0.2
2023-06-18 05:16:14,255 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 05:16:17,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=106680.0, ans=0.125
2023-06-18 05:16:29,994 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 05:16:35,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=106740.0, ans=0.95
2023-06-18 05:16:41,540 INFO [train.py:996] (3/4) Epoch 1, batch 17800, loss[loss=0.3442, simple_loss=0.392, pruned_loss=0.1482, over 21859.00 frames. ], tot_loss[loss=0.3264, simple_loss=0.3786, pruned_loss=0.1371, over 4271827.82 frames. ], batch size: 372, lr: 2.80e-02, grad_scale: 32.0
2023-06-18 05:18:10,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=106920.0, ans=0.125
2023-06-18 05:18:11,425 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.015e+02 3.874e+02 4.676e+02 8.507e+02, threshold=7.748e+02, percent-clipped=1.0
2023-06-18 05:18:44,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=22.5
2023-06-18 05:18:53,251 INFO [train.py:996] (3/4) Epoch 1, batch 17850, loss[loss=0.3535, simple_loss=0.3898, pruned_loss=0.1586, over 21816.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3774, pruned_loss=0.1366, over 4273474.12 frames. ], batch size: 282, lr: 2.80e-02, grad_scale: 32.0
2023-06-18 05:19:23,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107160.0, ans=0.1
2023-06-18 05:21:17,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0
2023-06-18 05:21:23,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=107400.0, ans=0.05
2023-06-18 05:21:24,302 INFO [train.py:996] (3/4) Epoch 1, batch 17900, loss[loss=0.3238, simple_loss=0.3938, pruned_loss=0.1269, over 21640.00 frames. ], tot_loss[loss=0.3337, simple_loss=0.3857, pruned_loss=0.1408, over 4277479.84 frames. ], batch size: 263, lr: 2.80e-02, grad_scale: 32.0
2023-06-18 05:22:43,627 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 3.224e+02 3.725e+02 5.130e+02 9.496e+02, threshold=7.451e+02, percent-clipped=4.0
2023-06-18 05:23:10,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.79 vs. limit=10.0
2023-06-18 05:23:46,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=107640.0, ans=0.0
2023-06-18 05:23:50,221 INFO [train.py:996] (3/4) Epoch 1, batch 17950, loss[loss=0.207, simple_loss=0.2564, pruned_loss=0.07873, over 15945.00 frames. ], tot_loss[loss=0.3265, simple_loss=0.3829, pruned_loss=0.1351, over 4269595.31 frames. ], batch size: 60, lr: 2.79e-02, grad_scale: 32.0
2023-06-18 05:24:47,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=107820.0, ans=0.125
2023-06-18 05:25:21,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=107880.0, ans=0.125
2023-06-18 05:25:57,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=108000.0, ans=0.125
2023-06-18 05:25:58,835 INFO [train.py:996] (3/4) Epoch 1, batch 18000, loss[loss=0.3846, simple_loss=0.3783, pruned_loss=0.1954, over 21400.00 frames. ], tot_loss[loss=0.3217, simple_loss=0.3758, pruned_loss=0.1338, over 4265522.78 frames. ], batch size: 509, lr: 2.79e-02, grad_scale: 32.0
2023-06-18 05:25:58,835 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-18 05:26:54,357 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3106, simple_loss=0.4066, pruned_loss=0.1073, over 1796401.00 frames.
2023-06-18 05:26:54,358 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-18 05:27:11,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=108060.0, ans=0.0
2023-06-18 05:27:33,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=108120.0, ans=0.125
2023-06-18 05:27:47,986 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.259e+02 3.858e+02 4.507e+02 8.062e+02, threshold=7.716e+02, percent-clipped=1.0
2023-06-18 05:28:30,789 INFO [train.py:996] (3/4) Epoch 1, batch 18050, loss[loss=0.3397, simple_loss=0.3803, pruned_loss=0.1496, over 21403.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3697, pruned_loss=0.133, over 4260860.84 frames. ], batch size: 131, lr: 2.79e-02, grad_scale: 32.0
2023-06-18 05:28:37,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0
2023-06-18 05:28:49,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=108360.0, ans=0.0
2023-06-18 05:29:56,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108480.0, ans=0.1
2023-06-18 05:29:59,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=108480.0, ans=0.125
2023-06-18 05:30:04,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=108480.0, ans=0.125
2023-06-18 05:30:12,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0
2023-06-18 05:30:34,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=108540.0, ans=0.0
2023-06-18 05:30:51,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=108600.0, ans=0.125
2023-06-18 05:30:52,499 INFO [train.py:996] (3/4) Epoch 1, batch 18100, loss[loss=0.3188, simple_loss=0.3944, pruned_loss=0.1216, over 21710.00 frames. ], tot_loss[loss=0.3288, simple_loss=0.3792, pruned_loss=0.1392, over 4269021.17 frames. ], batch size: 298, lr: 2.78e-02, grad_scale: 32.0
2023-06-18 05:31:08,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=108660.0, ans=0.0
2023-06-18 05:31:43,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=108660.0, ans=0.125
2023-06-18 05:31:45,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=108720.0, ans=0.2
2023-06-18 05:32:16,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=108720.0, ans=0.125
2023-06-18 05:32:17,145 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.257e+02 4.345e+02 5.048e+02 8.084e+02, threshold=8.690e+02, percent-clipped=1.0
2023-06-18 05:32:20,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=108780.0, ans=0.125
2023-06-18 05:32:23,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=108780.0, ans=0.125
2023-06-18 05:33:05,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=108840.0, ans=0.125
2023-06-18 05:33:28,594 INFO [train.py:996] (3/4) Epoch 1, batch 18150, loss[loss=0.2831, simple_loss=0.317, pruned_loss=0.1246, over 15639.00 frames. ], tot_loss[loss=0.3291, simple_loss=0.3807, pruned_loss=0.1387, over 4264162.10 frames. ], batch size: 60, lr: 2.78e-02, grad_scale: 32.0
2023-06-18 05:34:05,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=108960.0, ans=0.04949747468305833
2023-06-18 05:34:13,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=109020.0, ans=0.125
2023-06-18 05:34:42,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0
2023-06-18 05:34:43,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=109080.0, ans=0.125
2023-06-18 05:35:07,211 INFO [train.py:996] (3/4) Epoch 1, batch 18200, loss[loss=0.2988, simple_loss=0.3364, pruned_loss=0.1305, over 21861.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3739, pruned_loss=0.1384, over 4265254.04 frames. ], batch size: 107, lr: 2.78e-02, grad_scale: 32.0
2023-06-18 05:35:25,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=109260.0, ans=0.035
2023-06-18 05:35:50,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.14 vs. limit=22.5
2023-06-18 05:36:02,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=109320.0, ans=0.125
2023-06-18 05:36:12,730 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 3.177e+02 3.790e+02 4.775e+02 7.519e+02, threshold=7.579e+02, percent-clipped=0.0
2023-06-18 05:37:00,792 INFO [train.py:996] (3/4) Epoch 1, batch 18250, loss[loss=0.2382, simple_loss=0.2978, pruned_loss=0.08924, over 21430.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3623, pruned_loss=0.1321, over 4262794.64 frames. ], batch size: 194, lr: 2.77e-02, grad_scale: 32.0
2023-06-18 05:38:03,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=109620.0, ans=0.0
2023-06-18 05:38:46,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109740.0, ans=0.1
2023-06-18 05:39:23,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=109800.0, ans=0.125
2023-06-18 05:39:24,243 INFO [train.py:996] (3/4) Epoch 1, batch 18300, loss[loss=0.3507, simple_loss=0.3937, pruned_loss=0.1538, over 21846.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3624, pruned_loss=0.1319, over 4261473.76 frames. ], batch size: 351, lr: 2.77e-02, grad_scale: 32.0
2023-06-18 05:40:20,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=109920.0, ans=0.2
2023-06-18 05:40:52,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=109920.0, ans=0.2
2023-06-18 05:40:53,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=109920.0, ans=0.125
2023-06-18 05:41:05,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.323e+02 3.972e+02 4.906e+02 9.934e+02, threshold=7.944e+02, percent-clipped=3.0
2023-06-18 05:41:40,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=110040.0, ans=0.125
2023-06-18 05:41:42,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.22 vs. limit=6.0
2023-06-18 05:41:43,039 INFO [train.py:996] (3/4) Epoch 1, batch 18350, loss[loss=0.3099, simple_loss=0.3459, pruned_loss=0.1369, over 20805.00 frames. ], tot_loss[loss=0.3181, simple_loss=0.3716, pruned_loss=0.1323, over 4254076.66 frames. ], batch size: 608, lr: 2.77e-02, grad_scale: 32.0
2023-06-18 05:41:54,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=110100.0, ans=0.125
2023-06-18 05:41:55,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.56 vs. limit=22.5
2023-06-18 05:42:06,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=110160.0, ans=0.2
2023-06-18 05:42:55,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=110220.0, ans=0.125
2023-06-18 05:43:32,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110340.0, ans=0.1
2023-06-18 05:43:33,131 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.89 vs. limit=22.5
2023-06-18 05:43:59,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=110340.0, ans=0.125
2023-06-18 05:44:12,473 INFO [train.py:996] (3/4) Epoch 1, batch 18400, loss[loss=0.2697, simple_loss=0.3202, pruned_loss=0.1096, over 21154.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3663, pruned_loss=0.1314, over 4252147.73 frames. ], batch size: 143, lr: 2.76e-02, grad_scale: 32.0
2023-06-18 05:44:21,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=110400.0, ans=0.125
2023-06-18 05:44:49,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=110460.0, ans=0.0
2023-06-18 05:45:20,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0
2023-06-18 05:45:22,721 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.119e+02 3.679e+02 4.893e+02 6.747e+02, threshold=7.358e+02, percent-clipped=0.0
2023-06-18 05:45:37,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=110580.0, ans=0.125
2023-06-18 05:46:30,238 INFO [train.py:996] (3/4) Epoch 1, batch 18450, loss[loss=0.283, simple_loss=0.3455, pruned_loss=0.1102, over 21534.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3615, pruned_loss=0.1258, over 4254316.51 frames. ], batch size: 212, lr: 2.76e-02, grad_scale: 32.0
2023-06-18 05:46:32,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110700.0, ans=0.1
2023-06-18 05:48:16,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0
2023-06-18 05:48:26,250 INFO [train.py:996] (3/4) Epoch 1, batch 18500, loss[loss=0.2577, simple_loss=0.3299, pruned_loss=0.09271, over 21674.00 frames. ], tot_loss[loss=0.3039, simple_loss=0.3571, pruned_loss=0.1253, over 4259449.90 frames. ], batch size: 247, lr: 2.76e-02, grad_scale: 32.0
2023-06-18 05:48:26,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=111000.0, ans=0.125
2023-06-18 05:49:00,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=111120.0, ans=0.125
2023-06-18 05:49:00,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=111120.0, ans=0.125
2023-06-18 05:49:20,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.596e+02 4.284e+02 6.309e+02 9.887e+02, threshold=8.569e+02, percent-clipped=11.0
2023-06-18 05:49:28,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=111180.0, ans=0.125
2023-06-18 05:50:02,713 INFO [train.py:996] (3/4) Epoch 1, batch 18550, loss[loss=0.2542, simple_loss=0.301, pruned_loss=0.1037, over 21421.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3547, pruned_loss=0.1247, over 4260502.18 frames. ], batch size: 194, lr: 2.76e-02, grad_scale: 32.0
2023-06-18 05:50:13,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=111300.0, ans=0.0
2023-06-18 05:50:25,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0
2023-06-18 05:52:01,505 INFO [train.py:996] (3/4) Epoch 1, batch 18600, loss[loss=0.3511, simple_loss=0.3997, pruned_loss=0.1513, over 21890.00 frames. ], tot_loss[loss=0.3011, simple_loss=0.3519, pruned_loss=0.1252, over 4244654.50 frames. ], batch size: 373, lr: 2.75e-02, grad_scale: 32.0
2023-06-18 05:52:59,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5
2023-06-18 05:53:20,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.842e+02 3.411e+02 4.162e+02 7.856e+02, threshold=6.821e+02, percent-clipped=0.0
2023-06-18 05:54:02,506 INFO [train.py:996] (3/4) Epoch 1, batch 18650, loss[loss=0.2545, simple_loss=0.3199, pruned_loss=0.09452, over 21452.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3511, pruned_loss=0.1259, over 4241584.84 frames. ], batch size: 212, lr: 2.75e-02, grad_scale: 32.0
2023-06-18 05:54:45,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112020.0, ans=0.1
2023-06-18 05:54:55,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=112020.0, ans=0.125
2023-06-18 05:55:40,745 INFO [train.py:996] (3/4) Epoch 1, batch 18700, loss[loss=0.3429, simple_loss=0.376, pruned_loss=0.1549, over 21723.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3514, pruned_loss=0.1295, over 4254498.31 frames. ], batch size: 389, lr: 2.75e-02, grad_scale: 32.0
2023-06-18 05:56:03,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.76 vs. limit=10.0
2023-06-18 05:56:37,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=112320.0, ans=0.125
2023-06-18 05:57:01,543 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 3.044e+02 3.629e+02 4.654e+02 7.971e+02, threshold=7.259e+02, percent-clipped=4.0
2023-06-18 05:57:18,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0
2023-06-18 05:57:27,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=112440.0, ans=0.04949747468305833
2023-06-18 05:57:50,359 INFO [train.py:996] (3/4) Epoch 1, batch 18750, loss[loss=0.2854, simple_loss=0.3334, pruned_loss=0.1187, over 21344.00 frames. ], tot_loss[loss=0.3096, simple_loss=0.3539, pruned_loss=0.1326, over 4242448.93 frames. ], batch size: 176, lr: 2.74e-02, grad_scale: 32.0
2023-06-18 05:58:18,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=112500.0, ans=0.125
2023-06-18 05:58:45,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=112560.0, ans=0.09899494936611666
2023-06-18 06:00:01,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=112740.0, ans=0.125
2023-06-18 06:00:15,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=112740.0, ans=0.04949747468305833
2023-06-18 06:00:27,315 INFO [train.py:996] (3/4) Epoch 1, batch 18800, loss[loss=0.2615, simple_loss=0.33, pruned_loss=0.09654, over 21833.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.3608, pruned_loss=0.1342, over 4251807.52 frames. ], batch size: 316, lr: 2.74e-02, grad_scale: 32.0
2023-06-18 06:01:10,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5
2023-06-18 06:01:24,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=112860.0, ans=0.2
2023-06-18 06:01:44,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.68 vs. limit=10.0
2023-06-18 06:02:04,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=112920.0, ans=0.04949747468305833
2023-06-18 06:02:08,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.633e+02 3.299e+02 4.034e+02 4.833e+02 8.926e+02, threshold=8.067e+02, percent-clipped=4.0
2023-06-18 06:02:18,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=112980.0, ans=0.5
2023-06-18 06:02:48,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=113100.0, ans=0.125
2023-06-18 06:02:49,580 INFO [train.py:996] (3/4) Epoch 1, batch 18850, loss[loss=0.2596, simple_loss=0.3122, pruned_loss=0.1035, over 21509.00 frames. ], tot_loss[loss=0.304, simple_loss=0.3542, pruned_loss=0.1269, over 4247796.12 frames. ], batch size: 230, lr: 2.74e-02, grad_scale: 32.0
2023-06-18 06:04:04,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=113220.0, ans=0.2
2023-06-18 06:04:25,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=113280.0, ans=0.125
2023-06-18 06:04:34,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0
2023-06-18 06:04:52,274 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.18 vs. limit=6.0
2023-06-18 06:04:56,927 INFO [train.py:996] (3/4) Epoch 1, batch 18900, loss[loss=0.2951, simple_loss=0.3186, pruned_loss=0.1358, over 21059.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3507, pruned_loss=0.1272, over 4253197.54 frames. ], batch size: 608, lr: 2.73e-02, grad_scale: 32.0
2023-06-18 06:05:11,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0
2023-06-18 06:06:03,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=113520.0, ans=0.125
2023-06-18 06:06:38,852 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 3.081e+02 3.640e+02 4.781e+02 9.031e+02, threshold=7.280e+02, percent-clipped=2.0
2023-06-18 06:07:07,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=113640.0, ans=0.2
2023-06-18 06:07:43,561 INFO [train.py:996] (3/4) Epoch 1, batch 18950, loss[loss=0.3312, simple_loss=0.4081, pruned_loss=0.1272, over 21835.00 frames. ], tot_loss[loss=0.3105, simple_loss=0.3561, pruned_loss=0.1324, over 4262388.21 frames. ], batch size: 351, lr: 2.73e-02, grad_scale: 32.0
2023-06-18 06:07:47,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0
2023-06-18 06:07:54,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=113700.0, ans=0.0
2023-06-18 06:08:08,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=113760.0, ans=0.125
2023-06-18 06:08:37,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=113760.0, ans=0.125
2023-06-18 06:09:09,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=113820.0, ans=0.0
2023-06-18 06:09:30,350 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0
2023-06-18 06:10:05,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=113940.0, ans=0.0
2023-06-18 06:10:26,256 INFO [train.py:996] (3/4) Epoch 1, batch 19000, loss[loss=0.3951, simple_loss=0.4411, pruned_loss=0.1745, over 21407.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.3674, pruned_loss=0.1351, over 4268422.07 frames. ], batch size: 471, lr: 2.73e-02, grad_scale: 32.0
2023-06-18 06:10:35,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=114000.0, ans=0.0
2023-06-18 06:11:28,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.443e+02 4.158e+02 4.977e+02 1.551e+03, threshold=8.315e+02, percent-clipped=7.0
2023-06-18 06:11:50,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=114180.0, ans=0.0
2023-06-18 06:12:15,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=114240.0, ans=0.125
2023-06-18 06:12:23,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0
2023-06-18 06:12:32,964 INFO [train.py:996] (3/4) Epoch 1, batch 19050, loss[loss=0.3592, simple_loss=0.3959, pruned_loss=0.1613, over 20628.00 frames. ], tot_loss[loss=0.326, simple_loss=0.3726, pruned_loss=0.1397, over 4272168.70 frames. ], batch size: 607, lr: 2.72e-02, grad_scale: 32.0
2023-06-18 06:12:52,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=114300.0, ans=0.0
2023-06-18 06:14:14,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=114480.0, ans=0.125
2023-06-18 06:15:15,459 INFO [train.py:996] (3/4) Epoch 1, batch 19100, loss[loss=0.3331, simple_loss=0.3641, pruned_loss=0.1511, over 21555.00 frames. ], tot_loss[loss=0.3269, simple_loss=0.3714, pruned_loss=0.1412, over 4274456.41 frames. ], batch size: 414, lr: 2.72e-02, grad_scale: 32.0
2023-06-18 06:15:16,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.63 vs. limit=15.0
2023-06-18 06:16:10,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114720.0, ans=0.1
2023-06-18 06:16:32,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.540e+02 3.357e+02 4.030e+02 5.006e+02 7.474e+02, threshold=8.060e+02, percent-clipped=0.0
2023-06-18 06:17:14,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=114840.0, ans=0.0
2023-06-18 06:18:01,641 INFO [train.py:996] (3/4) Epoch 1, batch 19150, loss[loss=0.4634, simple_loss=0.5086, pruned_loss=0.2091, over 21497.00 frames. ], tot_loss[loss=0.3283, simple_loss=0.3728, pruned_loss=0.1419, over 4265382.31 frames. ], batch size: 471, lr: 2.72e-02, grad_scale: 32.0
2023-06-18 06:18:02,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=114900.0, ans=0.125
2023-06-18 06:18:06,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=114900.0, ans=10.0
2023-06-18 06:18:07,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0
2023-06-18 06:18:49,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=115020.0, ans=0.0
2023-06-18 06:19:43,368 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.956e-03
2023-06-18 06:20:38,013 INFO [train.py:996] (3/4) Epoch 1, batch 19200, loss[loss=0.3291, simple_loss=0.4152, pruned_loss=0.1215, over 21719.00 frames. ], tot_loss[loss=0.3337, simple_loss=0.3828, pruned_loss=0.1423, over 4269346.38 frames. ], batch size: 298, lr: 2.71e-02, grad_scale: 32.0
2023-06-18 06:21:38,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=115320.0, ans=0.125
2023-06-18 06:21:49,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 3.092e+02 3.837e+02 4.671e+02 8.670e+02, threshold=7.675e+02, percent-clipped=1.0
2023-06-18 06:22:44,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5
2023-06-18 06:23:03,907 INFO [train.py:996] (3/4) Epoch 1, batch 19250, loss[loss=0.3108, simple_loss=0.3701, pruned_loss=0.1258, over 21716.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3796, pruned_loss=0.1336, over 4278024.76 frames. ], batch size: 441, lr: 2.71e-02, grad_scale: 32.0
2023-06-18 06:23:43,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=115500.0, ans=0.125
2023-06-18 06:24:28,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=115680.0, ans=0.05
2023-06-18 06:24:35,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=115680.0, ans=0.125
2023-06-18 06:25:36,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=115740.0, ans=0.125
2023-06-18 06:25:40,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=115800.0, ans=0.0
2023-06-18 06:25:41,843 INFO [train.py:996] (3/4) Epoch 1, batch 19300, loss[loss=0.2862, simple_loss=0.3442, pruned_loss=0.1141, over 21585.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.378, pruned_loss=0.1342, over 4276692.09 frames. ], batch size: 195, lr: 2.71e-02, grad_scale: 32.0
2023-06-18 06:26:46,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=115860.0, ans=0.2
2023-06-18 06:27:02,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5
2023-06-18 06:27:15,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.985e+02 3.697e+02 4.596e+02 6.937e+02, threshold=7.395e+02, percent-clipped=0.0
2023-06-18 06:27:17,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=115980.0, ans=0.02
2023-06-18 06:28:28,634 INFO [train.py:996] (3/4) Epoch 1, batch 19350, loss[loss=0.4143, simple_loss=0.4847, pruned_loss=0.1719, over 19714.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.371, pruned_loss=0.1286, over 4276006.24 frames. ], batch size: 703, lr: 2.71e-02, grad_scale: 32.0
2023-06-18 06:29:05,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116160.0, ans=0.1
2023-06-18 06:29:39,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=116220.0, ans=0.0
2023-06-18 06:30:16,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=116280.0, ans=0.125
2023-06-18 06:30:59,114 INFO [train.py:996] (3/4) Epoch 1, batch 19400, loss[loss=0.2903, simple_loss=0.3394, pruned_loss=0.1206, over 21533.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3667, pruned_loss=0.1268, over 4280482.51 frames. ], batch size: 195, lr: 2.70e-02, grad_scale: 32.0
2023-06-18 06:32:21,369 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.419e+02 3.848e+02 4.708e+02 7.710e+02, threshold=7.695e+02, percent-clipped=3.0
2023-06-18 06:32:36,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0
2023-06-18 06:32:47,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=116580.0, ans=0.125
2023-06-18 06:33:32,546 INFO [train.py:996] (3/4) Epoch 1, batch 19450, loss[loss=0.3182, simple_loss=0.352, pruned_loss=0.1422, over 21860.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.366, pruned_loss=0.1306, over 4290127.47 frames. ], batch size: 107, lr: 2.70e-02, grad_scale: 32.0
2023-06-18 06:34:08,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.10 vs. limit=15.0
2023-06-18 06:34:11,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=116760.0, ans=0.125
2023-06-18 06:35:49,138 INFO [train.py:996] (3/4) Epoch 1, batch 19500, loss[loss=0.2812, simple_loss=0.3211, pruned_loss=0.1206, over 21532.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.362, pruned_loss=0.1321, over 4281951.03 frames. ], batch size: 230, lr: 2.70e-02, grad_scale: 32.0
2023-06-18 06:37:30,032 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.181e+02 3.762e+02 4.814e+02 7.838e+02, threshold=7.523e+02, percent-clipped=1.0
2023-06-18 06:38:13,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=117240.0, ans=0.125
2023-06-18 06:38:20,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=117240.0, ans=0.035
2023-06-18 06:38:40,443 INFO [train.py:996] (3/4) Epoch 1, batch 19550, loss[loss=0.3208, simple_loss=0.3924, pruned_loss=0.1246, over 21534.00 frames. ], tot_loss[loss=0.3078, simple_loss=0.3568, pruned_loss=0.1294, over 4265553.17 frames.
], batch size: 471, lr: 2.69e-02, grad_scale: 64.0 2023-06-18 06:38:43,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=117300.0, ans=0.125 2023-06-18 06:38:46,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=117300.0, ans=0.1 2023-06-18 06:38:58,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=117360.0, ans=0.125 2023-06-18 06:39:28,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-06-18 06:40:27,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=117480.0, ans=0.0 2023-06-18 06:40:46,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=117540.0, ans=0.125 2023-06-18 06:41:02,438 INFO [train.py:996] (3/4) Epoch 1, batch 19600, loss[loss=0.343, simple_loss=0.3713, pruned_loss=0.1573, over 21620.00 frames. ], tot_loss[loss=0.3134, simple_loss=0.3613, pruned_loss=0.1328, over 4272048.21 frames. ], batch size: 548, lr: 2.69e-02, grad_scale: 64.0 2023-06-18 06:41:21,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=117660.0, ans=0.125 2023-06-18 06:41:22,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.76 vs. limit=10.0 2023-06-18 06:41:40,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.35 vs. limit=10.0 2023-06-18 06:42:13,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=22.5 2023-06-18 06:42:19,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=117720.0, ans=0.025 2023-06-18 06:42:30,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.158e+02 3.823e+02 5.330e+02 9.735e+02, threshold=7.645e+02, percent-clipped=7.0 2023-06-18 06:42:40,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=117780.0, ans=0.0 2023-06-18 06:43:45,968 INFO [train.py:996] (3/4) Epoch 1, batch 19650, loss[loss=0.3433, simple_loss=0.3969, pruned_loss=0.1448, over 21468.00 frames. ], tot_loss[loss=0.3248, simple_loss=0.3703, pruned_loss=0.1396, over 4276030.48 frames. 
], batch size: 131, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 06:44:02,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=117900.0, ans=0.125 2023-06-18 06:44:50,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=117960.0, ans=0.0 2023-06-18 06:46:18,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=118140.0, ans=0.1 2023-06-18 06:46:27,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=118140.0, ans=0.09899494936611666 2023-06-18 06:46:50,132 INFO [train.py:996] (3/4) Epoch 1, batch 19700, loss[loss=0.2976, simple_loss=0.3744, pruned_loss=0.1104, over 21708.00 frames. ], tot_loss[loss=0.3283, simple_loss=0.3751, pruned_loss=0.1407, over 4275274.70 frames. ], batch size: 351, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 06:47:06,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=22.5 2023-06-18 06:47:24,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=118260.0, ans=0.2 2023-06-18 06:47:35,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=118260.0, ans=0.0 2023-06-18 06:47:38,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=118260.0, ans=0.125 2023-06-18 06:47:43,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=118320.0, ans=0.125 2023-06-18 06:48:30,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.207e+02 3.429e+02 4.231e+02 5.435e+02 1.062e+03, threshold=8.463e+02, percent-clipped=10.0 2023-06-18 06:49:32,816 INFO [train.py:996] (3/4) Epoch 1, batch 19750, loss[loss=0.3347, simple_loss=0.3974, pruned_loss=0.136, over 21615.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.3851, pruned_loss=0.1417, over 4276144.62 frames. ], batch size: 263, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 06:49:50,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=118500.0, ans=0.125 2023-06-18 06:50:15,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=118560.0, ans=0.0 2023-06-18 06:50:29,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=118620.0, ans=0.125 2023-06-18 06:51:35,822 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 06:51:58,203 INFO [train.py:996] (3/4) Epoch 1, batch 19800, loss[loss=0.3304, simple_loss=0.3759, pruned_loss=0.1425, over 21829.00 frames. ], tot_loss[loss=0.3336, simple_loss=0.3836, pruned_loss=0.1418, over 4280014.18 frames. ], batch size: 282, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 06:51:59,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. 
limit=15.0 2023-06-18 06:52:00,895 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.92 vs. limit=22.5 2023-06-18 06:52:23,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-18 06:52:26,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-06-18 06:53:46,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.372e+02 3.953e+02 4.969e+02 1.016e+03, threshold=7.905e+02, percent-clipped=3.0 2023-06-18 06:53:55,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-06-18 06:54:02,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119040.0, ans=0.1 2023-06-18 06:54:39,912 INFO [train.py:996] (3/4) Epoch 1, batch 19850, loss[loss=0.2938, simple_loss=0.3664, pruned_loss=0.1106, over 21876.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3733, pruned_loss=0.1332, over 4283043.44 frames. ], batch size: 317, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 06:54:49,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=119100.0, ans=0.0 2023-06-18 06:55:18,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-18 06:55:31,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=119160.0, ans=0.025 2023-06-18 06:55:32,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=119160.0, ans=0.025 2023-06-18 06:55:50,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=119220.0, ans=0.125 2023-06-18 06:56:04,007 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-18 06:57:16,284 INFO [train.py:996] (3/4) Epoch 1, batch 19900, loss[loss=0.2175, simple_loss=0.2888, pruned_loss=0.07312, over 15775.00 frames. ], tot_loss[loss=0.3155, simple_loss=0.3717, pruned_loss=0.1297, over 4282947.92 frames. ], batch size: 60, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 06:57:16,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119400.0, ans=0.1 2023-06-18 06:58:14,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=119460.0, ans=0.125 2023-06-18 06:58:37,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. 
limit=15.0 2023-06-18 06:58:40,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 3.179e+02 3.927e+02 4.733e+02 6.841e+02, threshold=7.854e+02, percent-clipped=0.0 2023-06-18 06:58:44,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=119580.0, ans=0.125 2023-06-18 06:59:35,268 INFO [train.py:996] (3/4) Epoch 1, batch 19950, loss[loss=0.2911, simple_loss=0.3438, pruned_loss=0.1192, over 21758.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.364, pruned_loss=0.1293, over 4272623.95 frames. ], batch size: 351, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 07:00:42,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=119820.0, ans=0.125 2023-06-18 07:01:04,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=119820.0, ans=0.1 2023-06-18 07:01:13,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=119880.0, ans=0.125 2023-06-18 07:01:34,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=15.0 2023-06-18 07:01:35,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=119880.0, ans=0.0 2023-06-18 07:02:17,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-18 07:02:22,520 INFO [train.py:996] (3/4) Epoch 1, batch 20000, loss[loss=0.3435, simple_loss=0.3876, pruned_loss=0.1497, over 21875.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3651, pruned_loss=0.1305, over 4281662.96 frames. ], batch size: 118, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 07:03:08,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=120060.0, ans=0.09899494936611666 2023-06-18 07:03:31,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=120120.0, ans=0.125 2023-06-18 07:03:43,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.407e+02 3.393e+02 3.852e+02 4.728e+02 8.512e+02, threshold=7.705e+02, percent-clipped=1.0 2023-06-18 07:04:58,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120300.0, ans=0.1 2023-06-18 07:04:59,169 INFO [train.py:996] (3/4) Epoch 1, batch 20050, loss[loss=0.3453, simple_loss=0.38, pruned_loss=0.1553, over 21775.00 frames. ], tot_loss[loss=0.3184, simple_loss=0.3684, pruned_loss=0.1342, over 4285626.66 frames. 
], batch size: 441, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 07:05:38,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=120360.0, ans=0.125 2023-06-18 07:05:40,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=120360.0, ans=0.125 2023-06-18 07:05:44,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=120360.0, ans=0.2 2023-06-18 07:06:31,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-18 07:06:34,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=120480.0, ans=0.125 2023-06-18 07:06:55,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120480.0, ans=0.1 2023-06-18 07:07:00,054 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-18 07:07:28,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=120540.0, ans=0.125 2023-06-18 07:07:39,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=120540.0, ans=0.2 2023-06-18 07:07:43,289 INFO [train.py:996] (3/4) Epoch 1, batch 20100, loss[loss=0.3052, simple_loss=0.3589, pruned_loss=0.1257, over 21373.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3715, pruned_loss=0.1378, over 4290742.25 frames. ], batch size: 159, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 07:08:01,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=120600.0, ans=0.0 2023-06-18 07:08:21,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=120660.0, ans=0.5 2023-06-18 07:09:37,475 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.314e+02 3.983e+02 4.752e+02 1.053e+03, threshold=7.965e+02, percent-clipped=3.0 2023-06-18 07:09:47,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=120780.0, ans=0.0 2023-06-18 07:10:18,621 INFO [train.py:996] (3/4) Epoch 1, batch 20150, loss[loss=0.3328, simple_loss=0.3818, pruned_loss=0.1419, over 21548.00 frames. ], tot_loss[loss=0.3353, simple_loss=0.3841, pruned_loss=0.1433, over 4293081.27 frames. ], batch size: 230, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 07:11:47,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=121020.0, ans=0.0 2023-06-18 07:11:54,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=121020.0, ans=0.2 2023-06-18 07:12:10,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=121080.0, ans=0.2 2023-06-18 07:13:15,482 INFO [train.py:996] (3/4) Epoch 1, batch 20200, loss[loss=0.3195, simple_loss=0.3912, pruned_loss=0.1239, over 21391.00 frames. ], tot_loss[loss=0.3404, simple_loss=0.3896, pruned_loss=0.1456, over 4290523.73 frames. 
], batch size: 194, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 07:14:22,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121260.0, ans=0.1 2023-06-18 07:14:34,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=121320.0, ans=0.0 2023-06-18 07:15:08,167 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 3.389e+02 4.045e+02 4.963e+02 8.967e+02, threshold=8.091e+02, percent-clipped=1.0 2023-06-18 07:15:18,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=121380.0, ans=0.125 2023-06-18 07:15:25,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=121380.0, ans=0.125 2023-06-18 07:16:06,300 INFO [train.py:996] (3/4) Epoch 1, batch 20250, loss[loss=0.4031, simple_loss=0.4224, pruned_loss=0.1919, over 21604.00 frames. ], tot_loss[loss=0.3385, simple_loss=0.3898, pruned_loss=0.1436, over 4293165.00 frames. ], batch size: 507, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 07:17:11,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=121560.0, ans=0.125 2023-06-18 07:18:26,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=121740.0, ans=0.0 2023-06-18 07:18:30,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=121740.0, ans=0.125 2023-06-18 07:18:32,717 INFO [train.py:996] (3/4) Epoch 1, batch 20300, loss[loss=0.2529, simple_loss=0.3121, pruned_loss=0.09681, over 21878.00 frames. ], tot_loss[loss=0.3303, simple_loss=0.3846, pruned_loss=0.138, over 4282951.65 frames. ], batch size: 98, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 07:18:37,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-18 07:19:29,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=121860.0, ans=0.125 2023-06-18 07:19:58,576 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.892e+02 3.299e+02 4.163e+02 6.538e+02, threshold=6.599e+02, percent-clipped=0.0 2023-06-18 07:20:08,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-06-18 07:20:24,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122040.0, ans=0.1 2023-06-18 07:20:52,731 INFO [train.py:996] (3/4) Epoch 1, batch 20350, loss[loss=0.3656, simple_loss=0.4023, pruned_loss=0.1644, over 21654.00 frames. ], tot_loss[loss=0.331, simple_loss=0.3845, pruned_loss=0.1388, over 4280709.79 frames. ], batch size: 389, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 07:21:03,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=122100.0, ans=0.0 2023-06-18 07:21:15,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=12.0 2023-06-18 07:21:38,436 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.708e-03 2023-06-18 07:22:34,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=122280.0, ans=0.09899494936611666 2023-06-18 07:22:51,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122340.0, ans=0.125 2023-06-18 07:22:57,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122340.0, ans=0.1 2023-06-18 07:23:20,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=122340.0, ans=0.125 2023-06-18 07:23:23,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-18 07:23:24,536 INFO [train.py:996] (3/4) Epoch 1, batch 20400, loss[loss=0.3714, simple_loss=0.4165, pruned_loss=0.1632, over 21678.00 frames. ], tot_loss[loss=0.3358, simple_loss=0.3876, pruned_loss=0.142, over 4274105.96 frames. ], batch size: 389, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 07:23:27,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=122400.0, ans=0.0 2023-06-18 07:23:29,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=122400.0, ans=0.2 2023-06-18 07:24:01,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=22.5 2023-06-18 07:24:03,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=122460.0, ans=0.0 2023-06-18 07:24:20,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-18 07:24:28,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=122520.0, ans=0.0 2023-06-18 07:24:36,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.452e+02 4.202e+02 5.218e+02 1.418e+03, threshold=8.403e+02, percent-clipped=8.0 2023-06-18 07:25:38,635 INFO [train.py:996] (3/4) Epoch 1, batch 20450, loss[loss=0.3359, simple_loss=0.3796, pruned_loss=0.1461, over 21696.00 frames. ], tot_loss[loss=0.341, simple_loss=0.3897, pruned_loss=0.1462, over 4267039.18 frames. ], batch size: 112, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 07:26:13,984 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.89 vs. 
limit=15.0 2023-06-18 07:26:14,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=122760.0, ans=0.125 2023-06-18 07:26:46,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=122760.0, ans=0.02 2023-06-18 07:26:53,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122820.0, ans=0.1 2023-06-18 07:28:09,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=122940.0, ans=0.1 2023-06-18 07:28:18,145 INFO [train.py:996] (3/4) Epoch 1, batch 20500, loss[loss=0.3048, simple_loss=0.3398, pruned_loss=0.1349, over 21357.00 frames. ], tot_loss[loss=0.3404, simple_loss=0.3863, pruned_loss=0.1472, over 4267849.47 frames. ], batch size: 176, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 07:29:06,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.17 vs. limit=22.5 2023-06-18 07:29:08,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=123120.0, ans=0.0 2023-06-18 07:29:54,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 3.481e+02 4.279e+02 5.718e+02 8.765e+02, threshold=8.557e+02, percent-clipped=1.0 2023-06-18 07:30:18,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=123240.0, ans=0.125 2023-06-18 07:30:35,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123240.0, ans=0.1 2023-06-18 07:30:46,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=123300.0, ans=0.125 2023-06-18 07:30:47,879 INFO [train.py:996] (3/4) Epoch 1, batch 20550, loss[loss=0.3809, simple_loss=0.4314, pruned_loss=0.1652, over 21475.00 frames. ], tot_loss[loss=0.332, simple_loss=0.3773, pruned_loss=0.1433, over 4254174.59 frames. ], batch size: 473, lr: 2.63e-02, grad_scale: 16.0 2023-06-18 07:31:50,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=123420.0, ans=0.035 2023-06-18 07:33:01,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=123540.0, ans=0.125 2023-06-18 07:33:33,246 INFO [train.py:996] (3/4) Epoch 1, batch 20600, loss[loss=0.4344, simple_loss=0.4429, pruned_loss=0.213, over 21619.00 frames. ], tot_loss[loss=0.3283, simple_loss=0.378, pruned_loss=0.1393, over 4255618.96 frames. 
], batch size: 507, lr: 2.63e-02, grad_scale: 16.0 2023-06-18 07:33:33,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=123600.0, ans=0.125 2023-06-18 07:34:14,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=123660.0, ans=0.2 2023-06-18 07:34:48,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=123720.0, ans=0.0 2023-06-18 07:35:13,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 2.900e+02 3.607e+02 4.743e+02 9.151e+02, threshold=7.215e+02, percent-clipped=2.0 2023-06-18 07:35:23,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=123780.0, ans=0.0 2023-06-18 07:35:25,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=22.5 2023-06-18 07:36:08,483 INFO [train.py:996] (3/4) Epoch 1, batch 20650, loss[loss=0.2726, simple_loss=0.3107, pruned_loss=0.1173, over 21574.00 frames. ], tot_loss[loss=0.3257, simple_loss=0.373, pruned_loss=0.1392, over 4266561.46 frames. ], batch size: 263, lr: 2.63e-02, grad_scale: 16.0 2023-06-18 07:36:14,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=123900.0, ans=0.0 2023-06-18 07:36:34,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=123900.0, ans=0.0 2023-06-18 07:36:35,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=123900.0, ans=0.125 2023-06-18 07:37:36,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=124020.0, ans=0.125 2023-06-18 07:38:28,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=124140.0, ans=0.04949747468305833 2023-06-18 07:38:30,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=124140.0, ans=0.2 2023-06-18 07:38:40,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=124140.0, ans=0.125 2023-06-18 07:38:46,622 INFO [train.py:996] (3/4) Epoch 1, batch 20700, loss[loss=0.3458, simple_loss=0.4051, pruned_loss=0.1433, over 19994.00 frames. ], tot_loss[loss=0.3173, simple_loss=0.365, pruned_loss=0.1348, over 4256565.18 frames. ], batch size: 703, lr: 2.63e-02, grad_scale: 16.0 2023-06-18 07:38:49,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=124200.0, ans=0.05 2023-06-18 07:39:34,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=124260.0, ans=0.125 2023-06-18 07:39:37,450 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-18 07:39:44,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.35 vs. 
limit=15.0 2023-06-18 07:39:54,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=124320.0, ans=0.125 2023-06-18 07:40:14,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.75 vs. limit=6.0 2023-06-18 07:40:16,087 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.474e+02 3.916e+02 5.055e+02 8.009e+02, threshold=7.832e+02, percent-clipped=2.0 2023-06-18 07:41:27,293 INFO [train.py:996] (3/4) Epoch 1, batch 20750, loss[loss=0.3406, simple_loss=0.4083, pruned_loss=0.1365, over 21773.00 frames. ], tot_loss[loss=0.3155, simple_loss=0.3664, pruned_loss=0.1324, over 4263351.46 frames. ], batch size: 332, lr: 2.62e-02, grad_scale: 16.0 2023-06-18 07:42:39,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=124620.0, ans=0.125 2023-06-18 07:43:11,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=124680.0, ans=0.125 2023-06-18 07:44:11,472 INFO [train.py:996] (3/4) Epoch 1, batch 20800, loss[loss=0.2847, simple_loss=0.3257, pruned_loss=0.1219, over 21974.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.3692, pruned_loss=0.1343, over 4253525.86 frames. ], batch size: 103, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 07:44:11,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124800.0, ans=0.1 2023-06-18 07:44:41,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124860.0, ans=0.1 2023-06-18 07:44:41,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124860.0, ans=0.1 2023-06-18 07:45:16,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=124920.0, ans=0.07 2023-06-18 07:45:34,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=124980.0, ans=0.125 2023-06-18 07:45:37,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.418e+02 4.058e+02 5.141e+02 9.579e+02, threshold=8.117e+02, percent-clipped=5.0 2023-06-18 07:46:17,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=125040.0, ans=0.2 2023-06-18 07:46:35,497 INFO [train.py:996] (3/4) Epoch 1, batch 20850, loss[loss=0.3326, simple_loss=0.365, pruned_loss=0.1501, over 21531.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3602, pruned_loss=0.131, over 4255830.10 frames. ], batch size: 471, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 07:46:54,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=125160.0, ans=0.0 2023-06-18 07:46:54,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. 
limit=12.0 2023-06-18 07:47:40,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=125220.0, ans=0.125 2023-06-18 07:48:07,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-18 07:48:56,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-18 07:49:08,586 INFO [train.py:996] (3/4) Epoch 1, batch 20900, loss[loss=0.3129, simple_loss=0.3699, pruned_loss=0.128, over 21785.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3616, pruned_loss=0.1333, over 4272450.28 frames. ], batch size: 332, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 07:49:49,417 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-18 07:50:24,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.865e+02 3.704e+02 4.826e+02 8.208e+02, threshold=7.408e+02, percent-clipped=1.0 2023-06-18 07:51:11,913 INFO [train.py:996] (3/4) Epoch 1, batch 20950, loss[loss=0.3772, simple_loss=0.3929, pruned_loss=0.1807, over 21594.00 frames. ], tot_loss[loss=0.3062, simple_loss=0.3566, pruned_loss=0.1279, over 4275807.26 frames. ], batch size: 508, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 07:52:38,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=22.5 2023-06-18 07:52:44,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=125880.0, ans=0.2 2023-06-18 07:53:03,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=125940.0, ans=0.125 2023-06-18 07:53:19,311 INFO [train.py:996] (3/4) Epoch 1, batch 21000, loss[loss=0.3361, simple_loss=0.3783, pruned_loss=0.1469, over 21873.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3563, pruned_loss=0.1285, over 4274637.76 frames. ], batch size: 414, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 07:53:19,312 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 07:54:10,682 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6280, 3.4932, 3.1162, 3.2903], device='cuda:3') 2023-06-18 07:54:13,245 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3148, simple_loss=0.4061, pruned_loss=0.1118, over 1796401.00 frames. 
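[Editor's note: the records above are regular enough to mine programmatically. Every `[train.py:996]` entry carries `Epoch`, `batch`, `tot_loss[...]`, `lr`, and `grad_scale` fields, and every `[optim.py:471]` entry reports the grad-norm quartiles, clipping threshold, and percent clipped. The sketch below is a minimal parser for pulling a loss curve and the clipping statistics out of a log like this one; it is not part of icefall — the helper name `parse_log`, the regexes, and the `log-train.txt` filename are our own — and it assumes only the record shapes visible in this log.]

```python
import re
from pathlib import Path

# Per-batch training records logged by train.py:996, e.g.
#   Epoch 1, batch 19650, loss[...], tot_loss[loss=0.3248, simple_loss=0.3703,
#   pruned_loss=0.1396, over 4276030.48 frames. ], batch size: 131,
#   lr: 2.69e-02, grad_scale: 32.0
BATCH = re.compile(
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+),.*?"
    r"tot_loss\[loss=(?P<loss>[\d.]+),.*?"
    r"lr: (?P<lr>[\d.e+-]+), grad_scale: (?P<gs>[\d.]+)",
    re.DOTALL,  # records wrap across physical lines in this log
)

# Gradient-clipping summaries logged by optim.py:471, e.g.
#   grad-norm quartiles 2.017e+02 2.966e+02 3.592e+02 4.576e+02 7.047e+02,
#   threshold=7.185e+02, percent-clipped=0.0
CLIP = re.compile(
    r"grad-norm quartiles (?P<quartiles>[\d.e+\- ]+?), "
    r"threshold=(?P<threshold>[\d.e+-]+), percent-clipped=(?P<pc>[\d.]+)"
)

def parse_log(path: str):
    """Collect (batch, tot_loss, lr) points and (threshold, percent-clipped)
    pairs from one training log file."""
    text = Path(path).read_text()
    points = [(int(m["batch"]), float(m["loss"]), float(m["lr"]))
              for m in BATCH.finditer(text)]
    clips = [(float(m["threshold"]), float(m["pc"]))
             for m in CLIP.finditer(text)]
    return points, clips

if __name__ == "__main__":
    points, clips = parse_log("log-train.txt")  # placeholder filename
    for batch, loss, lr in points[-3:]:
        print(f"batch={batch:>6d} tot_loss={loss:.4f} lr={lr:.2e}")
```

[Because several records share one physical line and single records wrap across lines, the sketch matches with `re.DOTALL` over the whole file rather than splitting on newlines; the non-greedy `.*?` keeps each match inside one record, since `tot_loss[` and `lr:` follow the `Epoch ..., batch ...` header closely in every entry shown above.]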
2023-06-18 07:54:13,253 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-18 07:54:40,683 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:55:11,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=126120.0, ans=0.0 2023-06-18 07:55:15,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.966e+02 3.592e+02 4.576e+02 7.047e+02, threshold=7.185e+02, percent-clipped=0.0 2023-06-18 07:55:32,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=126180.0, ans=0.2 2023-06-18 07:56:21,893 INFO [train.py:996] (3/4) Epoch 1, batch 21050, loss[loss=0.3267, simple_loss=0.3682, pruned_loss=0.1426, over 21535.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3543, pruned_loss=0.1292, over 4269267.54 frames. ], batch size: 414, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 07:56:23,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=126300.0, ans=0.0 2023-06-18 07:58:17,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-18 07:58:25,920 INFO [train.py:996] (3/4) Epoch 1, batch 21100, loss[loss=0.2546, simple_loss=0.3072, pruned_loss=0.101, over 21865.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.35, pruned_loss=0.1284, over 4266745.47 frames. ], batch size: 107, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 07:59:02,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=126660.0, ans=0.125 2023-06-18 07:59:13,334 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5 2023-06-18 07:59:31,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=126720.0, ans=0.5 2023-06-18 07:59:45,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=126720.0, ans=0.125 2023-06-18 07:59:50,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 3.220e+02 3.817e+02 4.701e+02 7.563e+02, threshold=7.635e+02, percent-clipped=1.0 2023-06-18 08:00:50,249 INFO [train.py:996] (3/4) Epoch 1, batch 21150, loss[loss=0.2895, simple_loss=0.3398, pruned_loss=0.1195, over 15647.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3453, pruned_loss=0.128, over 4266023.80 frames. ], batch size: 60, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 08:01:36,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=126960.0, ans=0.1 2023-06-18 08:02:08,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127020.0, ans=0.1 2023-06-18 08:02:09,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=127080.0, ans=0.125 2023-06-18 08:02:25,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. 
limit=15.0 2023-06-18 08:03:11,672 INFO [train.py:996] (3/4) Epoch 1, batch 21200, loss[loss=0.2863, simple_loss=0.3332, pruned_loss=0.1197, over 21586.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3407, pruned_loss=0.1257, over 4267649.15 frames. ], batch size: 414, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 08:04:02,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=127260.0, ans=0.125 2023-06-18 08:04:35,297 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 3.266e+02 3.911e+02 4.430e+02 6.717e+02, threshold=7.823e+02, percent-clipped=0.0 2023-06-18 08:05:05,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=127380.0, ans=0.125 2023-06-18 08:05:57,565 INFO [train.py:996] (3/4) Epoch 1, batch 21250, loss[loss=0.2991, simple_loss=0.3477, pruned_loss=0.1253, over 21339.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3396, pruned_loss=0.1255, over 4268941.59 frames. ], batch size: 159, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 08:06:01,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=127500.0, ans=0.5 2023-06-18 08:06:08,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=22.5 2023-06-18 08:06:58,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.70 vs. limit=22.5 2023-06-18 08:07:16,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=127620.0, ans=0.2 2023-06-18 08:07:40,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127680.0, ans=0.1 2023-06-18 08:07:58,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=127740.0, ans=0.125 2023-06-18 08:08:27,254 INFO [train.py:996] (3/4) Epoch 1, batch 21300, loss[loss=0.3137, simple_loss=0.3432, pruned_loss=0.1421, over 20042.00 frames. ], tot_loss[loss=0.3023, simple_loss=0.3466, pruned_loss=0.129, over 4270441.54 frames. ], batch size: 702, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 08:09:01,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=127860.0, ans=0.2 2023-06-18 08:09:12,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=127860.0, ans=0.0 2023-06-18 08:09:16,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=127860.0, ans=0.125 2023-06-18 08:09:22,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=127920.0, ans=0.125 2023-06-18 08:09:55,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.217e+02 3.809e+02 4.662e+02 7.490e+02, threshold=7.618e+02, percent-clipped=0.0 2023-06-18 08:10:47,965 INFO [train.py:996] (3/4) Epoch 1, batch 21350, loss[loss=0.2408, simple_loss=0.3161, pruned_loss=0.0827, over 21442.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.3528, pruned_loss=0.1304, over 4280813.77 frames. 
], batch size: 195, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 08:10:55,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-18 08:11:33,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.78 vs. limit=6.0 2023-06-18 08:12:43,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=128280.0, ans=0.015 2023-06-18 08:12:51,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=128280.0, ans=0.0 2023-06-18 08:13:25,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128340.0, ans=0.1 2023-06-18 08:13:32,905 INFO [train.py:996] (3/4) Epoch 1, batch 21400, loss[loss=0.3682, simple_loss=0.4084, pruned_loss=0.164, over 21703.00 frames. ], tot_loss[loss=0.311, simple_loss=0.359, pruned_loss=0.1315, over 4283032.95 frames. ], batch size: 351, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 08:14:07,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=128460.0, ans=0.0 2023-06-18 08:14:09,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=128460.0, ans=0.125 2023-06-18 08:14:09,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=128460.0, ans=0.125 2023-06-18 08:15:26,239 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.956e+02 3.569e+02 4.352e+02 6.315e+02, threshold=7.139e+02, percent-clipped=0.0 2023-06-18 08:16:33,236 INFO [train.py:996] (3/4) Epoch 1, batch 21450, loss[loss=0.3428, simple_loss=0.38, pruned_loss=0.1529, over 21894.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3621, pruned_loss=0.133, over 4282303.89 frames. ], batch size: 414, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 08:16:47,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-06-18 08:17:01,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=128760.0, ans=0.125 2023-06-18 08:17:20,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=128820.0, ans=0.125 2023-06-18 08:17:55,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=128880.0, ans=0.0 2023-06-18 08:18:59,346 INFO [train.py:996] (3/4) Epoch 1, batch 21500, loss[loss=0.29, simple_loss=0.3301, pruned_loss=0.1249, over 21665.00 frames. ], tot_loss[loss=0.3142, simple_loss=0.3604, pruned_loss=0.134, over 4272675.08 frames. 
], batch size: 333, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 08:18:59,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=129000.0, ans=0.0 2023-06-18 08:19:05,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=129000.0, ans=0.0 2023-06-18 08:19:49,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=129060.0, ans=0.0 2023-06-18 08:19:58,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=129120.0, ans=0.125 2023-06-18 08:20:30,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.387e+02 3.519e+02 4.524e+02 5.451e+02 1.077e+03, threshold=9.047e+02, percent-clipped=9.0 2023-06-18 08:20:35,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=129180.0, ans=0.125 2023-06-18 08:21:28,501 INFO [train.py:996] (3/4) Epoch 1, batch 21550, loss[loss=0.2482, simple_loss=0.2938, pruned_loss=0.1013, over 21273.00 frames. ], tot_loss[loss=0.3062, simple_loss=0.3519, pruned_loss=0.1302, over 4272469.71 frames. ], batch size: 159, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 08:22:36,594 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.50 vs. limit=15.0 2023-06-18 08:22:53,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=129480.0, ans=0.2 2023-06-18 08:23:27,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129540.0, ans=0.1 2023-06-18 08:24:15,448 INFO [train.py:996] (3/4) Epoch 1, batch 21600, loss[loss=0.3146, simple_loss=0.3669, pruned_loss=0.1311, over 21828.00 frames. ], tot_loss[loss=0.3011, simple_loss=0.3466, pruned_loss=0.1279, over 4273160.72 frames. ], batch size: 372, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 08:24:42,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=129660.0, ans=0.04949747468305833 2023-06-18 08:24:45,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=129660.0, ans=0.125 2023-06-18 08:25:40,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=129780.0, ans=0.125 2023-06-18 08:25:43,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.179e+02 3.851e+02 4.633e+02 7.831e+02, threshold=7.701e+02, percent-clipped=0.0 2023-06-18 08:26:12,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=129840.0, ans=0.125 2023-06-18 08:26:12,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=129840.0, ans=0.1 2023-06-18 08:26:40,565 INFO [train.py:996] (3/4) Epoch 1, batch 21650, loss[loss=0.2597, simple_loss=0.3303, pruned_loss=0.09452, over 21741.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3525, pruned_loss=0.1261, over 4269895.30 frames. 
], batch size: 124, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 08:26:41,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129900.0, ans=0.1 2023-06-18 08:26:54,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=129960.0, ans=0.2 2023-06-18 08:28:49,752 INFO [train.py:996] (3/4) Epoch 1, batch 21700, loss[loss=0.2706, simple_loss=0.3269, pruned_loss=0.1072, over 21285.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3495, pruned_loss=0.1214, over 4261640.18 frames. ], batch size: 176, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 08:29:01,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=130200.0, ans=0.0 2023-06-18 08:29:24,221 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:30:12,478 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.876e+02 3.492e+02 4.292e+02 6.287e+02, threshold=6.984e+02, percent-clipped=0.0 2023-06-18 08:31:03,497 INFO [train.py:996] (3/4) Epoch 1, batch 21750, loss[loss=0.2802, simple_loss=0.3318, pruned_loss=0.1143, over 21794.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3463, pruned_loss=0.123, over 4258259.82 frames. ], batch size: 107, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 08:31:21,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=130560.0, ans=0.125 2023-06-18 08:31:24,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-18 08:31:58,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130560.0, ans=0.1 2023-06-18 08:32:24,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=130680.0, ans=0.125 2023-06-18 08:33:24,961 INFO [train.py:996] (3/4) Epoch 1, batch 21800, loss[loss=0.4054, simple_loss=0.433, pruned_loss=0.1889, over 21521.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3454, pruned_loss=0.1252, over 4256473.02 frames. ], batch size: 509, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 08:33:49,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=130800.0, ans=0.2 2023-06-18 08:33:54,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=130860.0, ans=0.0 2023-06-18 08:35:17,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 3.102e+02 3.611e+02 4.260e+02 6.998e+02, threshold=7.223e+02, percent-clipped=1.0 2023-06-18 08:35:21,572 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-18 08:35:33,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=131040.0, ans=0.125 2023-06-18 08:36:06,420 INFO [train.py:996] (3/4) Epoch 1, batch 21850, loss[loss=0.3526, simple_loss=0.3937, pruned_loss=0.1558, over 21852.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3518, pruned_loss=0.126, over 4255890.30 frames. 
], batch size: 414, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 08:36:21,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=131100.0, ans=0.0 2023-06-18 08:37:34,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=131220.0, ans=0.0 2023-06-18 08:37:39,695 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-18 08:38:49,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-18 08:38:54,592 INFO [train.py:996] (3/4) Epoch 1, batch 21900, loss[loss=0.3133, simple_loss=0.3498, pruned_loss=0.1384, over 21818.00 frames. ], tot_loss[loss=0.3073, simple_loss=0.3569, pruned_loss=0.1289, over 4257268.45 frames. ], batch size: 316, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 08:40:14,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.557e+02 4.079e+02 5.086e+02 8.901e+02, threshold=8.158e+02, percent-clipped=3.0 2023-06-18 08:40:58,976 INFO [train.py:996] (3/4) Epoch 1, batch 21950, loss[loss=0.231, simple_loss=0.288, pruned_loss=0.08697, over 21723.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3499, pruned_loss=0.1264, over 4257909.30 frames. ], batch size: 112, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 08:41:01,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.53 vs. limit=8.0 2023-06-18 08:41:09,723 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:41:50,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=131760.0, ans=0.1 2023-06-18 08:42:29,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=131820.0, ans=0.2 2023-06-18 08:42:33,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=131880.0, ans=0.125 2023-06-18 08:42:34,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=131880.0, ans=0.0 2023-06-18 08:43:18,608 INFO [train.py:996] (3/4) Epoch 1, batch 22000, loss[loss=0.3325, simple_loss=0.3722, pruned_loss=0.1464, over 21826.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.342, pruned_loss=0.1211, over 4253060.44 frames. ], batch size: 372, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 08:43:57,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=132000.0, ans=15.0 2023-06-18 08:44:07,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=132060.0, ans=0.0 2023-06-18 08:44:22,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=132060.0, ans=0.0 2023-06-18 08:44:27,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.79 vs. 
limit=15.0 2023-06-18 08:44:46,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=132120.0, ans=0.04949747468305833 2023-06-18 08:44:55,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=132120.0, ans=0.0 2023-06-18 08:45:03,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 3.001e+02 3.656e+02 5.158e+02 1.119e+03, threshold=7.313e+02, percent-clipped=4.0 2023-06-18 08:45:11,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132180.0, ans=0.1 2023-06-18 08:45:26,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=132180.0, ans=0.125 2023-06-18 08:45:43,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=132240.0, ans=0.125 2023-06-18 08:46:09,168 INFO [train.py:996] (3/4) Epoch 1, batch 22050, loss[loss=0.3995, simple_loss=0.4343, pruned_loss=0.1823, over 21262.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3456, pruned_loss=0.1228, over 4256064.24 frames. ], batch size: 159, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 08:47:00,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=132360.0, ans=0.125 2023-06-18 08:47:29,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=18.70 vs. limit=15.0 2023-06-18 08:47:32,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=132420.0, ans=0.0 2023-06-18 08:48:11,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=132480.0, ans=0.0 2023-06-18 08:48:11,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=132480.0, ans=0.125 2023-06-18 08:48:48,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.31 vs. limit=15.0 2023-06-18 08:48:48,914 INFO [train.py:996] (3/4) Epoch 1, batch 22100, loss[loss=0.3357, simple_loss=0.3857, pruned_loss=0.1429, over 21776.00 frames. ], tot_loss[loss=0.3121, simple_loss=0.3606, pruned_loss=0.1318, over 4259695.72 frames. 
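
The scaling.py:182 lines trace ScheduledFloat values: regularizer hyperparameters (skip rates, balancer probabilities, dropout ps, scale_min values) that are functions of batch_count rather than constants, which is why the same parameter name reappears with slowly changing ans values. A hedged sketch of a piecewise-linear schedule of this kind (the class name and constructor arguments are illustrative, not icefall's actual API):

    class ScheduledFloatSketch:
        """Piecewise-linear value keyed on batch count, e.g.
        ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1)) decays from 0.3
        to 0.1 over the first 20k batches, then stays at 0.1."""
        def __init__(self, *points):
            self.points = sorted(points)           # (batch_count, value) pairs

        def __call__(self, batch_count):
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:        # linear interpolation
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

At the batch counts in this stretch of the log (~129k onward), such a schedule would already sit on its final flat segment, consistent with the nearly constant ans values logged for most names.
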
], batch size: 282, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 08:49:06,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=132600.0, ans=0.0 2023-06-18 08:49:16,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=132660.0, ans=0.2 2023-06-18 08:49:21,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=132660.0, ans=0.2 2023-06-18 08:49:51,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=132660.0, ans=0.125 2023-06-18 08:49:59,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132720.0, ans=0.1 2023-06-18 08:50:10,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=132720.0, ans=0.125 2023-06-18 08:50:17,074 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.342e+02 4.040e+02 5.215e+02 7.946e+02, threshold=8.079e+02, percent-clipped=2.0 2023-06-18 08:50:27,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=132780.0, ans=0.07 2023-06-18 08:50:56,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=132840.0, ans=0.125 2023-06-18 08:51:25,517 INFO [train.py:996] (3/4) Epoch 1, batch 22150, loss[loss=0.302, simple_loss=0.3451, pruned_loss=0.1294, over 21914.00 frames. ], tot_loss[loss=0.3173, simple_loss=0.3656, pruned_loss=0.1345, over 4259513.96 frames. ], batch size: 107, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 08:51:51,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=132960.0, ans=0.0 2023-06-18 08:52:56,488 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-18 08:53:25,918 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:53:44,051 INFO [train.py:996] (3/4) Epoch 1, batch 22200, loss[loss=0.3655, simple_loss=0.411, pruned_loss=0.16, over 21781.00 frames. ], tot_loss[loss=0.3198, simple_loss=0.3673, pruned_loss=0.1361, over 4272137.40 frames. ], batch size: 441, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 08:54:11,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=133200.0, ans=0.125 2023-06-18 08:54:58,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133260.0, ans=0.1 2023-06-18 08:55:43,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.063e+02 3.788e+02 4.693e+02 1.107e+03, threshold=7.575e+02, percent-clipped=1.0 2023-06-18 08:55:59,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. 
limit=15.0 2023-06-18 08:56:04,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=133440.0, ans=0.0 2023-06-18 08:56:47,422 INFO [train.py:996] (3/4) Epoch 1, batch 22250, loss[loss=0.3292, simple_loss=0.3907, pruned_loss=0.1338, over 21413.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3739, pruned_loss=0.1369, over 4281764.11 frames. ], batch size: 211, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 08:56:47,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=133500.0, ans=0.0 2023-06-18 08:58:29,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=133680.0, ans=0.125 2023-06-18 08:59:08,862 INFO [train.py:996] (3/4) Epoch 1, batch 22300, loss[loss=0.3162, simple_loss=0.3507, pruned_loss=0.1408, over 21327.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3765, pruned_loss=0.1398, over 4285616.76 frames. ], batch size: 176, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 08:59:23,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=133800.0, ans=0.0 2023-06-18 09:00:05,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=133860.0, ans=0.2 2023-06-18 09:00:44,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133920.0, ans=0.1 2023-06-18 09:00:53,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.587e+02 3.379e+02 4.077e+02 4.905e+02 1.025e+03, threshold=8.153e+02, percent-clipped=2.0 2023-06-18 09:01:32,671 INFO [train.py:996] (3/4) Epoch 1, batch 22350, loss[loss=0.2863, simple_loss=0.3466, pruned_loss=0.113, over 21847.00 frames. ], tot_loss[loss=0.3278, simple_loss=0.3742, pruned_loss=0.1407, over 4284828.13 frames. ], batch size: 351, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 09:01:48,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134100.0, ans=0.1 2023-06-18 09:03:18,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=134280.0, ans=0.125 2023-06-18 09:03:29,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=134340.0, ans=0.125 2023-06-18 09:03:30,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134340.0, ans=0.1 2023-06-18 09:04:10,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-06-18 09:04:12,523 INFO [train.py:996] (3/4) Epoch 1, batch 22400, loss[loss=0.2853, simple_loss=0.3302, pruned_loss=0.1202, over 21347.00 frames. ], tot_loss[loss=0.3206, simple_loss=0.369, pruned_loss=0.1361, over 4287290.48 frames. 
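
In each train.py:996 line, the first loss[...] is the current batch and tot_loss[...] is an aggregate weighted by frame count; the aggregate's frame total hovers around 4.25M-4.29M rather than growing without bound, which suggests a windowed or decaying average rather than a full-epoch sum. The simplest (unwindowed) frame-weighted running average, for illustration only:

    class FrameWeightedLoss:
        # Matches the 'loss=..., over N frames' convention: each batch
        # contributes in proportion to its number of acoustic frames.
        def __init__(self):
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss, num_frames):
            self.loss_sum += loss * num_frames
            self.frames += num_frames

        def value(self):
            return self.loss_sum / max(self.frames, 1.0)
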
], batch size: 211, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 09:04:20,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=134400.0, ans=0.0 2023-06-18 09:04:51,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=134460.0, ans=0.5 2023-06-18 09:05:21,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=134520.0, ans=0.0 2023-06-18 09:05:35,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=134520.0, ans=0.125 2023-06-18 09:05:44,195 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.212e+02 4.054e+02 4.635e+02 7.635e+02, threshold=8.108e+02, percent-clipped=0.0 2023-06-18 09:05:59,405 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-18 09:06:14,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=134640.0, ans=0.2 2023-06-18 09:06:46,093 INFO [train.py:996] (3/4) Epoch 1, batch 22450, loss[loss=0.2876, simple_loss=0.3278, pruned_loss=0.1237, over 21227.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3629, pruned_loss=0.135, over 4283316.82 frames. ], batch size: 144, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 09:08:21,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=134880.0, ans=0.125 2023-06-18 09:08:57,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=134940.0, ans=0.0 2023-06-18 09:09:37,349 INFO [train.py:996] (3/4) Epoch 1, batch 22500, loss[loss=0.3229, simple_loss=0.3464, pruned_loss=0.1496, over 21204.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3578, pruned_loss=0.1341, over 4281281.05 frames. ], batch size: 471, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 09:09:39,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135000.0, ans=0.1 2023-06-18 09:10:38,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=135060.0, ans=0.125 2023-06-18 09:10:47,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=135120.0, ans=0.0 2023-06-18 09:11:15,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.294e+02 4.262e+02 5.582e+02 1.060e+03, threshold=8.524e+02, percent-clipped=5.0 2023-06-18 09:12:15,199 INFO [train.py:996] (3/4) Epoch 1, batch 22550, loss[loss=0.3062, simple_loss=0.3486, pruned_loss=0.1319, over 21874.00 frames. ], tot_loss[loss=0.3161, simple_loss=0.3626, pruned_loss=0.1348, over 4283249.88 frames. ], batch size: 282, lr: 2.53e-02, grad_scale: 64.0 2023-06-18 09:12:38,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-18 09:14:46,433 INFO [train.py:996] (3/4) Epoch 1, batch 22600, loss[loss=0.1986, simple_loss=0.2298, pruned_loss=0.08369, over 16348.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3646, pruned_loss=0.1341, over 4277237.00 frames. 
], batch size: 62, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 09:14:50,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=135600.0, ans=0.125 2023-06-18 09:15:16,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=135660.0, ans=0.125 2023-06-18 09:15:47,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=135720.0, ans=0.2 2023-06-18 09:16:01,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135780.0, ans=0.1 2023-06-18 09:16:04,382 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.237e+02 4.063e+02 5.145e+02 1.049e+03, threshold=8.126e+02, percent-clipped=2.0 2023-06-18 09:16:54,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=135840.0, ans=0.0 2023-06-18 09:17:09,567 INFO [train.py:996] (3/4) Epoch 1, batch 22650, loss[loss=0.3639, simple_loss=0.4275, pruned_loss=0.1501, over 21626.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3613, pruned_loss=0.1329, over 4277833.38 frames. ], batch size: 441, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:17:20,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135900.0, ans=0.1 2023-06-18 09:17:47,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=135960.0, ans=0.0 2023-06-18 09:17:50,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=135960.0, ans=0.125 2023-06-18 09:18:10,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=136020.0, ans=0.125 2023-06-18 09:19:33,394 INFO [train.py:996] (3/4) Epoch 1, batch 22700, loss[loss=0.3231, simple_loss=0.3569, pruned_loss=0.1446, over 21833.00 frames. ], tot_loss[loss=0.3105, simple_loss=0.3565, pruned_loss=0.1322, over 4273019.71 frames. ], batch size: 317, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:19:33,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=136200.0, ans=0.0 2023-06-18 09:19:33,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=136200.0, ans=0.125 2023-06-18 09:20:17,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.73 vs. limit=15.0 2023-06-18 09:21:14,816 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.452e+02 4.145e+02 4.705e+02 7.533e+02, threshold=8.290e+02, percent-clipped=0.0 2023-06-18 09:21:53,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=136440.0, ans=0.035 2023-06-18 09:22:06,622 INFO [train.py:996] (3/4) Epoch 1, batch 22750, loss[loss=0.3633, simple_loss=0.405, pruned_loss=0.1608, over 21738.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.357, pruned_loss=0.134, over 4260525.65 frames. 
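
The grad_scale field is the fp16 loss-scaling factor: it doubles from 32.0 to 64.0 at batch 22550 and is back at 32.0 by batch 22600, the signature of dynamic loss scaling that grows the scale after a run of clean steps and halves it on overflow. A standard torch.cuda.amp loop showing the mechanism (model, criterion, and batch names are placeholders; init_scale and growth_interval here are guesses, not values recovered from this run):

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

    def fp16_step(model, optimizer, batch, criterion):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = criterion(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()   # backward pass on loss * grad_scale
        scaler.step(optimizer)          # unscales grads; skips the step on inf/nan
        scaler.update()                 # doubles the scale after enough good steps,
                                        # halves it when an overflow is detected
        return loss.detach()
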
], batch size: 124, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:22:23,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=136500.0, ans=0.1 2023-06-18 09:22:37,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=136500.0, ans=0.125 2023-06-18 09:23:39,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=136620.0, ans=0.0 2023-06-18 09:24:25,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-18 09:24:40,985 INFO [train.py:996] (3/4) Epoch 1, batch 22800, loss[loss=0.3521, simple_loss=0.3894, pruned_loss=0.1575, over 21495.00 frames. ], tot_loss[loss=0.3185, simple_loss=0.3628, pruned_loss=0.1371, over 4270778.84 frames. ], batch size: 230, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:26:06,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 3.270e+02 3.833e+02 4.673e+02 9.508e+02, threshold=7.666e+02, percent-clipped=3.0 2023-06-18 09:26:07,970 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-06-18 09:26:09,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=136980.0, ans=0.125 2023-06-18 09:26:31,841 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-18 09:27:12,212 INFO [train.py:996] (3/4) Epoch 1, batch 22850, loss[loss=0.2633, simple_loss=0.3132, pruned_loss=0.1067, over 21662.00 frames. ], tot_loss[loss=0.3183, simple_loss=0.3627, pruned_loss=0.1369, over 4260928.92 frames. ], batch size: 247, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:28:17,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=12.0 2023-06-18 09:29:46,660 INFO [train.py:996] (3/4) Epoch 1, batch 22900, loss[loss=0.2453, simple_loss=0.2923, pruned_loss=0.09916, over 21727.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.363, pruned_loss=0.1348, over 4266712.86 frames. ], batch size: 112, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:30:15,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=137460.0, ans=0.0 2023-06-18 09:30:41,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=137460.0, ans=0.0 2023-06-18 09:31:30,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=137520.0, ans=0.125 2023-06-18 09:31:37,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.304e+02 3.880e+02 4.906e+02 7.646e+02, threshold=7.759e+02, percent-clipped=0.0 2023-06-18 09:31:56,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=137580.0, ans=0.125 2023-06-18 09:32:27,090 INFO [train.py:996] (3/4) Epoch 1, batch 22950, loss[loss=0.2905, simple_loss=0.3868, pruned_loss=0.09703, over 21682.00 frames. ], tot_loss[loss=0.3197, simple_loss=0.3742, pruned_loss=0.1326, over 4265107.40 frames. 
], batch size: 247, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:32:39,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.16 vs. limit=6.0 2023-06-18 09:33:16,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=137760.0, ans=10.0 2023-06-18 09:33:36,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.62 vs. limit=15.0 2023-06-18 09:34:09,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=137880.0, ans=0.125 2023-06-18 09:34:58,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=137940.0, ans=0.125 2023-06-18 09:35:12,972 INFO [train.py:996] (3/4) Epoch 1, batch 23000, loss[loss=0.3069, simple_loss=0.3648, pruned_loss=0.1245, over 21516.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3698, pruned_loss=0.1282, over 4274675.51 frames. ], batch size: 131, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:35:57,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=138060.0, ans=0.0 2023-06-18 09:36:53,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.985e+02 3.505e+02 4.304e+02 7.318e+02, threshold=7.010e+02, percent-clipped=0.0 2023-06-18 09:37:08,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.56 vs. limit=10.0 2023-06-18 09:37:16,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=12.0 2023-06-18 09:37:42,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=138240.0, ans=0.125 2023-06-18 09:37:44,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=138240.0, ans=0.2 2023-06-18 09:37:45,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-18 09:38:12,249 INFO [train.py:996] (3/4) Epoch 1, batch 23050, loss[loss=0.3782, simple_loss=0.4158, pruned_loss=0.1704, over 21555.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.3726, pruned_loss=0.1325, over 4278371.68 frames. ], batch size: 414, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:38:43,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=138360.0, ans=0.04949747468305833 2023-06-18 09:39:23,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=138420.0, ans=0.2 2023-06-18 09:40:27,146 INFO [train.py:996] (3/4) Epoch 1, batch 23100, loss[loss=0.2713, simple_loss=0.3091, pruned_loss=0.1167, over 21236.00 frames. ], tot_loss[loss=0.3167, simple_loss=0.3674, pruned_loss=0.1329, over 4266914.75 frames. 
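
Each scaling.py:962 Whitening line compares a per-module statistic against a limit (e.g. metric=5.16 vs. limit=6.0 for the whiten_keys entry above, with num_groups=8); entries appear when the metric is near or above its limit, the regime in which a corrective gradient pushes the activations back toward a "whiter" covariance. One plausible definition of such a metric is the spread of the feature-covariance eigenvalues, which is 1.0 for perfectly white features; this formula is an assumption for illustration, not a statement of icefall's exact one:

    import torch

    def whitening_metric(x, num_groups=1):
        # x: [N, C] activations; split channels into groups and measure how
        # far each group's covariance is from a multiple of the identity.
        n, c = x.shape
        g = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)  # [G, N, C/G]
        g = g - g.mean(dim=1, keepdim=True)
        cov = g.transpose(1, 2) @ g / n                                # [G, C/G, C/G]
        eigs = torch.linalg.eigvalsh(cov)                              # nonneg, ascending
        # mean of squared eigenvalues over squared mean: 1.0 iff all equal.
        return ((eigs ** 2).mean() / eigs.mean() ** 2).item()
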
], batch size: 548, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:40:27,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=138600.0, ans=0.125 2023-06-18 09:40:35,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=138600.0, ans=0.125 2023-06-18 09:41:23,072 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:42:11,461 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.111e+02 3.602e+02 4.415e+02 8.420e+02, threshold=7.204e+02, percent-clipped=6.0 2023-06-18 09:42:59,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-18 09:43:11,445 INFO [train.py:996] (3/4) Epoch 1, batch 23150, loss[loss=0.3018, simple_loss=0.3441, pruned_loss=0.1298, over 21856.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3608, pruned_loss=0.1314, over 4272258.13 frames. ], batch size: 298, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:43:30,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=138960.0, ans=0.125 2023-06-18 09:43:49,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-18 09:44:33,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=139080.0, ans=0.125 2023-06-18 09:45:33,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-18 09:45:38,685 INFO [train.py:996] (3/4) Epoch 1, batch 23200, loss[loss=0.3214, simple_loss=0.3626, pruned_loss=0.1401, over 21334.00 frames. ], tot_loss[loss=0.3124, simple_loss=0.36, pruned_loss=0.1324, over 4280951.66 frames. ], batch size: 159, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:46:11,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=139260.0, ans=0.2 2023-06-18 09:47:28,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.236e+02 3.747e+02 4.592e+02 8.495e+02, threshold=7.495e+02, percent-clipped=1.0 2023-06-18 09:47:28,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139380.0, ans=0.1 2023-06-18 09:47:30,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=139380.0, ans=0.125 2023-06-18 09:47:31,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=139380.0, ans=0.125 2023-06-18 09:48:05,952 INFO [train.py:996] (3/4) Epoch 1, batch 23250, loss[loss=0.3309, simple_loss=0.3693, pruned_loss=0.1462, over 21910.00 frames. ], tot_loss[loss=0.3149, simple_loss=0.361, pruned_loss=0.1344, over 4282303.45 frames. 
], batch size: 316, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:48:15,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=139500.0, ans=0.0 2023-06-18 09:49:37,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=139620.0, ans=0.5 2023-06-18 09:50:02,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-18 09:50:02,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=139680.0, ans=0.5 2023-06-18 09:50:38,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-18 09:50:40,070 INFO [train.py:996] (3/4) Epoch 1, batch 23300, loss[loss=0.3996, simple_loss=0.4768, pruned_loss=0.1612, over 21684.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.3701, pruned_loss=0.1377, over 4279361.39 frames. ], batch size: 389, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:52:00,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=139860.0, ans=0.125 2023-06-18 09:52:07,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=139860.0, ans=0.125 2023-06-18 09:52:31,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=139980.0, ans=0.0 2023-06-18 09:52:33,706 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.238e+02 4.090e+02 5.618e+02 9.282e+02, threshold=8.181e+02, percent-clipped=7.0 2023-06-18 09:52:47,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-18 09:53:27,947 INFO [train.py:996] (3/4) Epoch 1, batch 23350, loss[loss=0.3103, simple_loss=0.3728, pruned_loss=0.1238, over 21660.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3749, pruned_loss=0.1356, over 4280632.07 frames. ], batch size: 263, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:54:52,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=140220.0, ans=0.125 2023-06-18 09:54:52,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=140220.0, ans=0.0 2023-06-18 09:55:42,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=140340.0, ans=0.125 2023-06-18 09:55:53,816 INFO [train.py:996] (3/4) Epoch 1, batch 23400, loss[loss=0.3067, simple_loss=0.3379, pruned_loss=0.1378, over 20124.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3661, pruned_loss=0.1302, over 4274686.20 frames. 
], batch size: 703, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:57:44,604 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:57:45,605 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.809e+02 3.311e+02 4.029e+02 8.339e+02, threshold=6.622e+02, percent-clipped=1.0 2023-06-18 09:58:52,846 INFO [train.py:996] (3/4) Epoch 1, batch 23450, loss[loss=0.29, simple_loss=0.2974, pruned_loss=0.1413, over 20091.00 frames. ], tot_loss[loss=0.3193, simple_loss=0.3688, pruned_loss=0.1349, over 4279838.14 frames. ], batch size: 703, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 09:59:39,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=140760.0, ans=0.125 2023-06-18 09:59:41,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=140760.0, ans=0.0 2023-06-18 09:59:51,922 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:00:14,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=140880.0, ans=0.0 2023-06-18 10:00:46,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=22.5 2023-06-18 10:01:04,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=140940.0, ans=15.0 2023-06-18 10:01:07,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=140940.0, ans=0.0 2023-06-18 10:01:28,659 INFO [train.py:996] (3/4) Epoch 1, batch 23500, loss[loss=0.3262, simple_loss=0.3678, pruned_loss=0.1423, over 21903.00 frames. ], tot_loss[loss=0.3221, simple_loss=0.3694, pruned_loss=0.1374, over 4286979.55 frames. ], batch size: 351, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 10:01:29,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=141000.0, ans=0.0 2023-06-18 10:02:28,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=10.0 2023-06-18 10:02:29,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=141120.0, ans=0.125 2023-06-18 10:02:38,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.71 vs. limit=22.5 2023-06-18 10:03:03,291 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.176e+02 3.933e+02 4.936e+02 8.356e+02, threshold=7.866e+02, percent-clipped=7.0 2023-06-18 10:03:03,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=141180.0, ans=0.125 2023-06-18 10:03:20,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=141240.0, ans=0.0 2023-06-18 10:03:34,756 INFO [train.py:996] (3/4) Epoch 1, batch 23550, loss[loss=0.2861, simple_loss=0.3212, pruned_loss=0.1255, over 21172.00 frames. ], tot_loss[loss=0.3169, simple_loss=0.3624, pruned_loss=0.1357, over 4265145.88 frames. 
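
Batch sizes in this log swing from 62 to 703 utterances while the per-batch frame totals stay in a similar range (compare 16348 frames from 62 cuts at batch 22600 with 20091 frames from 703 cuts at batch 23450): batches are packed by total audio duration, not by utterance count, so short utterances yield large batches. A greedy duration-based batcher as a sketch (the 900-second cap and the function are illustrative, not the actual sampler):

    def duration_batches(cuts, max_duration=900.0):
        # cuts: iterable of (utterance_id, duration_seconds). Pack greedily
        # until the duration budget is hit, so batch *count* varies
        # inversely with utterance length (cf. batch sizes 62 vs. 703).
        batch, total = [], 0.0
        for cut in cuts:
            if batch and total + cut[1] > max_duration:
                yield batch
                batch, total = [], 0.0
            batch.append(cut)
            total += cut[1]
        if batch:
            yield batch
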
], batch size: 176, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 10:05:12,489 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-18 10:05:31,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=141480.0, ans=0.0 2023-06-18 10:06:23,365 INFO [train.py:996] (3/4) Epoch 1, batch 23600, loss[loss=0.4103, simple_loss=0.4383, pruned_loss=0.1912, over 21425.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.3648, pruned_loss=0.1364, over 4261016.36 frames. ], batch size: 471, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 10:08:08,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 3.095e+02 3.693e+02 4.407e+02 6.966e+02, threshold=7.385e+02, percent-clipped=0.0 2023-06-18 10:09:12,738 INFO [train.py:996] (3/4) Epoch 1, batch 23650, loss[loss=0.2607, simple_loss=0.3379, pruned_loss=0.09169, over 21695.00 frames. ], tot_loss[loss=0.3155, simple_loss=0.3638, pruned_loss=0.1336, over 4266568.99 frames. ], batch size: 298, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:09:55,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=141960.0, ans=10.0 2023-06-18 10:10:18,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=142020.0, ans=0.0 2023-06-18 10:10:20,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=142020.0, ans=0.125 2023-06-18 10:11:09,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=142080.0, ans=0.1 2023-06-18 10:11:51,522 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-06-18 10:11:51,767 INFO [train.py:996] (3/4) Epoch 1, batch 23700, loss[loss=0.3817, simple_loss=0.4248, pruned_loss=0.1693, over 21851.00 frames. ], tot_loss[loss=0.3161, simple_loss=0.3671, pruned_loss=0.1326, over 4269300.02 frames. ], batch size: 118, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:12:03,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=142200.0, ans=12.0 2023-06-18 10:12:18,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=17.83 vs. limit=15.0 2023-06-18 10:13:25,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.40 vs. limit=15.0 2023-06-18 10:13:44,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.277e+02 3.827e+02 4.804e+02 8.451e+02, threshold=7.655e+02, percent-clipped=1.0 2023-06-18 10:13:44,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=142380.0, ans=0.125 2023-06-18 10:14:11,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=142440.0, ans=0.0 2023-06-18 10:14:20,757 INFO [train.py:996] (3/4) Epoch 1, batch 23750, loss[loss=0.2971, simple_loss=0.3728, pruned_loss=0.1107, over 21720.00 frames. 
], tot_loss[loss=0.3181, simple_loss=0.3693, pruned_loss=0.1335, over 4272319.78 frames. ], batch size: 351, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:15:07,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=142560.0, ans=0.0 2023-06-18 10:15:07,150 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.257e-03 2023-06-18 10:15:31,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=142620.0, ans=0.0 2023-06-18 10:15:49,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=142620.0, ans=0.2 2023-06-18 10:16:48,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=142740.0, ans=0.2 2023-06-18 10:17:10,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=142740.0, ans=0.125 2023-06-18 10:17:14,675 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:17:15,877 INFO [train.py:996] (3/4) Epoch 1, batch 23800, loss[loss=0.3511, simple_loss=0.4199, pruned_loss=0.1411, over 21733.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.366, pruned_loss=0.1293, over 4274395.70 frames. ], batch size: 351, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:17:19,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=142800.0, ans=0.125 2023-06-18 10:17:20,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=142800.0, ans=0.025 2023-06-18 10:17:41,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=142860.0, ans=0.125 2023-06-18 10:18:47,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=142920.0, ans=0.125 2023-06-18 10:19:06,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 3.371e+02 4.073e+02 5.438e+02 8.873e+02, threshold=8.146e+02, percent-clipped=4.0 2023-06-18 10:20:03,706 INFO [train.py:996] (3/4) Epoch 1, batch 23850, loss[loss=0.468, simple_loss=0.4929, pruned_loss=0.2215, over 21360.00 frames. ], tot_loss[loss=0.3237, simple_loss=0.3791, pruned_loss=0.1341, over 4280093.87 frames. ], batch size: 507, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:21:14,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143220.0, ans=0.1 2023-06-18 10:21:25,423 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:21:27,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-18 10:22:38,220 INFO [train.py:996] (3/4) Epoch 1, batch 23900, loss[loss=0.3101, simple_loss=0.3622, pruned_loss=0.129, over 21190.00 frames. ], tot_loss[loss=0.331, simple_loss=0.3869, pruned_loss=0.1375, over 4283366.33 frames. 
], batch size: 159, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:23:21,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=143460.0, ans=0.2 2023-06-18 10:23:51,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-18 10:24:14,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.646e+02 4.391e+02 5.259e+02 8.608e+02, threshold=8.781e+02, percent-clipped=3.0 2023-06-18 10:25:07,004 INFO [train.py:996] (3/4) Epoch 1, batch 23950, loss[loss=0.3089, simple_loss=0.3391, pruned_loss=0.1394, over 20066.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.3803, pruned_loss=0.1375, over 4276694.12 frames. ], batch size: 702, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:26:06,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=143820.0, ans=0.125 2023-06-18 10:26:11,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=143820.0, ans=0.015 2023-06-18 10:26:14,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=143820.0, ans=0.125 2023-06-18 10:27:39,009 INFO [train.py:996] (3/4) Epoch 1, batch 24000, loss[loss=0.3461, simple_loss=0.3956, pruned_loss=0.1483, over 21356.00 frames. ], tot_loss[loss=0.3301, simple_loss=0.38, pruned_loss=0.14, over 4276109.17 frames. ], batch size: 176, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:27:39,009 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 10:28:35,856 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3093, simple_loss=0.4026, pruned_loss=0.108, over 1796401.00 frames. 2023-06-18 10:28:35,857 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-18 10:28:48,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=144000.0, ans=0.125 2023-06-18 10:29:54,933 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.332e+02 4.254e+02 5.266e+02 8.160e+02, threshold=8.508e+02, percent-clipped=0.0 2023-06-18 10:30:11,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=144180.0, ans=0.125 2023-06-18 10:30:44,741 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:30:47,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=144240.0, ans=0.125 2023-06-18 10:30:57,460 INFO [train.py:996] (3/4) Epoch 1, batch 24050, loss[loss=0.2451, simple_loss=0.3174, pruned_loss=0.08639, over 21169.00 frames. ], tot_loss[loss=0.3306, simple_loss=0.381, pruned_loss=0.1401, over 4274133.08 frames. 
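
At batch 24000 the trainer pauses to compute a validation loss over a fixed 1796401-frame dev set and then reports peak GPU memory (23918MB here). A sketch of that cadence (the 3000-batch interval, which divides 24000 evenly, and the function names are assumptions):

    import torch

    def maybe_validate(batch_idx, model, valid_dl, compute_loss, interval=3000):
        if batch_idx % interval != 0:
            return
        model.eval()                    # 'Computing validation loss'
        with torch.no_grad():
            for batch in valid_dl:
                compute_loss(model, batch)
        model.train()
        mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
        print(f"Maximum memory allocated so far is {mb}MB")
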
], batch size: 143, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:31:10,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=144300.0, ans=0.0 2023-06-18 10:31:37,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=144360.0, ans=0.125 2023-06-18 10:33:29,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=144540.0, ans=0.0 2023-06-18 10:33:33,025 INFO [train.py:996] (3/4) Epoch 1, batch 24100, loss[loss=0.2707, simple_loss=0.3269, pruned_loss=0.1072, over 16538.00 frames. ], tot_loss[loss=0.3274, simple_loss=0.3802, pruned_loss=0.1373, over 4271697.87 frames. ], batch size: 61, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:34:25,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=144660.0, ans=0.125 2023-06-18 10:34:41,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144720.0, ans=0.1 2023-06-18 10:35:11,617 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.942e+02 3.455e+02 4.270e+02 6.520e+02, threshold=6.911e+02, percent-clipped=0.0 2023-06-18 10:35:41,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.76 vs. limit=22.5 2023-06-18 10:35:56,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=144840.0, ans=0.125 2023-06-18 10:36:07,970 INFO [train.py:996] (3/4) Epoch 1, batch 24150, loss[loss=0.3163, simple_loss=0.3551, pruned_loss=0.1387, over 21486.00 frames. ], tot_loss[loss=0.3298, simple_loss=0.3801, pruned_loss=0.1398, over 4281501.67 frames. ], batch size: 211, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:36:23,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=144900.0, ans=0.125 2023-06-18 10:36:24,295 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-18 10:36:37,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=144960.0, ans=0.125 2023-06-18 10:37:29,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145020.0, ans=0.1 2023-06-18 10:37:49,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=145080.0, ans=0.125 2023-06-18 10:37:50,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=145080.0, ans=0.0 2023-06-18 10:38:33,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=145140.0, ans=0.0 2023-06-18 10:38:51,875 INFO [train.py:996] (3/4) Epoch 1, batch 24200, loss[loss=0.2834, simple_loss=0.3515, pruned_loss=0.1076, over 21456.00 frames. ], tot_loss[loss=0.3344, simple_loss=0.3834, pruned_loss=0.1427, over 4280598.59 frames. 
], batch size: 211, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:39:46,837 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:39:48,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=145260.0, ans=0.125 2023-06-18 10:40:14,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=145320.0, ans=0.125 2023-06-18 10:40:52,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.211e+02 3.713e+02 4.358e+02 6.625e+02, threshold=7.425e+02, percent-clipped=0.0 2023-06-18 10:40:54,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=145380.0, ans=0.125 2023-06-18 10:40:57,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=145380.0, ans=0.125 2023-06-18 10:41:47,347 INFO [train.py:996] (3/4) Epoch 1, batch 24250, loss[loss=0.3256, simple_loss=0.3951, pruned_loss=0.1281, over 21507.00 frames. ], tot_loss[loss=0.3209, simple_loss=0.3773, pruned_loss=0.1323, over 4286250.25 frames. ], batch size: 471, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:42:07,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=145500.0, ans=0.125 2023-06-18 10:42:09,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145500.0, ans=0.1 2023-06-18 10:42:17,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=145500.0, ans=0.125 2023-06-18 10:42:39,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=145560.0, ans=0.125 2023-06-18 10:44:19,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=145740.0, ans=0.2 2023-06-18 10:44:28,668 INFO [train.py:996] (3/4) Epoch 1, batch 24300, loss[loss=0.1673, simple_loss=0.2395, pruned_loss=0.04754, over 21085.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3653, pruned_loss=0.1227, over 4286523.55 frames. ], batch size: 143, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:46:15,197 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 2.565e+02 3.464e+02 4.560e+02 9.381e+02, threshold=6.928e+02, percent-clipped=3.0 2023-06-18 10:46:31,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=146040.0, ans=0.125 2023-06-18 10:46:52,486 INFO [train.py:996] (3/4) Epoch 1, batch 24350, loss[loss=0.3673, simple_loss=0.4196, pruned_loss=0.1574, over 21893.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3621, pruned_loss=0.1235, over 4286977.09 frames. 
], batch size: 118, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:48:07,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=146160.0, ans=0.0 2023-06-18 10:48:36,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=146220.0, ans=0.125 2023-06-18 10:48:40,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-18 10:48:59,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=146340.0, ans=0.2 2023-06-18 10:49:14,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-18 10:49:22,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-18 10:49:48,870 INFO [train.py:996] (3/4) Epoch 1, batch 24400, loss[loss=0.3103, simple_loss=0.3599, pruned_loss=0.1304, over 21801.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3706, pruned_loss=0.1303, over 4285657.82 frames. ], batch size: 107, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:49:49,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=146400.0, ans=0.0 2023-06-18 10:50:13,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=146460.0, ans=0.0 2023-06-18 10:51:03,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=146520.0, ans=0.2 2023-06-18 10:51:09,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.327e+02 4.021e+02 4.896e+02 7.862e+02, threshold=8.041e+02, percent-clipped=3.0 2023-06-18 10:51:10,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=146580.0, ans=0.125 2023-06-18 10:51:11,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=146580.0, ans=0.0 2023-06-18 10:52:18,230 INFO [train.py:996] (3/4) Epoch 1, batch 24450, loss[loss=0.3366, simple_loss=0.4055, pruned_loss=0.1338, over 21713.00 frames. ], tot_loss[loss=0.3193, simple_loss=0.3739, pruned_loss=0.1324, over 4287674.58 frames. ], batch size: 389, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:52:59,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=146760.0, ans=0.2 2023-06-18 10:53:27,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=146820.0, ans=0.125 2023-06-18 10:54:19,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=146880.0, ans=0.95 2023-06-18 10:54:22,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.07 vs. 
limit=15.0 2023-06-18 10:54:41,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=146940.0, ans=0.015 2023-06-18 10:54:49,354 INFO [train.py:996] (3/4) Epoch 1, batch 24500, loss[loss=0.3319, simple_loss=0.3897, pruned_loss=0.1371, over 21401.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3718, pruned_loss=0.1304, over 4286042.86 frames. ], batch size: 144, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 10:56:26,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147120.0, ans=0.1 2023-06-18 10:56:42,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=147180.0, ans=0.125 2023-06-18 10:56:45,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.166e+02 3.756e+02 4.568e+02 6.399e+02, threshold=7.511e+02, percent-clipped=0.0 2023-06-18 10:56:47,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=147180.0, ans=0.125 2023-06-18 10:56:58,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=147240.0, ans=0.05 2023-06-18 10:57:23,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=147240.0, ans=0.125 2023-06-18 10:57:32,115 INFO [train.py:996] (3/4) Epoch 1, batch 24550, loss[loss=0.3795, simple_loss=0.4239, pruned_loss=0.1675, over 21547.00 frames. ], tot_loss[loss=0.32, simple_loss=0.3743, pruned_loss=0.1329, over 4284823.33 frames. ], batch size: 389, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 10:57:56,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147300.0, ans=0.0 2023-06-18 10:58:14,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=147360.0, ans=0.125 2023-06-18 10:58:26,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=147360.0, ans=0.0 2023-06-18 10:59:00,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=147480.0, ans=0.125 2023-06-18 11:00:07,913 INFO [train.py:996] (3/4) Epoch 1, batch 24600, loss[loss=0.3398, simple_loss=0.3616, pruned_loss=0.159, over 19970.00 frames. ], tot_loss[loss=0.3201, simple_loss=0.371, pruned_loss=0.1346, over 4278513.80 frames. ], batch size: 703, lr: 2.43e-02, grad_scale: 64.0 2023-06-18 11:00:14,597 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.45 vs. 
limit=5.0 2023-06-18 11:00:34,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=147660.0, ans=0.125 2023-06-18 11:01:00,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=147720.0, ans=0.125 2023-06-18 11:01:26,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.629e+02 3.281e+02 4.105e+02 4.989e+02 7.810e+02, threshold=8.210e+02, percent-clipped=2.0 2023-06-18 11:01:27,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147780.0, ans=0.0 2023-06-18 11:02:16,244 INFO [train.py:996] (3/4) Epoch 1, batch 24650, loss[loss=0.3245, simple_loss=0.3507, pruned_loss=0.1492, over 21511.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3625, pruned_loss=0.1329, over 4277037.81 frames. ], batch size: 441, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 11:02:53,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=147960.0, ans=0.125 2023-06-18 11:02:57,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=147960.0, ans=0.125 2023-06-18 11:03:28,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=148020.0, ans=0.2 2023-06-18 11:04:19,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=148140.0, ans=0.2 2023-06-18 11:04:47,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=148200.0, ans=0.125 2023-06-18 11:04:48,084 INFO [train.py:996] (3/4) Epoch 1, batch 24700, loss[loss=0.3271, simple_loss=0.3653, pruned_loss=0.1444, over 21244.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3603, pruned_loss=0.1305, over 4269864.78 frames. ], batch size: 471, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 11:04:58,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148200.0, ans=0.1 2023-06-18 11:05:34,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148320.0, ans=0.1 2023-06-18 11:05:36,458 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-18 11:06:10,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.936e+02 3.514e+02 4.339e+02 5.892e+02, threshold=7.028e+02, percent-clipped=0.0 2023-06-18 11:06:36,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=148380.0, ans=0.0 2023-06-18 11:07:20,639 INFO [train.py:996] (3/4) Epoch 1, batch 24750, loss[loss=0.2604, simple_loss=0.3042, pruned_loss=0.1083, over 21453.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3527, pruned_loss=0.127, over 4260955.81 frames. 
], batch size: 212, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:07:46,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=148560.0, ans=0.125 2023-06-18 11:08:43,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148680.0, ans=0.1 2023-06-18 11:09:13,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=148740.0, ans=0.0 2023-06-18 11:09:15,353 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-18 11:09:41,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=148740.0, ans=0.125 2023-06-18 11:09:44,197 INFO [train.py:996] (3/4) Epoch 1, batch 24800, loss[loss=0.3849, simple_loss=0.3908, pruned_loss=0.1895, over 21626.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3485, pruned_loss=0.1258, over 4260276.27 frames. ], batch size: 508, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:10:16,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=148860.0, ans=0.0 2023-06-18 11:10:16,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=148860.0, ans=0.2 2023-06-18 11:10:56,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=148980.0, ans=0.05 2023-06-18 11:11:21,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.310e+02 3.842e+02 5.008e+02 1.003e+03, threshold=7.684e+02, percent-clipped=5.0 2023-06-18 11:11:24,577 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:11:49,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=149040.0, ans=6.0 2023-06-18 11:12:17,396 INFO [train.py:996] (3/4) Epoch 1, batch 24850, loss[loss=0.2681, simple_loss=0.3223, pruned_loss=0.1069, over 21647.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.3499, pruned_loss=0.1275, over 4262919.98 frames. ], batch size: 263, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:12:21,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.60 vs. 
limit=12.0 2023-06-18 11:12:30,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=149100.0, ans=0.125 2023-06-18 11:12:31,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=149100.0, ans=0.2 2023-06-18 11:12:44,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=149160.0, ans=0.125 2023-06-18 11:12:46,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=149160.0, ans=0.125 2023-06-18 11:12:52,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=149160.0, ans=10.0 2023-06-18 11:13:35,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=149220.0, ans=0.0 2023-06-18 11:14:16,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=149340.0, ans=0.0 2023-06-18 11:14:30,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=149340.0, ans=0.125 2023-06-18 11:14:39,664 INFO [train.py:996] (3/4) Epoch 1, batch 24900, loss[loss=0.3665, simple_loss=0.4125, pruned_loss=0.1602, over 21784.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3523, pruned_loss=0.1285, over 4263303.22 frames. ], batch size: 124, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:15:40,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=149520.0, ans=0.125 2023-06-18 11:16:07,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.18 vs. limit=10.0 2023-06-18 11:16:14,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=149580.0, ans=0.0 2023-06-18 11:16:37,380 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:16:38,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 3.164e+02 3.727e+02 4.575e+02 6.932e+02, threshold=7.454e+02, percent-clipped=0.0 2023-06-18 11:16:41,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149580.0, ans=0.1 2023-06-18 11:17:28,498 INFO [train.py:996] (3/4) Epoch 1, batch 24950, loss[loss=0.4189, simple_loss=0.4464, pruned_loss=0.1957, over 21792.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.363, pruned_loss=0.1349, over 4270466.94 frames. ], batch size: 441, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 11:18:27,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.15 vs. limit=15.0 2023-06-18 11:19:01,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=149820.0, ans=0.04949747468305833 2023-06-18 11:19:17,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.61 vs. 
limit=15.0 2023-06-18 11:19:31,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=149880.0, ans=0.0 2023-06-18 11:19:55,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=15.0 2023-06-18 11:20:06,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=149940.0, ans=0.0 2023-06-18 11:20:09,061 INFO [train.py:996] (3/4) Epoch 1, batch 25000, loss[loss=0.3349, simple_loss=0.3997, pruned_loss=0.1351, over 21612.00 frames. ], tot_loss[loss=0.3221, simple_loss=0.3699, pruned_loss=0.1371, over 4273109.58 frames. ], batch size: 263, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 11:21:50,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 3.123e+02 3.491e+02 4.208e+02 6.099e+02, threshold=6.982e+02, percent-clipped=0.0 2023-06-18 11:22:10,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150180.0, ans=0.1 2023-06-18 11:22:26,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=150240.0, ans=0.0 2023-06-18 11:22:36,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=150240.0, ans=0.125 2023-06-18 11:22:50,208 INFO [train.py:996] (3/4) Epoch 1, batch 25050, loss[loss=0.3072, simple_loss=0.3418, pruned_loss=0.1363, over 21442.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3628, pruned_loss=0.1349, over 4271967.13 frames. ], batch size: 441, lr: 2.41e-02, grad_scale: 16.0 2023-06-18 11:24:14,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=150420.0, ans=0.0 2023-06-18 11:24:22,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=150420.0, ans=0.2 2023-06-18 11:24:54,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=150540.0, ans=0.125 2023-06-18 11:25:21,803 INFO [train.py:996] (3/4) Epoch 1, batch 25100, loss[loss=0.2585, simple_loss=0.2969, pruned_loss=0.11, over 20746.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3556, pruned_loss=0.1323, over 4267471.86 frames. 
], batch size: 608, lr: 2.41e-02, grad_scale: 16.0 2023-06-18 11:25:33,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=150600.0, ans=0.0 2023-06-18 11:26:05,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=150660.0, ans=0.125 2023-06-18 11:26:14,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=150660.0, ans=0.2 2023-06-18 11:26:18,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=150720.0, ans=0.125 2023-06-18 11:27:00,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.018e+02 3.792e+02 4.723e+02 7.366e+02, threshold=7.583e+02, percent-clipped=2.0 2023-06-18 11:27:12,054 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-18 11:27:19,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.81 vs. limit=22.5 2023-06-18 11:27:52,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=150900.0, ans=0.0 2023-06-18 11:27:54,108 INFO [train.py:996] (3/4) Epoch 1, batch 25150, loss[loss=0.273, simple_loss=0.3586, pruned_loss=0.0937, over 21811.00 frames. ], tot_loss[loss=0.3064, simple_loss=0.3565, pruned_loss=0.1281, over 4259817.85 frames. ], batch size: 282, lr: 2.41e-02, grad_scale: 16.0 2023-06-18 11:28:28,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151020.0, ans=0.1 2023-06-18 11:28:49,399 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:28:55,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=151020.0, ans=0.0 2023-06-18 11:29:11,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151080.0, ans=0.1 2023-06-18 11:29:51,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=151140.0, ans=0.0 2023-06-18 11:30:17,529 INFO [train.py:996] (3/4) Epoch 1, batch 25200, loss[loss=0.2675, simple_loss=0.3403, pruned_loss=0.09732, over 21667.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3549, pruned_loss=0.125, over 4259447.66 frames. 
], batch size: 230, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:30:32,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=151260.0, ans=0.125 2023-06-18 11:31:22,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=151320.0, ans=0.0 2023-06-18 11:31:55,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.820e+02 3.433e+02 4.407e+02 9.386e+02, threshold=6.866e+02, percent-clipped=4.0 2023-06-18 11:31:58,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=151380.0, ans=0.125 2023-06-18 11:32:06,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=151440.0, ans=0.04949747468305833 2023-06-18 11:32:40,294 INFO [train.py:996] (3/4) Epoch 1, batch 25250, loss[loss=0.246, simple_loss=0.2998, pruned_loss=0.09612, over 21219.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3528, pruned_loss=0.1229, over 4255220.11 frames. ], batch size: 176, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:34:00,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=151620.0, ans=0.2 2023-06-18 11:34:33,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=151740.0, ans=0.125 2023-06-18 11:34:33,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=151740.0, ans=0.125 2023-06-18 11:34:48,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=151740.0, ans=0.125 2023-06-18 11:35:17,335 INFO [train.py:996] (3/4) Epoch 1, batch 25300, loss[loss=0.2887, simple_loss=0.3056, pruned_loss=0.1359, over 20296.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3492, pruned_loss=0.1227, over 4252640.19 frames. ], batch size: 703, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:36:22,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=151920.0, ans=0.125 2023-06-18 11:36:37,260 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:37:01,678 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.038e+02 3.574e+02 4.341e+02 5.884e+02, threshold=7.148e+02, percent-clipped=0.0 2023-06-18 11:37:19,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=152040.0, ans=0.125 2023-06-18 11:37:36,672 INFO [train.py:996] (3/4) Epoch 1, batch 25350, loss[loss=0.3518, simple_loss=0.3859, pruned_loss=0.1588, over 21393.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3544, pruned_loss=0.1237, over 4256861.17 frames. ], batch size: 507, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:39:01,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=152220.0, ans=0.0 2023-06-18 11:39:09,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. 
limit=15.0 2023-06-18 11:39:50,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=152340.0, ans=0.125 2023-06-18 11:39:52,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=152340.0, ans=0.2 2023-06-18 11:39:52,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=152340.0, ans=0.0 2023-06-18 11:40:09,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=152400.0, ans=0.04949747468305833 2023-06-18 11:40:10,522 INFO [train.py:996] (3/4) Epoch 1, batch 25400, loss[loss=0.2806, simple_loss=0.3275, pruned_loss=0.1168, over 21272.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3505, pruned_loss=0.122, over 4248275.76 frames. ], batch size: 159, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:40:27,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=152400.0, ans=0.07 2023-06-18 11:40:58,509 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:41:10,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=152520.0, ans=0.2 2023-06-18 11:41:54,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=152580.0, ans=0.0 2023-06-18 11:41:58,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 3.135e+02 3.772e+02 5.152e+02 8.447e+02, threshold=7.545e+02, percent-clipped=8.0 2023-06-18 11:42:00,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=152580.0, ans=0.125 2023-06-18 11:42:15,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-18 11:42:39,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=152640.0, ans=0.125 2023-06-18 11:42:48,255 INFO [train.py:996] (3/4) Epoch 1, batch 25450, loss[loss=0.3137, simple_loss=0.364, pruned_loss=0.1317, over 21862.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3508, pruned_loss=0.1238, over 4243286.67 frames. ], batch size: 107, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:43:29,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152760.0, ans=0.1 2023-06-18 11:44:06,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=152820.0, ans=0.125 2023-06-18 11:45:02,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=152940.0, ans=0.125 2023-06-18 11:45:11,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=152940.0, ans=0.125 2023-06-18 11:45:17,148 INFO [train.py:996] (3/4) Epoch 1, batch 25500, loss[loss=0.3157, simple_loss=0.3793, pruned_loss=0.126, over 21888.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3484, pruned_loss=0.1198, over 4237847.49 frames. 
], batch size: 372, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:45:55,247 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0 2023-06-18 11:46:46,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-18 11:47:14,258 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.935e+02 3.590e+02 4.450e+02 7.910e+02, threshold=7.180e+02, percent-clipped=1.0 2023-06-18 11:48:00,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-18 11:48:09,545 INFO [train.py:996] (3/4) Epoch 1, batch 25550, loss[loss=0.268, simple_loss=0.3277, pruned_loss=0.1042, over 21433.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.355, pruned_loss=0.1203, over 4248114.53 frames. ], batch size: 131, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:48:11,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=153300.0, ans=0.125 2023-06-18 11:48:53,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=153360.0, ans=0.125 2023-06-18 11:49:47,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=153420.0, ans=0.125 2023-06-18 11:50:37,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=153540.0, ans=0.0 2023-06-18 11:51:05,022 INFO [train.py:996] (3/4) Epoch 1, batch 25600, loss[loss=0.3742, simple_loss=0.4083, pruned_loss=0.1701, over 21367.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3616, pruned_loss=0.1223, over 4258967.30 frames. ], batch size: 548, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:51:09,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153600.0, ans=0.1 2023-06-18 11:52:57,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 3.009e+02 3.848e+02 5.961e+02 1.110e+03, threshold=7.697e+02, percent-clipped=15.0 2023-06-18 11:53:14,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=153840.0, ans=0.125 2023-06-18 11:53:28,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=153840.0, ans=0.125 2023-06-18 11:53:35,682 INFO [train.py:996] (3/4) Epoch 1, batch 25650, loss[loss=0.3093, simple_loss=0.3428, pruned_loss=0.1379, over 21689.00 frames. ], tot_loss[loss=0.3094, simple_loss=0.3646, pruned_loss=0.1272, over 4260088.30 frames. ], batch size: 333, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 11:53:45,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=153900.0, ans=0.0 2023-06-18 11:53:47,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. 
limit=6.0 2023-06-18 11:54:03,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=153960.0, ans=0.1 2023-06-18 11:54:03,890 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-18 11:56:09,170 INFO [train.py:996] (3/4) Epoch 1, batch 25700, loss[loss=0.3138, simple_loss=0.3682, pruned_loss=0.1297, over 21771.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3638, pruned_loss=0.13, over 4263012.08 frames. ], batch size: 112, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 11:58:10,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 3.158e+02 3.682e+02 4.301e+02 9.092e+02, threshold=7.363e+02, percent-clipped=1.0 2023-06-18 11:58:28,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=154440.0, ans=0.2 2023-06-18 11:59:00,241 INFO [train.py:996] (3/4) Epoch 1, batch 25750, loss[loss=0.3656, simple_loss=0.4217, pruned_loss=0.1547, over 21564.00 frames. ], tot_loss[loss=0.3198, simple_loss=0.3708, pruned_loss=0.1344, over 4261370.56 frames. ], batch size: 230, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 11:59:24,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154500.0, ans=0.1 2023-06-18 11:59:42,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=154560.0, ans=0.125 2023-06-18 12:01:02,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=154680.0, ans=0.125 2023-06-18 12:01:57,099 INFO [train.py:996] (3/4) Epoch 1, batch 25800, loss[loss=0.357, simple_loss=0.41, pruned_loss=0.152, over 21332.00 frames. ], tot_loss[loss=0.3322, simple_loss=0.3827, pruned_loss=0.1408, over 4266378.07 frames. ], batch size: 548, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 12:03:41,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-18 12:03:53,279 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.41 vs. limit=10.0 2023-06-18 12:03:55,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-18 12:03:56,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.535e+02 4.132e+02 5.213e+02 8.329e+02, threshold=8.265e+02, percent-clipped=2.0 2023-06-18 12:04:03,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=154980.0, ans=0.125 2023-06-18 12:04:03,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.47 vs. limit=15.0 2023-06-18 12:04:07,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=154980.0, ans=0.125 2023-06-18 12:05:02,214 INFO [train.py:996] (3/4) Epoch 1, batch 25850, loss[loss=0.3275, simple_loss=0.3847, pruned_loss=0.1352, over 19940.00 frames. 
], tot_loss[loss=0.3316, simple_loss=0.385, pruned_loss=0.1391, over 4268601.76 frames. ], batch size: 702, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 12:05:22,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=22.5 2023-06-18 12:05:31,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155160.0, ans=0.1 2023-06-18 12:07:03,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=155340.0, ans=0.2 2023-06-18 12:07:06,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155340.0, ans=0.1 2023-06-18 12:07:30,263 INFO [train.py:996] (3/4) Epoch 1, batch 25900, loss[loss=0.5107, simple_loss=0.5297, pruned_loss=0.2459, over 21572.00 frames. ], tot_loss[loss=0.3322, simple_loss=0.386, pruned_loss=0.1392, over 4275318.63 frames. ], batch size: 507, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 12:08:16,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=155460.0, ans=0.125 2023-06-18 12:08:20,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=155460.0, ans=0.125 2023-06-18 12:08:38,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=155460.0, ans=0.125 2023-06-18 12:09:39,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.491e+02 3.281e+02 3.976e+02 5.347e+02 9.829e+02, threshold=7.952e+02, percent-clipped=3.0 2023-06-18 12:10:15,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=155640.0, ans=0.125 2023-06-18 12:10:18,126 INFO [train.py:996] (3/4) Epoch 1, batch 25950, loss[loss=0.3059, simple_loss=0.3644, pruned_loss=0.1237, over 21376.00 frames. ], tot_loss[loss=0.3359, simple_loss=0.3895, pruned_loss=0.1411, over 4274910.01 frames. ], batch size: 159, lr: 2.37e-02, grad_scale: 16.0 2023-06-18 12:12:18,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.48 vs. limit=10.0 2023-06-18 12:12:23,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=155940.0, ans=0.2 2023-06-18 12:12:33,071 INFO [train.py:996] (3/4) Epoch 1, batch 26000, loss[loss=0.3553, simple_loss=0.4097, pruned_loss=0.1505, over 21990.00 frames. ], tot_loss[loss=0.3351, simple_loss=0.3915, pruned_loss=0.1393, over 4274142.44 frames. ], batch size: 317, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 12:12:58,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=156000.0, ans=0.125 2023-06-18 12:13:27,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156060.0, ans=0.1 2023-06-18 12:14:21,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.86 vs. 
limit=22.5 2023-06-18 12:14:44,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.144e+02 3.654e+02 4.482e+02 8.156e+02, threshold=7.307e+02, percent-clipped=1.0 2023-06-18 12:14:54,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=156240.0, ans=0.0 2023-06-18 12:15:22,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=156300.0, ans=0.125 2023-06-18 12:15:24,717 INFO [train.py:996] (3/4) Epoch 1, batch 26050, loss[loss=0.3372, simple_loss=0.4046, pruned_loss=0.1349, over 17754.00 frames. ], tot_loss[loss=0.3365, simple_loss=0.3913, pruned_loss=0.1409, over 4271079.58 frames. ], batch size: 60, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 12:16:11,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=156360.0, ans=0.0 2023-06-18 12:18:04,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156600.0, ans=0.1 2023-06-18 12:18:05,269 INFO [train.py:996] (3/4) Epoch 1, batch 26100, loss[loss=0.2797, simple_loss=0.3342, pruned_loss=0.1125, over 20122.00 frames. ], tot_loss[loss=0.3327, simple_loss=0.3855, pruned_loss=0.1399, over 4268372.45 frames. ], batch size: 702, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:18:38,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=156600.0, ans=0.04949747468305833 2023-06-18 12:18:48,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=156660.0, ans=0.125 2023-06-18 12:19:41,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.360e+02 3.963e+02 4.876e+02 8.517e+02, threshold=7.926e+02, percent-clipped=3.0 2023-06-18 12:19:44,312 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-18 12:20:46,682 INFO [train.py:996] (3/4) Epoch 1, batch 26150, loss[loss=0.3139, simple_loss=0.3479, pruned_loss=0.14, over 19991.00 frames. ], tot_loss[loss=0.3296, simple_loss=0.3804, pruned_loss=0.1394, over 4266055.50 frames. ], batch size: 702, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:22:22,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=15.0 2023-06-18 12:23:24,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=157140.0, ans=0.0 2023-06-18 12:23:34,851 INFO [train.py:996] (3/4) Epoch 1, batch 26200, loss[loss=0.3111, simple_loss=0.3673, pruned_loss=0.1274, over 21848.00 frames. ], tot_loss[loss=0.3266, simple_loss=0.3804, pruned_loss=0.1364, over 4271572.29 frames. ], batch size: 118, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:25:05,670 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. 
limit=15.0 2023-06-18 12:25:21,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.203e+02 3.929e+02 5.143e+02 8.013e+02, threshold=7.858e+02, percent-clipped=1.0 2023-06-18 12:26:24,268 INFO [train.py:996] (3/4) Epoch 1, batch 26250, loss[loss=0.3436, simple_loss=0.3945, pruned_loss=0.1463, over 21968.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.3853, pruned_loss=0.1351, over 4276481.21 frames. ], batch size: 124, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:26:37,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=22.5 2023-06-18 12:27:04,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157560.0, ans=0.1 2023-06-18 12:27:20,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=157560.0, ans=0.125 2023-06-18 12:27:25,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=157620.0, ans=0.5 2023-06-18 12:27:26,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=157620.0, ans=0.125 2023-06-18 12:28:05,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-18 12:28:52,332 INFO [train.py:996] (3/4) Epoch 1, batch 26300, loss[loss=0.2893, simple_loss=0.3458, pruned_loss=0.1164, over 21739.00 frames. ], tot_loss[loss=0.3267, simple_loss=0.3813, pruned_loss=0.136, over 4282800.94 frames. ], batch size: 112, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:29:34,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-18 12:29:36,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=157860.0, ans=0.125 2023-06-18 12:29:48,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-18 12:29:52,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157860.0, ans=0.1 2023-06-18 12:30:46,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.240e+02 3.042e+02 3.851e+02 4.638e+02 8.463e+02, threshold=7.702e+02, percent-clipped=2.0 2023-06-18 12:31:30,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=158040.0, ans=0.0 2023-06-18 12:31:47,945 INFO [train.py:996] (3/4) Epoch 1, batch 26350, loss[loss=0.3585, simple_loss=0.413, pruned_loss=0.152, over 21822.00 frames. ], tot_loss[loss=0.3263, simple_loss=0.3793, pruned_loss=0.1367, over 4290633.57 frames. 
], batch size: 118, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:32:35,736 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:32:51,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=158220.0, ans=0.0 2023-06-18 12:32:53,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=158220.0, ans=0.0 2023-06-18 12:34:12,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=158340.0, ans=0.125 2023-06-18 12:34:19,819 INFO [train.py:996] (3/4) Epoch 1, batch 26400, loss[loss=0.3156, simple_loss=0.3379, pruned_loss=0.1467, over 21521.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3735, pruned_loss=0.1366, over 4274667.71 frames. ], batch size: 441, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:34:46,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158460.0, ans=0.1 2023-06-18 12:35:12,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=158460.0, ans=0.0 2023-06-18 12:35:17,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.49 vs. limit=22.5 2023-06-18 12:36:06,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.087e+02 3.915e+02 4.820e+02 6.818e+02, threshold=7.829e+02, percent-clipped=0.0 2023-06-18 12:36:22,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=158580.0, ans=0.125 2023-06-18 12:36:54,688 INFO [train.py:996] (3/4) Epoch 1, batch 26450, loss[loss=0.2675, simple_loss=0.3094, pruned_loss=0.1128, over 21833.00 frames. ], tot_loss[loss=0.3225, simple_loss=0.3723, pruned_loss=0.1364, over 4268205.30 frames. ], batch size: 107, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:37:38,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=158760.0, ans=0.05 2023-06-18 12:37:40,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0 2023-06-18 12:38:49,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=158880.0, ans=0.2 2023-06-18 12:38:50,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158880.0, ans=0.1 2023-06-18 12:40:01,473 INFO [train.py:996] (3/4) Epoch 1, batch 26500, loss[loss=0.2585, simple_loss=0.3176, pruned_loss=0.09971, over 21782.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.375, pruned_loss=0.1348, over 4268465.82 frames. 
], batch size: 247, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:40:33,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=159060.0, ans=15.0 2023-06-18 12:40:47,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=159060.0, ans=0.2 2023-06-18 12:41:57,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.526e+02 4.481e+02 5.627e+02 1.219e+03, threshold=8.961e+02, percent-clipped=9.0 2023-06-18 12:42:24,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=159180.0, ans=0.5 2023-06-18 12:42:27,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=159240.0, ans=0.125 2023-06-18 12:42:42,862 INFO [train.py:996] (3/4) Epoch 1, batch 26550, loss[loss=0.2411, simple_loss=0.3032, pruned_loss=0.08944, over 21520.00 frames. ], tot_loss[loss=0.3138, simple_loss=0.3691, pruned_loss=0.1293, over 4266170.79 frames. ], batch size: 212, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:42:49,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=159300.0, ans=0.125 2023-06-18 12:43:45,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159360.0, ans=0.1 2023-06-18 12:44:05,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=159420.0, ans=0.0 2023-06-18 12:45:00,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=159480.0, ans=0.0 2023-06-18 12:45:44,645 INFO [train.py:996] (3/4) Epoch 1, batch 26600, loss[loss=0.3675, simple_loss=0.3943, pruned_loss=0.1703, over 21520.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3672, pruned_loss=0.1253, over 4266027.49 frames. ], batch size: 441, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:46:03,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=159600.0, ans=0.125 2023-06-18 12:46:36,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=159660.0, ans=0.0 2023-06-18 12:47:17,851 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:47:42,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.824e+02 3.393e+02 4.003e+02 5.713e+02, threshold=6.786e+02, percent-clipped=0.0 2023-06-18 12:48:17,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=159840.0, ans=0.125 2023-06-18 12:48:24,329 INFO [train.py:996] (3/4) Epoch 1, batch 26650, loss[loss=0.204, simple_loss=0.2805, pruned_loss=0.06377, over 21530.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.36, pruned_loss=0.1241, over 4257189.01 frames. ], batch size: 230, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:48:48,956 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. 
limit=15.0 2023-06-18 12:48:59,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.99 vs. limit=15.0 2023-06-18 12:49:41,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=160020.0, ans=0.0 2023-06-18 12:50:47,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=160140.0, ans=0.04949747468305833 2023-06-18 12:50:50,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-18 12:50:54,361 INFO [train.py:996] (3/4) Epoch 1, batch 26700, loss[loss=0.3448, simple_loss=0.3757, pruned_loss=0.1569, over 21721.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3525, pruned_loss=0.1206, over 4255228.66 frames. ], batch size: 473, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:50:59,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0 2023-06-18 12:51:07,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160200.0, ans=0.125 2023-06-18 12:51:18,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=160260.0, ans=0.95 2023-06-18 12:51:45,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-18 12:52:36,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.796e+02 3.328e+02 4.126e+02 7.819e+02, threshold=6.657e+02, percent-clipped=1.0 2023-06-18 12:53:35,173 INFO [train.py:996] (3/4) Epoch 1, batch 26750, loss[loss=0.2737, simple_loss=0.3544, pruned_loss=0.09648, over 21739.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3511, pruned_loss=0.1185, over 4259878.02 frames. ], batch size: 298, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:53:35,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=160500.0, ans=0.0 2023-06-18 12:53:56,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=160500.0, ans=0.07 2023-06-18 12:54:03,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. 
limit=15.0 2023-06-18 12:55:08,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160620.0, ans=0.125 2023-06-18 12:55:27,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=160680.0, ans=15.0 2023-06-18 12:55:28,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=160680.0, ans=0.2 2023-06-18 12:55:44,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=160740.0, ans=0.125 2023-06-18 12:55:54,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=160740.0, ans=0.125 2023-06-18 12:55:57,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=160740.0, ans=0.0 2023-06-18 12:56:10,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160740.0, ans=0.1 2023-06-18 12:56:16,275 INFO [train.py:996] (3/4) Epoch 1, batch 26800, loss[loss=0.3436, simple_loss=0.3848, pruned_loss=0.1512, over 20676.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3613, pruned_loss=0.1258, over 4269971.86 frames. ], batch size: 607, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 12:56:19,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-18 12:57:33,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=160920.0, ans=0.125 2023-06-18 12:57:45,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.76 vs. limit=8.0 2023-06-18 12:57:56,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=160980.0, ans=0.0 2023-06-18 12:57:56,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=160980.0, ans=0.2 2023-06-18 12:58:01,658 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.459e+02 3.380e+02 3.968e+02 4.895e+02 8.004e+02, threshold=7.935e+02, percent-clipped=7.0 2023-06-18 12:58:45,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=161100.0, ans=0.0 2023-06-18 12:58:46,431 INFO [train.py:996] (3/4) Epoch 1, batch 26850, loss[loss=0.316, simple_loss=0.3442, pruned_loss=0.1439, over 21642.00 frames. ], tot_loss[loss=0.3122, simple_loss=0.3645, pruned_loss=0.1299, over 4273929.91 frames. 
], batch size: 415, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 12:59:19,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=161100.0, ans=0.0 2023-06-18 12:59:22,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161160.0, ans=0.1 2023-06-18 12:59:58,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=161220.0, ans=0.125 2023-06-18 13:00:54,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=161340.0, ans=0.125 2023-06-18 13:01:10,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=161340.0, ans=15.0 2023-06-18 13:01:17,654 INFO [train.py:996] (3/4) Epoch 1, batch 26900, loss[loss=0.2818, simple_loss=0.3191, pruned_loss=0.1222, over 21699.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.355, pruned_loss=0.1277, over 4278488.38 frames. ], batch size: 417, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 13:01:41,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=161460.0, ans=0.2 2023-06-18 13:02:57,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=161580.0, ans=0.2 2023-06-18 13:02:58,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 2.755e+02 3.329e+02 3.765e+02 8.557e+02, threshold=6.658e+02, percent-clipped=2.0 2023-06-18 13:03:01,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=161580.0, ans=0.0 2023-06-18 13:03:37,563 INFO [train.py:996] (3/4) Epoch 1, batch 26950, loss[loss=0.3838, simple_loss=0.4384, pruned_loss=0.1646, over 21614.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.3565, pruned_loss=0.1276, over 4281050.60 frames. ], batch size: 441, lr: 2.33e-02, grad_scale: 16.0 2023-06-18 13:04:08,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=161700.0, ans=0.125 2023-06-18 13:05:43,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=161940.0, ans=0.05 2023-06-18 13:05:57,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-18 13:06:17,432 INFO [train.py:996] (3/4) Epoch 1, batch 27000, loss[loss=0.2917, simple_loss=0.364, pruned_loss=0.1097, over 21661.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3551, pruned_loss=0.1233, over 4276944.45 frames. ], batch size: 414, lr: 2.33e-02, grad_scale: 16.0 2023-06-18 13:06:17,433 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 13:06:51,857 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.8857, 4.9933, 2.0857, 4.3473], device='cuda:3') 2023-06-18 13:06:56,172 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.2848, simple_loss=0.3741, pruned_loss=0.09774, over 1796401.00 frames. 
2023-06-18 13:06:56,174 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-18 13:07:02,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=162000.0, ans=0.0 2023-06-18 13:08:05,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=162120.0, ans=0.0 2023-06-18 13:08:26,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.842e+02 3.323e+02 4.144e+02 6.549e+02, threshold=6.646e+02, percent-clipped=0.0 2023-06-18 13:09:12,789 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-18 13:09:19,643 INFO [train.py:996] (3/4) Epoch 1, batch 27050, loss[loss=0.3323, simple_loss=0.3924, pruned_loss=0.1361, over 21718.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3556, pruned_loss=0.1188, over 4274693.13 frames. ], batch size: 389, lr: 2.33e-02, grad_scale: 16.0 2023-06-18 13:09:24,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=162300.0, ans=0.07 2023-06-18 13:09:28,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=162300.0, ans=0.0 2023-06-18 13:10:45,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=162420.0, ans=0.125 2023-06-18 13:11:04,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=162480.0, ans=0.125 2023-06-18 13:11:17,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=162480.0, ans=0.125 2023-06-18 13:11:34,183 INFO [train.py:996] (3/4) Epoch 1, batch 27100, loss[loss=0.2738, simple_loss=0.3505, pruned_loss=0.09856, over 21490.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3587, pruned_loss=0.1221, over 4283724.69 frames. ], batch size: 211, lr: 2.32e-02, grad_scale: 16.0 2023-06-18 13:13:18,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=162720.0, ans=0.125 2023-06-18 13:13:43,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-18 13:13:45,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.912e+02 3.397e+02 4.397e+02 9.217e+02, threshold=6.794e+02, percent-clipped=5.0 2023-06-18 13:14:25,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=162840.0, ans=0.07 2023-06-18 13:14:30,548 INFO [train.py:996] (3/4) Epoch 1, batch 27150, loss[loss=0.3243, simple_loss=0.3937, pruned_loss=0.1275, over 21668.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3703, pruned_loss=0.1263, over 4285245.12 frames. 
], batch size: 263, lr: 2.32e-02, grad_scale: 16.0 2023-06-18 13:15:28,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=162960.0, ans=10.0 2023-06-18 13:15:31,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162960.0, ans=0.1 2023-06-18 13:15:51,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=163020.0, ans=0.125 2023-06-18 13:15:51,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=12.0 2023-06-18 13:16:40,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=163080.0, ans=0.125 2023-06-18 13:17:31,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=12.0 2023-06-18 13:17:33,479 INFO [train.py:996] (3/4) Epoch 1, batch 27200, loss[loss=0.3699, simple_loss=0.4072, pruned_loss=0.1663, over 21414.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.3794, pruned_loss=0.1299, over 4279777.27 frames. ], batch size: 131, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 13:17:54,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-18 13:18:14,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=163260.0, ans=0.0 2023-06-18 13:19:26,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=163380.0, ans=0.0 2023-06-18 13:19:30,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 3.450e+02 3.753e+02 4.440e+02 7.540e+02, threshold=7.506e+02, percent-clipped=2.0 2023-06-18 13:20:13,109 INFO [train.py:996] (3/4) Epoch 1, batch 27250, loss[loss=0.3356, simple_loss=0.3627, pruned_loss=0.1543, over 19986.00 frames. ], tot_loss[loss=0.3272, simple_loss=0.3834, pruned_loss=0.1355, over 4282308.95 frames. ], batch size: 703, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 13:20:44,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=12.0 2023-06-18 13:21:04,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=163560.0, ans=0.07 2023-06-18 13:21:04,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=163560.0, ans=0.125 2023-06-18 13:21:16,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-18 13:22:53,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=163740.0, ans=0.0 2023-06-18 13:23:10,377 INFO [train.py:996] (3/4) Epoch 1, batch 27300, loss[loss=0.3112, simple_loss=0.3818, pruned_loss=0.1203, over 21802.00 frames. ], tot_loss[loss=0.3302, simple_loss=0.3863, pruned_loss=0.137, over 4287273.17 frames. 
], batch size: 282, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 13:25:00,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. limit=10.0 2023-06-18 13:25:12,793 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.700e+02 4.629e+02 5.941e+02 1.159e+03, threshold=9.258e+02, percent-clipped=7.0 2023-06-18 13:25:43,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.88 vs. limit=22.5 2023-06-18 13:26:07,931 INFO [train.py:996] (3/4) Epoch 1, batch 27350, loss[loss=0.3006, simple_loss=0.3781, pruned_loss=0.1115, over 21581.00 frames. ], tot_loss[loss=0.3337, simple_loss=0.3894, pruned_loss=0.139, over 4283192.50 frames. ], batch size: 230, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:26:16,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=164100.0, ans=0.2 2023-06-18 13:26:40,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=164160.0, ans=0.125 2023-06-18 13:27:08,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=164220.0, ans=0.0 2023-06-18 13:27:11,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=164220.0, ans=0.125 2023-06-18 13:27:12,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=8.0 2023-06-18 13:28:34,422 INFO [train.py:996] (3/4) Epoch 1, batch 27400, loss[loss=0.2932, simple_loss=0.3362, pruned_loss=0.1251, over 21719.00 frames. ], tot_loss[loss=0.329, simple_loss=0.3831, pruned_loss=0.1375, over 4289760.19 frames. ], batch size: 230, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:29:18,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=164460.0, ans=0.125 2023-06-18 13:29:18,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=164460.0, ans=0.125 2023-06-18 13:30:13,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=164580.0, ans=0.0 2023-06-18 13:30:24,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=164580.0, ans=0.2 2023-06-18 13:30:34,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 2.849e+02 3.387e+02 3.962e+02 7.059e+02, threshold=6.774e+02, percent-clipped=0.0 2023-06-18 13:30:53,894 INFO [train.py:996] (3/4) Epoch 1, batch 27450, loss[loss=0.2782, simple_loss=0.3334, pruned_loss=0.1115, over 21768.00 frames. ], tot_loss[loss=0.3221, simple_loss=0.3756, pruned_loss=0.1343, over 4291301.79 frames. ], batch size: 124, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:30:55,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=164700.0, ans=0.125 2023-06-18 13:30:59,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.66 vs. 
limit=15.0 2023-06-18 13:30:59,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=15.0 2023-06-18 13:31:25,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=164700.0, ans=0.125 2023-06-18 13:32:25,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=164820.0, ans=0.2 2023-06-18 13:33:27,784 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.21 vs. limit=6.0 2023-06-18 13:33:34,190 INFO [train.py:996] (3/4) Epoch 1, batch 27500, loss[loss=0.2881, simple_loss=0.3419, pruned_loss=0.1172, over 21315.00 frames. ], tot_loss[loss=0.3202, simple_loss=0.3728, pruned_loss=0.1338, over 4292752.84 frames. ], batch size: 143, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:34:25,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=165060.0, ans=0.1 2023-06-18 13:34:33,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=165120.0, ans=0.0 2023-06-18 13:35:25,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.982e+02 3.278e+02 3.839e+02 6.377e+02, threshold=6.555e+02, percent-clipped=0.0 2023-06-18 13:36:05,222 INFO [train.py:996] (3/4) Epoch 1, batch 27550, loss[loss=0.2985, simple_loss=0.3446, pruned_loss=0.1262, over 21381.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3692, pruned_loss=0.1308, over 4290784.33 frames. ], batch size: 131, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:37:11,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=165360.0, ans=0.125 2023-06-18 13:37:14,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=165420.0, ans=0.2 2023-06-18 13:37:48,207 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-06-18 13:37:52,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=165480.0, ans=0.2 2023-06-18 13:38:15,286 INFO [train.py:996] (3/4) Epoch 1, batch 27600, loss[loss=0.2631, simple_loss=0.3067, pruned_loss=0.1098, over 21543.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.362, pruned_loss=0.1297, over 4283639.77 frames. ], batch size: 247, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:38:15,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=165600.0, ans=0.2 2023-06-18 13:38:35,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-18 13:39:13,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.24 vs. 
limit=10.0 2023-06-18 13:40:02,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 3.185e+02 3.559e+02 4.117e+02 6.813e+02, threshold=7.118e+02, percent-clipped=1.0 2023-06-18 13:40:24,909 INFO [train.py:996] (3/4) Epoch 1, batch 27650, loss[loss=0.3037, simple_loss=0.3651, pruned_loss=0.1212, over 21210.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3548, pruned_loss=0.128, over 4277766.22 frames. ], batch size: 159, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:41:29,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=165960.0, ans=0.125 2023-06-18 13:41:37,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=22.5 2023-06-18 13:43:02,027 INFO [train.py:996] (3/4) Epoch 1, batch 27700, loss[loss=0.3449, simple_loss=0.4036, pruned_loss=0.1431, over 21712.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3529, pruned_loss=0.1244, over 4281189.58 frames. ], batch size: 351, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:43:38,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=166200.0, ans=0.0 2023-06-18 13:43:57,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166320.0, ans=0.1 2023-06-18 13:44:22,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=166320.0, ans=0.2 2023-06-18 13:44:31,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=166320.0, ans=0.0 2023-06-18 13:45:09,540 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.145e+02 3.672e+02 4.225e+02 7.825e+02, threshold=7.344e+02, percent-clipped=2.0 2023-06-18 13:45:36,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=166440.0, ans=0.05 2023-06-18 13:45:42,143 INFO [train.py:996] (3/4) Epoch 1, batch 27750, loss[loss=0.3405, simple_loss=0.3877, pruned_loss=0.1467, over 21745.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3567, pruned_loss=0.1236, over 4281006.33 frames. ], batch size: 441, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:45:48,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=166500.0, ans=0.0 2023-06-18 13:47:21,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=166680.0, ans=0.125 2023-06-18 13:48:23,260 INFO [train.py:996] (3/4) Epoch 1, batch 27800, loss[loss=0.3532, simple_loss=0.381, pruned_loss=0.1627, over 21634.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3549, pruned_loss=0.1241, over 4274103.57 frames. ], batch size: 471, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:48:46,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=166800.0, ans=15.0 2023-06-18 13:49:00,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=166860.0, ans=0.0 2023-06-18 13:49:01,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. 
limit=15.0 2023-06-18 13:49:03,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=166860.0, ans=0.0 2023-06-18 13:49:06,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=166860.0, ans=0.04949747468305833 2023-06-18 13:49:25,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=166920.0, ans=0.125 2023-06-18 13:49:37,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=166920.0, ans=0.0 2023-06-18 13:49:52,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166920.0, ans=0.1 2023-06-18 13:50:11,377 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 3.084e+02 3.621e+02 4.481e+02 7.199e+02, threshold=7.242e+02, percent-clipped=0.0 2023-06-18 13:50:21,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-18 13:50:42,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0 2023-06-18 13:50:45,598 INFO [train.py:996] (3/4) Epoch 1, batch 27850, loss[loss=0.2811, simple_loss=0.3491, pruned_loss=0.1065, over 21499.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.355, pruned_loss=0.1256, over 4281127.20 frames. ], batch size: 131, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:51:11,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=167100.0, ans=0.0 2023-06-18 13:52:13,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=167220.0, ans=0.125 2023-06-18 13:53:15,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=167340.0, ans=0.2 2023-06-18 13:53:43,402 INFO [train.py:996] (3/4) Epoch 1, batch 27900, loss[loss=0.3033, simple_loss=0.3814, pruned_loss=0.1126, over 21862.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3635, pruned_loss=0.1274, over 4279723.62 frames. ], batch size: 372, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:54:31,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=167460.0, ans=0.0 2023-06-18 13:55:22,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=167520.0, ans=0.125 2023-06-18 13:55:58,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 3.095e+02 3.904e+02 5.239e+02 9.245e+02, threshold=7.808e+02, percent-clipped=6.0 2023-06-18 13:56:15,447 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:56:16,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=167700.0, ans=0.125 2023-06-18 13:56:23,310 INFO [train.py:996] (3/4) Epoch 1, batch 27950, loss[loss=0.2814, simple_loss=0.3584, pruned_loss=0.1022, over 21646.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3622, pruned_loss=0.122, over 4279047.10 frames. 
], batch size: 263, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:59:22,441 INFO [train.py:996] (3/4) Epoch 1, batch 28000, loss[loss=0.2166, simple_loss=0.2906, pruned_loss=0.07129, over 21853.00 frames. ], tot_loss[loss=0.2985, simple_loss=0.36, pruned_loss=0.1185, over 4283746.70 frames. ], batch size: 98, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 14:01:05,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=168180.0, ans=0.125 2023-06-18 14:01:10,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 2.776e+02 3.408e+02 4.603e+02 6.959e+02, threshold=6.817e+02, percent-clipped=0.0 2023-06-18 14:01:23,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=168240.0, ans=0.125 2023-06-18 14:01:24,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=168240.0, ans=0.0 2023-06-18 14:01:42,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-18 14:01:43,199 INFO [train.py:996] (3/4) Epoch 1, batch 28050, loss[loss=0.3241, simple_loss=0.3836, pruned_loss=0.1323, over 21651.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3593, pruned_loss=0.1216, over 4287065.59 frames. ], batch size: 441, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 14:02:54,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=168360.0, ans=0.0 2023-06-18 14:04:04,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-18 14:04:28,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=168540.0, ans=0.125 2023-06-18 14:04:31,082 INFO [train.py:996] (3/4) Epoch 1, batch 28100, loss[loss=0.2828, simple_loss=0.322, pruned_loss=0.1218, over 21528.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.357, pruned_loss=0.1219, over 4281034.03 frames. ], batch size: 263, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 14:05:02,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=168660.0, ans=0.125 2023-06-18 14:05:08,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-18 14:05:11,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=168660.0, ans=0.125 2023-06-18 14:05:44,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=168720.0, ans=0.125 2023-06-18 14:06:12,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.173e+02 3.819e+02 4.992e+02 1.067e+03, threshold=7.638e+02, percent-clipped=9.0 2023-06-18 14:06:55,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=168840.0, ans=0.05 2023-06-18 14:06:59,260 INFO [train.py:996] (3/4) Epoch 1, batch 28150, loss[loss=0.2498, simple_loss=0.28, pruned_loss=0.1098, over 20696.00 frames. 
], tot_loss[loss=0.2971, simple_loss=0.3504, pruned_loss=0.1219, over 4274300.08 frames. ], batch size: 608, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:07:12,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=168900.0, ans=0.125 2023-06-18 14:09:29,052 INFO [train.py:996] (3/4) Epoch 1, batch 28200, loss[loss=0.2673, simple_loss=0.3084, pruned_loss=0.1131, over 21550.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3507, pruned_loss=0.1245, over 4270474.07 frames. ], batch size: 263, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:10:55,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=169320.0, ans=0.2 2023-06-18 14:11:38,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.366e+02 3.817e+02 4.750e+02 7.946e+02, threshold=7.633e+02, percent-clipped=3.0 2023-06-18 14:11:41,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=169440.0, ans=0.125 2023-06-18 14:11:47,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=169440.0, ans=0.0 2023-06-18 14:12:06,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=169500.0, ans=0.0 2023-06-18 14:12:07,344 INFO [train.py:996] (3/4) Epoch 1, batch 28250, loss[loss=0.3111, simple_loss=0.3522, pruned_loss=0.135, over 21641.00 frames. ], tot_loss[loss=0.308, simple_loss=0.3563, pruned_loss=0.1299, over 4271826.25 frames. ], batch size: 298, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:14:23,362 INFO [train.py:996] (3/4) Epoch 1, batch 28300, loss[loss=0.21, simple_loss=0.2836, pruned_loss=0.06825, over 21364.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3523, pruned_loss=0.1258, over 4272555.75 frames. ], batch size: 194, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:14:47,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169800.0, ans=0.1 2023-06-18 14:14:54,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=169800.0, ans=10.0 2023-06-18 14:15:27,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=169860.0, ans=0.125 2023-06-18 14:16:42,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.780e+02 3.405e+02 4.225e+02 7.738e+02, threshold=6.811e+02, percent-clipped=1.0 2023-06-18 14:16:52,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-18 14:17:20,903 INFO [train.py:996] (3/4) Epoch 1, batch 28350, loss[loss=0.2699, simple_loss=0.3204, pruned_loss=0.1097, over 21623.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3482, pruned_loss=0.1184, over 4270407.01 frames. ], batch size: 282, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:17:25,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. 
limit=10.0 2023-06-18 14:17:43,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=170160.0, ans=0.0 2023-06-18 14:18:05,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=170220.0, ans=0.2 2023-06-18 14:19:08,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=170280.0, ans=0.0 2023-06-18 14:19:09,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-18 14:19:32,446 INFO [train.py:996] (3/4) Epoch 1, batch 28400, loss[loss=0.2738, simple_loss=0.3209, pruned_loss=0.1133, over 21537.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3439, pruned_loss=0.1183, over 4267968.70 frames. ], batch size: 263, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:19:34,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=170400.0, ans=0.0 2023-06-18 14:19:35,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=170400.0, ans=0.125 2023-06-18 14:20:00,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=170400.0, ans=0.0 2023-06-18 14:21:34,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.372e+02 4.044e+02 4.842e+02 8.605e+02, threshold=8.089e+02, percent-clipped=5.0 2023-06-18 14:21:36,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=15.0 2023-06-18 14:21:59,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=170640.0, ans=0.5 2023-06-18 14:22:19,297 INFO [train.py:996] (3/4) Epoch 1, batch 28450, loss[loss=0.2651, simple_loss=0.2922, pruned_loss=0.119, over 20010.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3506, pruned_loss=0.1236, over 4266057.96 frames. ], batch size: 703, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:23:02,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=170760.0, ans=0.2 2023-06-18 14:23:55,025 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:24:18,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=170880.0, ans=0.5 2023-06-18 14:24:25,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=170940.0, ans=0.0 2023-06-18 14:24:28,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=170940.0, ans=0.0 2023-06-18 14:24:45,896 INFO [train.py:996] (3/4) Epoch 1, batch 28500, loss[loss=0.3294, simple_loss=0.3761, pruned_loss=0.1413, over 21757.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3534, pruned_loss=0.1265, over 4274657.16 frames. 
], batch size: 298, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:25:23,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=171060.0, ans=0.0 2023-06-18 14:25:45,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=171060.0, ans=0.125 2023-06-18 14:26:36,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=171180.0, ans=0.0 2023-06-18 14:26:36,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=171180.0, ans=0.1 2023-06-18 14:26:37,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.096e+02 3.441e+02 4.526e+02 8.440e+02, threshold=6.881e+02, percent-clipped=1.0 2023-06-18 14:27:28,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=171300.0, ans=0.125 2023-06-18 14:27:38,029 INFO [train.py:996] (3/4) Epoch 1, batch 28550, loss[loss=0.4297, simple_loss=0.4725, pruned_loss=0.1934, over 21727.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3641, pruned_loss=0.1312, over 4276851.21 frames. ], batch size: 441, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:28:03,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=171360.0, ans=0.0 2023-06-18 14:29:26,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-18 14:29:59,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=171540.0, ans=0.2 2023-06-18 14:30:04,833 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:30:17,192 INFO [train.py:996] (3/4) Epoch 1, batch 28600, loss[loss=0.3961, simple_loss=0.4301, pruned_loss=0.181, over 21780.00 frames. ], tot_loss[loss=0.3184, simple_loss=0.3712, pruned_loss=0.1329, over 4275677.02 frames. ], batch size: 441, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:30:42,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=171600.0, ans=0.125 2023-06-18 14:30:50,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=15.0 2023-06-18 14:31:58,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=171780.0, ans=0.0 2023-06-18 14:32:15,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.104e+02 3.641e+02 4.688e+02 7.003e+02, threshold=7.282e+02, percent-clipped=1.0 2023-06-18 14:32:24,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=171840.0, ans=10.0 2023-06-18 14:32:57,303 INFO [train.py:996] (3/4) Epoch 1, batch 28650, loss[loss=0.2789, simple_loss=0.3211, pruned_loss=0.1183, over 21704.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.3651, pruned_loss=0.132, over 4274072.12 frames. 
], batch size: 334, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:33:13,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=171900.0, ans=0.0 2023-06-18 14:33:21,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-18 14:33:27,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=171960.0, ans=0.125 2023-06-18 14:35:35,840 INFO [train.py:996] (3/4) Epoch 1, batch 28700, loss[loss=0.2983, simple_loss=0.3293, pruned_loss=0.1337, over 20207.00 frames. ], tot_loss[loss=0.3162, simple_loss=0.3651, pruned_loss=0.1336, over 4265745.81 frames. ], batch size: 707, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:35:46,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=172200.0, ans=0.125 2023-06-18 14:35:47,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=172200.0, ans=0.125 2023-06-18 14:36:01,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=172260.0, ans=0.0 2023-06-18 14:36:45,945 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:37:34,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.041e+02 3.554e+02 4.437e+02 9.213e+02, threshold=7.107e+02, percent-clipped=2.0 2023-06-18 14:38:08,592 INFO [train.py:996] (3/4) Epoch 1, batch 28750, loss[loss=0.3823, simple_loss=0.4238, pruned_loss=0.1704, over 21545.00 frames. ], tot_loss[loss=0.3159, simple_loss=0.3635, pruned_loss=0.1342, over 4273057.22 frames. ], batch size: 507, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:39:27,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=172620.0, ans=0.125 2023-06-18 14:39:57,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=172680.0, ans=0.2 2023-06-18 14:40:46,664 INFO [train.py:996] (3/4) Epoch 1, batch 28800, loss[loss=0.4122, simple_loss=0.4496, pruned_loss=0.1874, over 21834.00 frames. ], tot_loss[loss=0.3184, simple_loss=0.3681, pruned_loss=0.1344, over 4269662.18 frames. ], batch size: 124, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:40:47,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-18 14:41:14,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172860.0, ans=0.0 2023-06-18 14:42:41,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.872e+02 3.537e+02 4.377e+02 5.475e+02 1.000e+03, threshold=8.754e+02, percent-clipped=7.0 2023-06-18 14:43:28,267 INFO [train.py:996] (3/4) Epoch 1, batch 28850, loss[loss=0.2912, simple_loss=0.341, pruned_loss=0.1207, over 21809.00 frames. ], tot_loss[loss=0.3213, simple_loss=0.3699, pruned_loss=0.1364, over 4279636.09 frames. 
], batch size: 247, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:44:43,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=173220.0, ans=0.125 2023-06-18 14:46:14,987 INFO [train.py:996] (3/4) Epoch 1, batch 28900, loss[loss=0.3427, simple_loss=0.3955, pruned_loss=0.1449, over 21746.00 frames. ], tot_loss[loss=0.3254, simple_loss=0.3732, pruned_loss=0.1388, over 4283007.45 frames. ], batch size: 332, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:47:06,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=173460.0, ans=0.2 2023-06-18 14:48:08,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.72 vs. limit=6.0 2023-06-18 14:48:16,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.311e+02 3.872e+02 4.627e+02 1.015e+03, threshold=7.745e+02, percent-clipped=2.0 2023-06-18 14:48:23,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=173580.0, ans=0.0 2023-06-18 14:48:59,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=173640.0, ans=0.0 2023-06-18 14:49:06,892 INFO [train.py:996] (3/4) Epoch 1, batch 28950, loss[loss=0.451, simple_loss=0.5187, pruned_loss=0.1917, over 19705.00 frames. ], tot_loss[loss=0.3251, simple_loss=0.3742, pruned_loss=0.138, over 4278031.45 frames. ], batch size: 702, lr: 2.25e-02, grad_scale: 64.0 2023-06-18 14:49:07,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=173700.0, ans=0.1 2023-06-18 14:49:40,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=173760.0, ans=0.0 2023-06-18 14:51:09,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=173880.0, ans=0.125 2023-06-18 14:51:10,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=173880.0, ans=0.125 2023-06-18 14:51:52,107 INFO [train.py:996] (3/4) Epoch 1, batch 29000, loss[loss=0.3441, simple_loss=0.38, pruned_loss=0.1541, over 19889.00 frames. ], tot_loss[loss=0.327, simple_loss=0.3795, pruned_loss=0.1372, over 4275963.26 frames. ], batch size: 702, lr: 2.25e-02, grad_scale: 64.0 2023-06-18 14:51:55,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=174000.0, ans=0.09899494936611666 2023-06-18 14:51:55,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174000.0, ans=0.1 2023-06-18 14:52:31,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=174000.0, ans=0.0 2023-06-18 14:52:43,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=174060.0, ans=0.015 2023-06-18 14:53:52,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 3.056e+02 3.518e+02 4.354e+02 7.767e+02, threshold=7.035e+02, percent-clipped=1.0 2023-06-18 14:54:47,782 INFO [train.py:996] (3/4) Epoch 1, batch 29050, loss[loss=0.3161, simple_loss=0.3574, pruned_loss=0.1374, over 21862.00 frames. 
], tot_loss[loss=0.3258, simple_loss=0.3766, pruned_loss=0.1375, over 4278324.75 frames. ], batch size: 298, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 14:54:50,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-18 14:55:01,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=174300.0, ans=0.125 2023-06-18 14:55:21,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2023-06-18 14:55:23,143 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-18 14:55:42,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=174360.0, ans=0.125 2023-06-18 14:55:45,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=174420.0, ans=0.125 2023-06-18 14:56:22,642 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.28 vs. limit=15.0 2023-06-18 14:56:50,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=174540.0, ans=0.125 2023-06-18 14:57:11,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=174600.0, ans=0.125 2023-06-18 14:57:12,111 INFO [train.py:996] (3/4) Epoch 1, batch 29100, loss[loss=0.2852, simple_loss=0.3248, pruned_loss=0.1228, over 21751.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3656, pruned_loss=0.1335, over 4279850.52 frames. ], batch size: 351, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 14:57:49,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=174660.0, ans=0.1 2023-06-18 14:59:05,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=174780.0, ans=0.1 2023-06-18 14:59:07,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 3.175e+02 3.774e+02 4.756e+02 7.058e+02, threshold=7.548e+02, percent-clipped=1.0 2023-06-18 14:59:31,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=174840.0, ans=0.0 2023-06-18 14:59:38,222 INFO [train.py:996] (3/4) Epoch 1, batch 29150, loss[loss=0.2815, simple_loss=0.332, pruned_loss=0.1155, over 21954.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3619, pruned_loss=0.1299, over 4275290.53 frames. ], batch size: 103, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 15:00:44,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=175020.0, ans=0.2 2023-06-18 15:02:04,912 INFO [train.py:996] (3/4) Epoch 1, batch 29200, loss[loss=0.2783, simple_loss=0.3301, pruned_loss=0.1132, over 21767.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3565, pruned_loss=0.1286, over 4267017.04 frames. 
], batch size: 371, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:02:48,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175260.0, ans=0.1 2023-06-18 15:03:11,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=175260.0, ans=0.125 2023-06-18 15:04:00,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.999e+02 3.311e+02 3.772e+02 7.580e+02, threshold=6.622e+02, percent-clipped=1.0 2023-06-18 15:04:23,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=175440.0, ans=0.04949747468305833 2023-06-18 15:04:39,751 INFO [train.py:996] (3/4) Epoch 1, batch 29250, loss[loss=0.2383, simple_loss=0.3018, pruned_loss=0.08742, over 21700.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3537, pruned_loss=0.1246, over 4255239.04 frames. ], batch size: 112, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:05:52,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-18 15:06:30,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=175680.0, ans=0.125 2023-06-18 15:07:21,913 INFO [train.py:996] (3/4) Epoch 1, batch 29300, loss[loss=0.314, simple_loss=0.3618, pruned_loss=0.1331, over 21584.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.355, pruned_loss=0.1234, over 4258186.58 frames. ], batch size: 441, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:07:22,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-18 15:07:32,984 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.41 vs. limit=15.0 2023-06-18 15:07:37,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=175800.0, ans=0.07 2023-06-18 15:07:57,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=175860.0, ans=0.0 2023-06-18 15:08:11,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=175920.0, ans=0.125 2023-06-18 15:08:53,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=175980.0, ans=0.125 2023-06-18 15:09:16,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.106e+02 3.713e+02 4.603e+02 7.072e+02, threshold=7.425e+02, percent-clipped=2.0 2023-06-18 15:09:28,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=176040.0, ans=0.125 2023-06-18 15:09:31,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=176040.0, ans=0.0 2023-06-18 15:09:47,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=176040.0, ans=0.125 2023-06-18 15:09:50,098 INFO [train.py:996] (3/4) Epoch 1, batch 29350, loss[loss=0.3366, simple_loss=0.388, pruned_loss=0.1426, over 21530.00 frames. 
], tot_loss[loss=0.2985, simple_loss=0.3513, pruned_loss=0.1229, over 4261355.93 frames. ], batch size: 441, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:10:32,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=176160.0, ans=0.0 2023-06-18 15:11:21,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=176280.0, ans=0.125 2023-06-18 15:11:23,597 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.26 vs. limit=15.0 2023-06-18 15:11:36,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=176280.0, ans=0.0 2023-06-18 15:12:23,753 INFO [train.py:996] (3/4) Epoch 1, batch 29400, loss[loss=0.2357, simple_loss=0.3036, pruned_loss=0.08395, over 21704.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3494, pruned_loss=0.1192, over 4262888.23 frames. ], batch size: 247, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:12:26,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-18 15:12:38,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=176400.0, ans=0.05 2023-06-18 15:14:08,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=176580.0, ans=0.0 2023-06-18 15:14:09,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.984e+02 3.574e+02 4.623e+02 7.193e+02, threshold=7.148e+02, percent-clipped=0.0 2023-06-18 15:15:00,994 INFO [train.py:996] (3/4) Epoch 1, batch 29450, loss[loss=0.3215, simple_loss=0.371, pruned_loss=0.136, over 21593.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3451, pruned_loss=0.1163, over 4266692.71 frames. ], batch size: 263, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:15:22,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=176700.0, ans=0.2 2023-06-18 15:16:16,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=176820.0, ans=0.2 2023-06-18 15:17:41,660 INFO [train.py:996] (3/4) Epoch 1, batch 29500, loss[loss=0.3614, simple_loss=0.3952, pruned_loss=0.1637, over 21830.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3511, pruned_loss=0.1206, over 4272906.83 frames. ], batch size: 441, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:18:33,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=177120.0, ans=0.07 2023-06-18 15:18:40,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=177120.0, ans=0.5 2023-06-18 15:19:28,515 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 3.232e+02 3.818e+02 4.731e+02 1.137e+03, threshold=7.636e+02, percent-clipped=8.0 2023-06-18 15:20:13,110 INFO [train.py:996] (3/4) Epoch 1, batch 29550, loss[loss=0.3082, simple_loss=0.3524, pruned_loss=0.132, over 21857.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3532, pruned_loss=0.124, over 4283246.73 frames. 
], batch size: 298, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:20:28,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=177360.0, ans=0.125 2023-06-18 15:22:02,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=177480.0, ans=0.125 2023-06-18 15:22:03,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=177480.0, ans=0.0 2023-06-18 15:22:18,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=177540.0, ans=0.0 2023-06-18 15:22:31,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=177540.0, ans=0.125 2023-06-18 15:22:47,024 INFO [train.py:996] (3/4) Epoch 1, batch 29600, loss[loss=0.3261, simple_loss=0.3929, pruned_loss=0.1296, over 21629.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.3617, pruned_loss=0.1278, over 4285692.43 frames. ], batch size: 263, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:24:35,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=177720.0, ans=0.1 2023-06-18 15:24:44,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=177780.0, ans=0.0 2023-06-18 15:24:54,980 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.992e+02 3.426e+02 4.261e+02 6.477e+02, threshold=6.851e+02, percent-clipped=0.0 2023-06-18 15:25:26,155 INFO [train.py:996] (3/4) Epoch 1, batch 29650, loss[loss=0.2463, simple_loss=0.3025, pruned_loss=0.09507, over 21139.00 frames. ], tot_loss[loss=0.3023, simple_loss=0.3583, pruned_loss=0.1232, over 4276147.41 frames. ], batch size: 159, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:26:01,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=177960.0, ans=0.125 2023-06-18 15:26:53,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-18 15:27:41,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=178140.0, ans=0.1 2023-06-18 15:28:08,842 INFO [train.py:996] (3/4) Epoch 1, batch 29700, loss[loss=0.2436, simple_loss=0.3119, pruned_loss=0.08766, over 21775.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3588, pruned_loss=0.1227, over 4278036.02 frames. ], batch size: 298, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:28:09,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=178200.0, ans=0.04949747468305833 2023-06-18 15:29:59,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=178380.0, ans=0.0 2023-06-18 15:30:14,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.216e+02 3.981e+02 5.488e+02 8.524e+02, threshold=7.962e+02, percent-clipped=7.0 2023-06-18 15:30:31,207 INFO [train.py:996] (3/4) Epoch 1, batch 29750, loss[loss=0.3844, simple_loss=0.4209, pruned_loss=0.174, over 21570.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.363, pruned_loss=0.1228, over 4276470.97 frames. 
], batch size: 471, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:33:17,749 INFO [train.py:996] (3/4) Epoch 1, batch 29800, loss[loss=0.3201, simple_loss=0.4054, pruned_loss=0.1174, over 20873.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3656, pruned_loss=0.1248, over 4284503.78 frames. ], batch size: 608, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:33:46,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=178860.0, ans=0.2 2023-06-18 15:34:09,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=178860.0, ans=0.0 2023-06-18 15:34:47,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=178920.0, ans=0.125 2023-06-18 15:35:23,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 3.012e+02 3.384e+02 3.968e+02 6.046e+02, threshold=6.767e+02, percent-clipped=0.0 2023-06-18 15:35:46,771 INFO [train.py:996] (3/4) Epoch 1, batch 29850, loss[loss=0.3245, simple_loss=0.3643, pruned_loss=0.1424, over 21854.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3619, pruned_loss=0.1227, over 4278977.64 frames. ], batch size: 414, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:37:55,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=179340.0, ans=0.1 2023-06-18 15:38:20,890 INFO [train.py:996] (3/4) Epoch 1, batch 29900, loss[loss=0.3201, simple_loss=0.3655, pruned_loss=0.1373, over 21251.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3595, pruned_loss=0.1238, over 4288644.09 frames. ], batch size: 143, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:39:19,950 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:40:00,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=179580.0, ans=0.1 2023-06-18 15:40:12,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.008e+02 3.612e+02 4.366e+02 7.864e+02, threshold=7.225e+02, percent-clipped=1.0 2023-06-18 15:40:49,210 INFO [train.py:996] (3/4) Epoch 1, batch 29950, loss[loss=0.3091, simple_loss=0.3663, pruned_loss=0.1259, over 21267.00 frames. ], tot_loss[loss=0.3116, simple_loss=0.3648, pruned_loss=0.1292, over 4285362.35 frames. ], batch size: 143, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:40:56,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2023-06-18 15:41:58,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=179820.0, ans=0.125 2023-06-18 15:43:22,195 INFO [train.py:996] (3/4) Epoch 1, batch 30000, loss[loss=0.2576, simple_loss=0.3362, pruned_loss=0.08952, over 21448.00 frames. ], tot_loss[loss=0.314, simple_loss=0.3682, pruned_loss=0.1299, over 4287484.78 frames. 
], batch size: 131, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:43:22,196 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 15:44:13,669 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.5150, 2.3478, 3.7613, 2.1704], device='cuda:3') 2023-06-18 15:44:17,106 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.2715, simple_loss=0.3724, pruned_loss=0.08526, over 1796401.00 frames. 2023-06-18 15:44:17,107 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-18 15:45:05,965 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:46:21,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.874e+02 3.529e+02 4.809e+02 8.528e+02, threshold=7.059e+02, percent-clipped=3.0 2023-06-18 15:46:45,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=180240.0, ans=0.0 2023-06-18 15:46:49,222 INFO [train.py:996] (3/4) Epoch 1, batch 30050, loss[loss=0.4298, simple_loss=0.498, pruned_loss=0.1808, over 21543.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3715, pruned_loss=0.1263, over 4288416.01 frames. ], batch size: 471, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:47:24,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=180300.0, ans=0.125 2023-06-18 15:47:25,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=180360.0, ans=0.2 2023-06-18 15:47:32,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=180360.0, ans=0.0 2023-06-18 15:48:18,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=180480.0, ans=0.125 2023-06-18 15:48:53,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=180480.0, ans=0.1 2023-06-18 15:49:30,835 INFO [train.py:996] (3/4) Epoch 1, batch 30100, loss[loss=0.2905, simple_loss=0.3294, pruned_loss=0.1258, over 21166.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3691, pruned_loss=0.1255, over 4278752.39 frames. ], batch size: 159, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:50:27,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5 2023-06-18 15:50:36,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=180720.0, ans=10.0 2023-06-18 15:50:39,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=180720.0, ans=0.125 2023-06-18 15:51:04,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-18 15:51:12,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 3.465e+02 4.064e+02 4.791e+02 7.822e+02, threshold=8.128e+02, percent-clipped=3.0 2023-06-18 15:51:59,485 INFO [train.py:996] (3/4) Epoch 1, batch 30150, loss[loss=0.3005, simple_loss=0.3548, pruned_loss=0.1231, over 16248.00 frames. 
], tot_loss[loss=0.3111, simple_loss=0.3663, pruned_loss=0.128, over 4270766.95 frames. ], batch size: 62, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:52:15,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=180900.0, ans=0.035 2023-06-18 15:52:25,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2023-06-18 15:52:31,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=180960.0, ans=0.2 2023-06-18 15:53:14,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=181020.0, ans=0.0 2023-06-18 15:53:14,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=181020.0, ans=0.125 2023-06-18 15:54:21,293 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:54:25,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=181140.0, ans=0.1 2023-06-18 15:54:32,746 INFO [train.py:996] (3/4) Epoch 1, batch 30200, loss[loss=0.2812, simple_loss=0.3223, pruned_loss=0.1201, over 21795.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3678, pruned_loss=0.1256, over 4267053.02 frames. ], batch size: 102, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:54:53,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.89 vs. limit=15.0 2023-06-18 15:55:10,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=181260.0, ans=0.125 2023-06-18 15:56:44,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 2.883e+02 3.649e+02 4.824e+02 8.627e+02, threshold=7.297e+02, percent-clipped=2.0 2023-06-18 15:57:15,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=12.0 2023-06-18 15:57:37,562 INFO [train.py:996] (3/4) Epoch 1, batch 30250, loss[loss=0.3828, simple_loss=0.4727, pruned_loss=0.1464, over 20798.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3767, pruned_loss=0.1296, over 4264348.68 frames. ], batch size: 607, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:57:39,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=181500.0, ans=0.0 2023-06-18 15:59:01,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.38 vs. limit=10.0 2023-06-18 15:59:37,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=181740.0, ans=0.0 2023-06-18 15:59:46,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=181740.0, ans=0.125 2023-06-18 16:00:04,192 INFO [train.py:996] (3/4) Epoch 1, batch 30300, loss[loss=0.2623, simple_loss=0.3056, pruned_loss=0.1095, over 21163.00 frames. ], tot_loss[loss=0.3161, simple_loss=0.373, pruned_loss=0.1296, over 4260371.97 frames. 
], batch size: 176, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 16:00:12,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-18 16:01:28,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=181920.0, ans=0.2 2023-06-18 16:02:11,994 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.225e+02 4.059e+02 5.033e+02 9.732e+02, threshold=8.118e+02, percent-clipped=4.0 2023-06-18 16:02:58,930 INFO [train.py:996] (3/4) Epoch 1, batch 30350, loss[loss=0.3014, simple_loss=0.363, pruned_loss=0.1199, over 21709.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3731, pruned_loss=0.131, over 4262282.35 frames. ], batch size: 298, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 16:03:13,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=22.5 2023-06-18 16:03:31,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=182160.0, ans=0.125 2023-06-18 16:04:38,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=182280.0, ans=0.1 2023-06-18 16:04:39,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.08 vs. limit=15.0 2023-06-18 16:05:15,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=182340.0, ans=0.125 2023-06-18 16:05:57,808 INFO [train.py:996] (3/4) Epoch 1, batch 30400, loss[loss=0.3121, simple_loss=0.3324, pruned_loss=0.1459, over 20207.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3636, pruned_loss=0.1278, over 4243333.85 frames. ], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 16:06:06,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=182400.0, ans=0.125 2023-06-18 16:06:30,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=182400.0, ans=0.125 2023-06-18 16:06:46,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-18 16:09:13,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=182580.0, ans=0.125 2023-06-18 16:09:39,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-18 16:09:50,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.733e+02 4.742e+02 5.817e+02 2.279e+03, threshold=9.485e+02, percent-clipped=8.0 2023-06-18 16:11:10,018 INFO [train.py:996] (3/4) Epoch 1, batch 30450, loss[loss=0.4321, simple_loss=0.5217, pruned_loss=0.1713, over 19796.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3689, pruned_loss=0.1314, over 4187314.68 frames. 
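[Editor's note: the recurring ScheduledFloat lines print module hyperparameters (skip rates, bypass scale bounds, balancer probabilities) that are functions of the global batch_count rather than constants, e.g. bypass.scale_min decaying from 0.9 early in training to the 0.2 seen above. A minimal sketch of a piecewise-linear schedule of that kind; the breakpoints below are hypothetical.]

```python
# Sketch: a piecewise-linear float schedule keyed on the global batch
# count, in the spirit of the ScheduledFloat values in the log.

class PiecewiseLinearFloat:
    def __init__(self, *points: "tuple[float, float]"):
        self.points = sorted(points)  # (batch_count, value) breakpoints

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return pts[-1][1]

# Hypothetical breakpoints: 0.9 at batch 0, relaxed to 0.2 by batch 4000.
scale_min = PiecewiseLinearFloat((0, 0.9), (4000, 0.2))
print(scale_min(0), scale_min(180360))  # 0.9 early; 0.2 at the batch counts above
```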
], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 16:12:08,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=182760.0, ans=0.09899494936611666 2023-06-18 16:12:09,109 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-18 16:12:17,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=182760.0, ans=0.04949747468305833 2023-06-18 16:12:52,505 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.23 vs. limit=10.0 2023-06-18 16:12:58,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=182760.0, ans=0.2 2023-06-18 16:13:08,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-18 16:13:40,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=182820.0, ans=0.1 2023-06-18 16:14:34,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=182880.0, ans=0.0 2023-06-18 16:14:57,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.62 vs. limit=5.0 2023-06-18 16:17:54,938 INFO [train.py:996] (3/4) Epoch 2, batch 0, loss[loss=0.3619, simple_loss=0.3856, pruned_loss=0.1691, over 21742.00 frames. ], tot_loss[loss=0.3619, simple_loss=0.3856, pruned_loss=0.1691, over 21742.00 frames. ], batch size: 112, lr: 2.01e-02, grad_scale: 32.0 2023-06-18 16:17:54,939 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 16:18:53,129 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2985, simple_loss=0.394, pruned_loss=0.1016, over 1796401.00 frames. 2023-06-18 16:18:53,130 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-18 16:18:54,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=182970.0, ans=12.0 2023-06-18 16:19:09,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-18 16:19:09,716 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-18 16:19:29,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=183090.0, ans=0.5 2023-06-18 16:20:34,006 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.662e+02 5.243e+02 7.950e+02 2.244e+03, threshold=1.049e+03, percent-clipped=17.0 2023-06-18 16:20:37,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183210.0, ans=0.1 2023-06-18 16:20:43,990 INFO [train.py:996] (3/4) Epoch 2, batch 50, loss[loss=0.3313, simple_loss=0.3914, pruned_loss=0.1355, over 19928.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3707, pruned_loss=0.1303, over 966496.83 frames. 
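[Editor's note: at each epoch boundary, as in the Epoch 2, batch 0 entry above, the loop switches to the dev loader, computes a validation loss over the fixed 1796401-frame dev set, and reports peak CUDA memory. A compact hedged sketch of that step; `model` and `dev_loader` are placeholders with an assumed forward signature.]

```python
# Sketch: the periodic validation pass implied by the log, a
# frame-weighted average over the dev set plus a peak-memory report.
import torch

def validate(model, dev_loader, device) -> float:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = model(batch)   # hypothetical signature
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    print(f"validation: loss={tot_loss / tot_frames:.4f}, "
          f"over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is "
          f"{torch.cuda.max_memory_allocated(device) // 2**20}MB")
    return tot_loss / tot_frames
```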
], batch size: 702, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:21:20,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-18 16:22:10,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=183390.0, ans=0.125 2023-06-18 16:22:17,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=183390.0, ans=0.125 2023-06-18 16:22:47,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.32 vs. limit=15.0 2023-06-18 16:23:17,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=22.5 2023-06-18 16:23:20,714 INFO [train.py:996] (3/4) Epoch 2, batch 100, loss[loss=0.4667, simple_loss=0.5031, pruned_loss=0.2152, over 21426.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3862, pruned_loss=0.1302, over 1696418.80 frames. ], batch size: 507, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:23:43,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=183630.0, ans=0.0 2023-06-18 16:25:19,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=183810.0, ans=0.125 2023-06-18 16:25:24,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.827e+02 3.490e+02 4.158e+02 9.308e+02, threshold=6.980e+02, percent-clipped=0.0 2023-06-18 16:25:44,218 INFO [train.py:996] (3/4) Epoch 2, batch 150, loss[loss=0.2682, simple_loss=0.3509, pruned_loss=0.09273, over 21609.00 frames. ], tot_loss[loss=0.324, simple_loss=0.3881, pruned_loss=0.13, over 2270367.00 frames. ], batch size: 230, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:25:55,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=183870.0, ans=0.0 2023-06-18 16:26:13,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=183930.0, ans=0.125 2023-06-18 16:26:15,812 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.25 vs. limit=10.0 2023-06-18 16:26:16,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=183930.0, ans=0.0 2023-06-18 16:27:08,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=184050.0, ans=0.015 2023-06-18 16:27:29,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=184110.0, ans=0.125 2023-06-18 16:28:08,794 INFO [train.py:996] (3/4) Epoch 2, batch 200, loss[loss=0.3425, simple_loss=0.3992, pruned_loss=0.1429, over 21573.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.382, pruned_loss=0.1264, over 2714655.83 frames. 
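[Editor's note: each train entry reports both the current batch's loss[...] and a tot_loss[...] whose frame count hovers around a few million instead of growing without bound, i.e. a frame-weighted running aggregate over recent batches. A hedged sketch of such a tracker; the decay constant and the example numbers are hypothetical.]

```python
# Sketch: frame-weighted exponentially-decayed statistics behind the
# tot_loss[...] entries; the decay keeps the effective window at a few
# million frames.

class RunningLoss:
    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
for loss, frames in [(0.32, 21000.0), (0.30, 22000.0)]:  # hypothetical batches
    tracker.update(loss, frames)
print(f"tot_loss[loss={tracker.value:.4f}, over {tracker.frames:.2f} frames.]")
```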
], batch size: 389, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:28:45,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=184230.0, ans=15.0 2023-06-18 16:28:58,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=184230.0, ans=0.125 2023-06-18 16:29:04,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=184290.0, ans=0.2 2023-06-18 16:30:03,482 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.883e+02 3.671e+02 4.524e+02 7.455e+02, threshold=7.342e+02, percent-clipped=3.0 2023-06-18 16:30:31,429 INFO [train.py:996] (3/4) Epoch 2, batch 250, loss[loss=0.3128, simple_loss=0.369, pruned_loss=0.1283, over 21246.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.3766, pruned_loss=0.126, over 3064748.78 frames. ], batch size: 176, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:30:31,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=184470.0, ans=0.125 2023-06-18 16:30:37,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=184470.0, ans=0.125 2023-06-18 16:31:51,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=184590.0, ans=10.0 2023-06-18 16:33:00,000 INFO [train.py:996] (3/4) Epoch 2, batch 300, loss[loss=0.3352, simple_loss=0.3789, pruned_loss=0.1458, over 21418.00 frames. ], tot_loss[loss=0.3137, simple_loss=0.3731, pruned_loss=0.1271, over 3336645.07 frames. ], batch size: 159, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:33:51,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-18 16:33:58,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-18 16:34:55,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184950.0, ans=0.1 2023-06-18 16:35:16,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.885e+02 3.453e+02 4.241e+02 7.673e+02, threshold=6.906e+02, percent-clipped=1.0 2023-06-18 16:35:32,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=185010.0, ans=0.125 2023-06-18 16:35:39,463 INFO [train.py:996] (3/4) Epoch 2, batch 350, loss[loss=0.2879, simple_loss=0.3277, pruned_loss=0.124, over 21595.00 frames. ], tot_loss[loss=0.3082, simple_loss=0.3657, pruned_loss=0.1254, over 3550088.36 frames. ], batch size: 415, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:36:07,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=185070.0, ans=0.125 2023-06-18 16:36:07,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-18 16:36:34,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.55 vs. 
limit=22.5 2023-06-18 16:36:47,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=22.5 2023-06-18 16:37:08,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=185190.0, ans=0.125 2023-06-18 16:37:22,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185250.0, ans=0.1 2023-06-18 16:37:30,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=185250.0, ans=10.0 2023-06-18 16:37:50,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185310.0, ans=0.1 2023-06-18 16:38:05,896 INFO [train.py:996] (3/4) Epoch 2, batch 400, loss[loss=0.2489, simple_loss=0.3061, pruned_loss=0.09585, over 21624.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3564, pruned_loss=0.122, over 3714429.13 frames. ], batch size: 298, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:38:51,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=185370.0, ans=0.125 2023-06-18 16:39:21,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=185490.0, ans=0.125 2023-06-18 16:39:48,856 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.13 vs. limit=10.0 2023-06-18 16:40:25,137 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.835e+02 3.626e+02 4.923e+02 9.458e+02, threshold=7.251e+02, percent-clipped=2.0 2023-06-18 16:40:42,453 INFO [train.py:996] (3/4) Epoch 2, batch 450, loss[loss=0.307, simple_loss=0.3414, pruned_loss=0.1363, over 21546.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.3512, pruned_loss=0.1191, over 3840646.10 frames. ], batch size: 442, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:41:11,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=185670.0, ans=0.0 2023-06-18 16:42:25,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=185850.0, ans=0.0 2023-06-18 16:42:42,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=185910.0, ans=0.0 2023-06-18 16:43:08,789 INFO [train.py:996] (3/4) Epoch 2, batch 500, loss[loss=0.3015, simple_loss=0.3493, pruned_loss=0.1268, over 21296.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3578, pruned_loss=0.1206, over 3939692.32 frames. 
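[Editor's note: the Whitening lines fire when a module's channel covariance looks too anisotropic; the logged metric measures how far the covariance is from a multiple of the identity, and a penalty would apply when it exceeds the limit. A hedged sketch of one such metric, the ratio of the mean squared eigenvalue to the squared mean eigenvalue (exactly 1.0 for perfectly white features), computed via traces so no eigendecomposition is needed; this is a plausible reading of the log, not a claim about the exact scaling.py formula.]

```python
# Sketch: a whitening metric of the kind logged above. For covariance C
# with eigenvalues l_i, metric = mean(l_i^2) / mean(l_i)^2 >= 1, with
# equality when all eigenvalues are equal (fully white features).
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations."""
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]                # (C, C) channel covariance
    c = cov.shape[0]
    # mean(l^2) = trace(C @ C) / c,  mean(l)^2 = (trace(C) / c)^2
    return (torch.trace(cov @ cov) / c / (torch.trace(cov) / c) ** 2).item()

x = torch.randn(1000, 256) * torch.logspace(-2, 1, 256)  # deliberately non-white
print(f"metric={whitening_metric(x):.2f} vs. limit=5.0")  # penalty if metric > limit
```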
], batch size: 143, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:43:32,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=185970.0, ans=0.125 2023-06-18 16:43:55,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=186030.0, ans=0.0 2023-06-18 16:45:09,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=186210.0, ans=0.125 2023-06-18 16:45:18,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.995e+02 3.727e+02 5.025e+02 8.561e+02, threshold=7.454e+02, percent-clipped=5.0 2023-06-18 16:45:20,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-18 16:45:45,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-18 16:45:49,781 INFO [train.py:996] (3/4) Epoch 2, batch 550, loss[loss=0.2965, simple_loss=0.3466, pruned_loss=0.1232, over 21737.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3591, pruned_loss=0.119, over 4013588.63 frames. ], batch size: 112, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:46:17,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.60 vs. limit=10.0 2023-06-18 16:46:20,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=186330.0, ans=0.0 2023-06-18 16:46:59,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=186390.0, ans=0.0 2023-06-18 16:47:29,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=186450.0, ans=0.125 2023-06-18 16:48:05,035 INFO [train.py:996] (3/4) Epoch 2, batch 600, loss[loss=0.2781, simple_loss=0.3293, pruned_loss=0.1135, over 21523.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3615, pruned_loss=0.1189, over 4078298.98 frames. ], batch size: 195, lr: 1.99e-02, grad_scale: 64.0 2023-06-18 16:49:00,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=186630.0, ans=0.125 2023-06-18 16:49:53,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=186750.0, ans=0.02 2023-06-18 16:50:17,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.275e+02 3.817e+02 5.405e+02 1.141e+03, threshold=7.634e+02, percent-clipped=7.0 2023-06-18 16:50:26,072 INFO [train.py:996] (3/4) Epoch 2, batch 650, loss[loss=0.2717, simple_loss=0.3257, pruned_loss=0.1089, over 21860.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3614, pruned_loss=0.119, over 4120548.03 frames. 
], batch size: 98, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:50:38,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=186870.0, ans=0.2 2023-06-18 16:50:38,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=186870.0, ans=0.0 2023-06-18 16:51:26,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=186990.0, ans=0.125 2023-06-18 16:51:52,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2023-06-18 16:52:39,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=187110.0, ans=0.0 2023-06-18 16:52:52,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=187110.0, ans=0.125 2023-06-18 16:53:00,522 INFO [train.py:996] (3/4) Epoch 2, batch 700, loss[loss=0.2506, simple_loss=0.2987, pruned_loss=0.1012, over 21436.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.362, pruned_loss=0.1203, over 4163426.60 frames. ], batch size: 212, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:53:32,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=187170.0, ans=0.125 2023-06-18 16:53:40,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=187230.0, ans=0.04949747468305833 2023-06-18 16:54:37,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=187350.0, ans=0.125 2023-06-18 16:55:08,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 3.018e+02 3.846e+02 4.638e+02 8.103e+02, threshold=7.692e+02, percent-clipped=2.0 2023-06-18 16:55:17,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=187410.0, ans=0.0 2023-06-18 16:55:22,986 INFO [train.py:996] (3/4) Epoch 2, batch 750, loss[loss=0.3285, simple_loss=0.3685, pruned_loss=0.1443, over 21660.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.3625, pruned_loss=0.1212, over 4195362.96 frames. ], batch size: 389, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:55:24,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=187470.0, ans=0.0 2023-06-18 16:55:39,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=187470.0, ans=0.125 2023-06-18 16:56:09,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=187530.0, ans=0.125 2023-06-18 16:56:12,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.76 vs. 
limit=22.5 2023-06-18 16:57:14,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=187710.0, ans=0.0 2023-06-18 16:57:34,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=187710.0, ans=0.0 2023-06-18 16:57:49,843 INFO [train.py:996] (3/4) Epoch 2, batch 800, loss[loss=0.2969, simple_loss=0.3491, pruned_loss=0.1223, over 21523.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3602, pruned_loss=0.1209, over 4217810.39 frames. ], batch size: 441, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:58:18,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=187830.0, ans=0.125 2023-06-18 16:59:20,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=188010.0, ans=0.125 2023-06-18 16:59:24,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.112e+02 3.737e+02 4.664e+02 8.129e+02, threshold=7.474e+02, percent-clipped=2.0 2023-06-18 16:59:45,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-18 16:59:45,758 INFO [train.py:996] (3/4) Epoch 2, batch 850, loss[loss=0.3724, simple_loss=0.3848, pruned_loss=0.18, over 21668.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3577, pruned_loss=0.1205, over 4231693.71 frames. ], batch size: 508, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:59:59,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=188130.0, ans=0.125 2023-06-18 17:00:35,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=188190.0, ans=0.0 2023-06-18 17:01:59,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=188370.0, ans=0.125 2023-06-18 17:02:00,149 INFO [train.py:996] (3/4) Epoch 2, batch 900, loss[loss=0.3134, simple_loss=0.4277, pruned_loss=0.09952, over 19759.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3538, pruned_loss=0.1194, over 4245987.40 frames. ], batch size: 703, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 17:02:08,353 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-18 17:02:09,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=188370.0, ans=0.0 2023-06-18 17:03:01,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=188490.0, ans=0.125 2023-06-18 17:03:34,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=188610.0, ans=0.125 2023-06-18 17:03:44,627 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.912e+02 3.443e+02 3.950e+02 7.749e+02, threshold=6.886e+02, percent-clipped=2.0 2023-06-18 17:03:58,131 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.99 vs. 
limit=15.0 2023-06-18 17:04:02,593 INFO [train.py:996] (3/4) Epoch 2, batch 950, loss[loss=0.284, simple_loss=0.3516, pruned_loss=0.1082, over 21793.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.351, pruned_loss=0.1174, over 4258028.75 frames. ], batch size: 371, lr: 1.98e-02, grad_scale: 16.0 2023-06-18 17:04:51,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=188730.0, ans=0.125 2023-06-18 17:04:52,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=188730.0, ans=0.125 2023-06-18 17:04:57,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=188790.0, ans=0.07 2023-06-18 17:04:59,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=188790.0, ans=0.125 2023-06-18 17:05:27,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.69 vs. limit=15.0 2023-06-18 17:05:57,533 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:05:57,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=188910.0, ans=0.125 2023-06-18 17:05:58,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-18 17:06:18,723 INFO [train.py:996] (3/4) Epoch 2, batch 1000, loss[loss=0.3551, simple_loss=0.3874, pruned_loss=0.1615, over 21601.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3536, pruned_loss=0.1192, over 4262797.60 frames. ], batch size: 471, lr: 1.98e-02, grad_scale: 16.0 2023-06-18 17:06:27,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.20 vs. limit=22.5 2023-06-18 17:06:34,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=15.0 2023-06-18 17:07:08,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-06-18 17:07:30,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0 2023-06-18 17:08:20,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.046e+02 3.757e+02 4.775e+02 8.015e+02, threshold=7.515e+02, percent-clipped=4.0 2023-06-18 17:08:20,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=189210.0, ans=0.0 2023-06-18 17:08:27,771 INFO [train.py:996] (3/4) Epoch 2, batch 1050, loss[loss=0.3232, simple_loss=0.3704, pruned_loss=0.138, over 21291.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3523, pruned_loss=0.1188, over 4269480.65 frames. 
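[Editor's note: alongside the scheduled values, the log periodically dumps diagnostics on the self-attention weights, such as the attn_weights_entropy tensor earlier in this section and the WithLoss ... loss-sum=0.000e+00 entries above (an auxiliary loss attached to the weights, inactive here). A hedged sketch of the entropy diagnostic; the function name and shapes are illustrative.]

```python
# Sketch: per-head entropy of softmaxed attention weights, one way to
# produce a diagnostic like attn_weights_entropy. High entropy means a
# head attends broadly; low entropy means it is nearly one-hot.
import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    """attn: (num_heads, query_len, key_len), rows already softmaxed."""
    ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)  # (heads, queries)
    return ent.mean(dim=-1)                           # average over queries

attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)
print(attn_weights_entropy(attn))  # one value per head, cf. the logged tensor
```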
], batch size: 159, lr: 1.97e-02, grad_scale: 16.0 2023-06-18 17:09:09,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189330.0, ans=0.1 2023-06-18 17:09:41,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=189390.0, ans=0.125 2023-06-18 17:09:53,108 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:10:35,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=189510.0, ans=0.2 2023-06-18 17:10:46,337 INFO [train.py:996] (3/4) Epoch 2, batch 1100, loss[loss=0.2838, simple_loss=0.3408, pruned_loss=0.1134, over 21835.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3532, pruned_loss=0.119, over 4269543.45 frames. ], batch size: 298, lr: 1.97e-02, grad_scale: 16.0 2023-06-18 17:11:07,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=189570.0, ans=0.1 2023-06-18 17:11:27,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=189630.0, ans=0.125 2023-06-18 17:11:43,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=189630.0, ans=0.0 2023-06-18 17:12:30,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.384e+02 3.965e+02 4.947e+02 9.506e+02, threshold=7.929e+02, percent-clipped=4.0 2023-06-18 17:12:48,323 INFO [train.py:996] (3/4) Epoch 2, batch 1150, loss[loss=0.2423, simple_loss=0.3228, pruned_loss=0.08089, over 21728.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3533, pruned_loss=0.1182, over 4272267.52 frames. ], batch size: 247, lr: 1.97e-02, grad_scale: 16.0 2023-06-18 17:13:53,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=189990.0, ans=10.0 2023-06-18 17:14:05,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-18 17:14:06,323 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:14:16,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-18 17:14:17,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=190050.0, ans=0.125 2023-06-18 17:14:54,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190110.0, ans=0.1 2023-06-18 17:15:15,718 INFO [train.py:996] (3/4) Epoch 2, batch 1200, loss[loss=0.3046, simple_loss=0.364, pruned_loss=0.1226, over 21901.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3548, pruned_loss=0.1191, over 4278134.00 frames. 
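[Editor's note: the batch sizes in these entries swing from roughly 100 to 700 cuts while the per-batch frame counts stay comparable, the signature of a duration-bounded bucketing sampler: each batch is filled up to a fixed total duration from similar-length cuts, so buckets of short utterances yield many cuts per batch. A sketch of the corresponding lhotse setup; the max_duration/num_buckets values are assumed to match this run's options and the manifest path is hypothetical.]

```python
# Sketch: a duration-bounded bucketing sampler of the kind producing
# the variable "batch size" values above.
from lhotse import CutSet
from lhotse.dataset.sampling import DynamicBucketingSampler

cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # hypothetical path
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=900.0,   # total seconds of audio per batch
    num_buckets=30,       # group cuts of similar length together
    shuffle=True,
)
for batch_cuts in sampler:
    # ~900 s per batch: many short cuts or a few long ones, hence
    # batch sizes ranging from ~100 to ~700 in the log.
    print(len(batch_cuts))
    break
```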
], batch size: 316, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:16:56,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=190410.0, ans=0.0 2023-06-18 17:17:01,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=190410.0, ans=0.0 2023-06-18 17:17:04,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.111e+02 4.019e+02 4.816e+02 7.916e+02, threshold=8.038e+02, percent-clipped=0.0 2023-06-18 17:17:25,271 INFO [train.py:996] (3/4) Epoch 2, batch 1250, loss[loss=0.279, simple_loss=0.3262, pruned_loss=0.1159, over 21493.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3562, pruned_loss=0.1205, over 4270473.31 frames. ], batch size: 194, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:18:39,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-18 17:18:40,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=190590.0, ans=0.125 2023-06-18 17:19:06,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=190650.0, ans=0.0 2023-06-18 17:19:08,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=190650.0, ans=0.125 2023-06-18 17:19:41,851 INFO [train.py:996] (3/4) Epoch 2, batch 1300, loss[loss=0.2928, simple_loss=0.3651, pruned_loss=0.1103, over 21829.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3576, pruned_loss=0.1211, over 4279495.18 frames. ], batch size: 332, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:20:49,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=190890.0, ans=0.2 2023-06-18 17:20:52,369 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:21:41,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.359e+02 4.196e+02 5.340e+02 8.417e+02, threshold=8.392e+02, percent-clipped=1.0 2023-06-18 17:21:54,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191010.0, ans=0.1 2023-06-18 17:21:58,778 INFO [train.py:996] (3/4) Epoch 2, batch 1350, loss[loss=0.2841, simple_loss=0.3567, pruned_loss=0.1058, over 21670.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3589, pruned_loss=0.122, over 4285391.98 frames. ], batch size: 247, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:22:03,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.99 vs. limit=15.0 2023-06-18 17:22:23,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=191070.0, ans=0.125 2023-06-18 17:22:56,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=191190.0, ans=0.125 2023-06-18 17:23:49,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=12.0 2023-06-18 17:24:00,734 INFO [train.py:996] (3/4) Epoch 2, batch 1400, loss[loss=0.3333, simple_loss=0.3882, pruned_loss=0.1392, over 21349.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3562, pruned_loss=0.1208, over 4289102.28 frames. ], batch size: 548, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:24:08,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=191370.0, ans=0.025 2023-06-18 17:24:51,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=191430.0, ans=0.2 2023-06-18 17:24:52,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-18 17:26:00,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.213e+02 3.659e+02 4.697e+02 8.447e+02, threshold=7.317e+02, percent-clipped=1.0 2023-06-18 17:26:07,941 INFO [train.py:996] (3/4) Epoch 2, batch 1450, loss[loss=0.3048, simple_loss=0.3598, pruned_loss=0.1249, over 21647.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3567, pruned_loss=0.1218, over 4288894.79 frames. ], batch size: 112, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:26:23,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=191670.0, ans=0.125 2023-06-18 17:26:35,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191730.0, ans=0.1 2023-06-18 17:27:08,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=191790.0, ans=0.0 2023-06-18 17:27:34,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=191850.0, ans=0.125 2023-06-18 17:27:45,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=191910.0, ans=0.125 2023-06-18 17:28:22,319 INFO [train.py:996] (3/4) Epoch 2, batch 1500, loss[loss=0.2647, simple_loss=0.357, pruned_loss=0.08616, over 19797.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3604, pruned_loss=0.1246, over 4294199.39 frames. ], batch size: 702, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:28:45,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.52 vs. limit=15.0 2023-06-18 17:29:19,146 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:29:33,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=192090.0, ans=15.0 2023-06-18 17:30:25,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 3.308e+02 4.250e+02 5.444e+02 1.048e+03, threshold=8.500e+02, percent-clipped=7.0 2023-06-18 17:30:36,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=192210.0, ans=0.035 2023-06-18 17:30:39,390 INFO [train.py:996] (3/4) Epoch 2, batch 1550, loss[loss=0.2925, simple_loss=0.349, pruned_loss=0.118, over 21011.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3587, pruned_loss=0.1237, over 4292998.72 frames. 
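[Editor's note: the balancer*.prob, min_positive/max_positive and min_abs/max_abs values describe activation balancers: modules that, on a random subset of batches chosen with the scheduled probability, nudge gradients so per-channel statistics stay within the configured bounds. A hedged, forward-only sketch of the statistics being constrained; the corrective gradient itself is omitted and the function is hypothetical.]

```python
# Sketch: the per-channel statistics an activation balancer constrains.
# With probability `prob` per batch (0.125 in the entries above),
# channels outside the bounds would receive a small corrective gradient.
import torch

def balancer_violations(x: torch.Tensor, min_positive=0.05, max_positive=0.95,
                        min_abs=0.02, max_abs=10.0) -> dict:
    """x: (num_frames, num_channels). Counts channels violating each bound."""
    pos_frac = (x > 0).float().mean(dim=0)  # fraction of positive values
    mean_abs = x.abs().mean(dim=0)          # mean |activation| per channel
    return {
        "too_few_positive": int((pos_frac < min_positive).sum()),
        "too_many_positive": int((pos_frac > max_positive).sum()),
        "too_small_abs": int((mean_abs < min_abs).sum()),
        "too_large_abs": int((mean_abs > max_abs).sum()),
    }

if torch.rand(()) < 0.125:  # the scheduled balancer prob from the log
    print(balancer_violations(torch.randn(1000, 256)))
```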
], batch size: 608, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:31:02,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=192330.0, ans=0.125 2023-06-18 17:32:15,346 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-18 17:32:53,938 INFO [train.py:996] (3/4) Epoch 2, batch 1600, loss[loss=0.2635, simple_loss=0.3397, pruned_loss=0.0936, over 21724.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3564, pruned_loss=0.1226, over 4284466.99 frames. ], batch size: 391, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:33:18,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=192630.0, ans=0.2 2023-06-18 17:34:47,617 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 3.115e+02 3.710e+02 4.975e+02 8.119e+02, threshold=7.421e+02, percent-clipped=0.0 2023-06-18 17:34:54,925 INFO [train.py:996] (3/4) Epoch 2, batch 1650, loss[loss=0.3027, simple_loss=0.3574, pruned_loss=0.124, over 21122.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3552, pruned_loss=0.1213, over 4283788.51 frames. ], batch size: 608, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:35:44,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=192930.0, ans=0.2 2023-06-18 17:36:08,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-18 17:37:12,716 INFO [train.py:996] (3/4) Epoch 2, batch 1700, loss[loss=0.2978, simple_loss=0.3797, pruned_loss=0.108, over 21854.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3581, pruned_loss=0.1221, over 4282858.16 frames. ], batch size: 316, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:37:16,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=193170.0, ans=0.2 2023-06-18 17:37:17,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=193170.0, ans=0.2 2023-06-18 17:37:51,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=193230.0, ans=0.2 2023-06-18 17:38:28,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=193290.0, ans=0.2 2023-06-18 17:38:37,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=193290.0, ans=0.125 2023-06-18 17:38:58,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=193350.0, ans=0.07 2023-06-18 17:39:16,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 3.411e+02 4.217e+02 5.262e+02 9.170e+02, threshold=8.435e+02, percent-clipped=4.0 2023-06-18 17:39:17,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=193410.0, ans=0.125 2023-06-18 17:39:23,247 INFO [train.py:996] (3/4) Epoch 2, batch 1750, loss[loss=0.2228, simple_loss=0.2964, pruned_loss=0.07467, over 21400.00 frames. ], tot_loss[loss=0.3, simple_loss=0.358, pruned_loss=0.121, over 4266601.60 frames. 
], batch size: 194, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:40:18,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193530.0, ans=0.1 2023-06-18 17:40:20,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=193530.0, ans=0.125 2023-06-18 17:41:02,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=193650.0, ans=0.125 2023-06-18 17:41:44,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-18 17:41:59,275 INFO [train.py:996] (3/4) Epoch 2, batch 1800, loss[loss=0.1957, simple_loss=0.2682, pruned_loss=0.06161, over 21429.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3545, pruned_loss=0.1168, over 4267271.20 frames. ], batch size: 211, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:42:26,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=193770.0, ans=0.125 2023-06-18 17:43:16,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.27 vs. limit=10.0 2023-06-18 17:43:17,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=47.08 vs. limit=15.0 2023-06-18 17:43:41,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=193950.0, ans=0.125 2023-06-18 17:44:08,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.686e+02 3.358e+02 3.965e+02 7.229e+02, threshold=6.717e+02, percent-clipped=0.0 2023-06-18 17:44:15,777 INFO [train.py:996] (3/4) Epoch 2, batch 1850, loss[loss=0.352, simple_loss=0.4116, pruned_loss=0.1462, over 21453.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3535, pruned_loss=0.1139, over 4262232.55 frames. ], batch size: 507, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:45:20,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-18 17:45:30,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=194190.0, ans=0.2 2023-06-18 17:45:34,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=194250.0, ans=0.125 2023-06-18 17:45:37,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=194250.0, ans=0.0 2023-06-18 17:46:25,741 INFO [train.py:996] (3/4) Epoch 2, batch 1900, loss[loss=0.2736, simple_loss=0.3411, pruned_loss=0.103, over 21838.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.354, pruned_loss=0.1157, over 4273976.22 frames. 
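[Editor's note: the learning rate decays very slowly at this point (2.01e-02 at the start of epoch 2, 1.95e-02 a few thousand batches later), consistent with a schedule that is polynomial in both the batch index and the fractional epoch. A hedged sketch of an Eden-style rule; base_lr, lr_batches and lr_epochs mirror this run's options, but the overall normalization is an assumption, so treat the output as the shape of the curve rather than the exact logged values.]

```python
# Sketch of an Eden-style learning-rate rule: slow polynomial decay in
# both the batch index and the (fractional) epoch.

def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Late in training, successive values change only in the third
# significant digit, matching the gentle 2.01e-02 -> 1.95e-02 drift
# across the entries above.
print(eden_lr(0.045, 183_000, 2.0), eden_lr(0.045, 195_000, 2.0))
```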
], batch size: 371, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:46:39,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=194370.0, ans=0.125 2023-06-18 17:47:25,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=194490.0, ans=0.0 2023-06-18 17:47:44,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-18 17:48:18,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 3.031e+02 3.569e+02 4.348e+02 9.339e+02, threshold=7.139e+02, percent-clipped=5.0 2023-06-18 17:48:24,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=194670.0, ans=0.125 2023-06-18 17:48:25,697 INFO [train.py:996] (3/4) Epoch 2, batch 1950, loss[loss=0.3959, simple_loss=0.4307, pruned_loss=0.1806, over 21742.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3494, pruned_loss=0.1153, over 4281478.60 frames. ], batch size: 441, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:48:48,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=194670.0, ans=0.2 2023-06-18 17:49:00,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=194730.0, ans=0.125 2023-06-18 17:49:25,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=194790.0, ans=0.0 2023-06-18 17:50:31,142 INFO [train.py:996] (3/4) Epoch 2, batch 2000, loss[loss=0.2435, simple_loss=0.3129, pruned_loss=0.08709, over 21820.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3459, pruned_loss=0.1142, over 4279771.55 frames. ], batch size: 333, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:51:08,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=195030.0, ans=10.0 2023-06-18 17:51:11,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=195030.0, ans=0.125 2023-06-18 17:51:50,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=195150.0, ans=0.95 2023-06-18 17:51:51,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=195150.0, ans=0.04949747468305833 2023-06-18 17:52:30,336 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.996e+02 3.369e+02 4.467e+02 7.434e+02, threshold=6.738e+02, percent-clipped=1.0 2023-06-18 17:52:35,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=195210.0, ans=0.125 2023-06-18 17:52:37,738 INFO [train.py:996] (3/4) Epoch 2, batch 2050, loss[loss=0.3095, simple_loss=0.3619, pruned_loss=0.1285, over 21881.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3487, pruned_loss=0.1148, over 4284683.62 frames. 
], batch size: 351, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:53:01,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=195330.0, ans=0.0 2023-06-18 17:53:17,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=195330.0, ans=0.125 2023-06-18 17:53:20,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=195330.0, ans=0.2 2023-06-18 17:53:52,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=195390.0, ans=0.125 2023-06-18 17:53:59,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=195450.0, ans=0.125 2023-06-18 17:54:09,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=195510.0, ans=0.2 2023-06-18 17:54:48,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=195570.0, ans=0.125 2023-06-18 17:54:52,689 INFO [train.py:996] (3/4) Epoch 2, batch 2100, loss[loss=0.2674, simple_loss=0.3406, pruned_loss=0.09707, over 21823.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3527, pruned_loss=0.1165, over 4286266.54 frames. ], batch size: 102, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 17:54:55,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.94 vs. limit=22.5 2023-06-18 17:55:01,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=195570.0, ans=0.125 2023-06-18 17:55:36,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195690.0, ans=0.1 2023-06-18 17:56:07,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=195750.0, ans=0.2 2023-06-18 17:56:42,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-18 17:56:47,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.911e+02 3.457e+02 4.182e+02 7.593e+02, threshold=6.915e+02, percent-clipped=4.0 2023-06-18 17:56:52,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=195810.0, ans=0.125 2023-06-18 17:56:54,803 INFO [train.py:996] (3/4) Epoch 2, batch 2150, loss[loss=0.271, simple_loss=0.3279, pruned_loss=0.107, over 21608.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3545, pruned_loss=0.1188, over 4292074.60 frames. 
], batch size: 298, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 17:57:12,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=195870.0, ans=0.2 2023-06-18 17:57:24,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=195870.0, ans=0.2 2023-06-18 17:57:30,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=195930.0, ans=0.035 2023-06-18 17:58:00,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=195990.0, ans=0.0 2023-06-18 17:58:06,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-18 17:58:42,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=196050.0, ans=0.0 2023-06-18 17:59:17,874 INFO [train.py:996] (3/4) Epoch 2, batch 2200, loss[loss=0.2904, simple_loss=0.36, pruned_loss=0.1104, over 21727.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3569, pruned_loss=0.1194, over 4288968.46 frames. ], batch size: 298, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 17:59:23,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=196170.0, ans=0.2 2023-06-18 17:59:59,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.24 vs. limit=22.5 2023-06-18 18:00:54,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=196350.0, ans=0.125 2023-06-18 18:00:56,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.45 vs. limit=6.0 2023-06-18 18:01:21,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-18 18:01:21,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.132e+02 3.798e+02 4.923e+02 7.724e+02, threshold=7.595e+02, percent-clipped=3.0 2023-06-18 18:01:23,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.99 vs. limit=15.0 2023-06-18 18:01:29,015 INFO [train.py:996] (3/4) Epoch 2, batch 2250, loss[loss=0.2876, simple_loss=0.3374, pruned_loss=0.1189, over 21580.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3509, pruned_loss=0.1152, over 4281562.52 frames. ], batch size: 263, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 18:01:40,114 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-18 18:01:40,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=196470.0, ans=0.125 2023-06-18 18:03:25,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=15.0 2023-06-18 18:03:31,110 INFO [train.py:996] (3/4) Epoch 2, batch 2300, loss[loss=0.2553, simple_loss=0.3026, pruned_loss=0.104, over 21759.00 frames. 
], tot_loss[loss=0.2884, simple_loss=0.3463, pruned_loss=0.1153, over 4286306.60 frames. ], batch size: 124, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 18:03:31,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=196770.0, ans=0.0 2023-06-18 18:03:41,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=196770.0, ans=0.125 2023-06-18 18:03:56,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=196830.0, ans=0.125 2023-06-18 18:05:05,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=196950.0, ans=0.015 2023-06-18 18:05:16,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.103e+02 3.931e+02 4.901e+02 7.209e+02, threshold=7.862e+02, percent-clipped=0.0 2023-06-18 18:05:28,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=197070.0, ans=0.1 2023-06-18 18:05:29,587 INFO [train.py:996] (3/4) Epoch 2, batch 2350, loss[loss=0.2568, simple_loss=0.3034, pruned_loss=0.1051, over 21613.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3416, pruned_loss=0.116, over 4293806.16 frames. ], batch size: 298, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 18:05:59,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=197070.0, ans=0.2 2023-06-18 18:07:41,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=197310.0, ans=0.2 2023-06-18 18:07:48,757 INFO [train.py:996] (3/4) Epoch 2, batch 2400, loss[loss=0.3544, simple_loss=0.3995, pruned_loss=0.1546, over 21577.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3485, pruned_loss=0.1194, over 4290329.45 frames. ], batch size: 415, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 18:08:32,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=197430.0, ans=0.0 2023-06-18 18:08:41,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=197490.0, ans=0.125 2023-06-18 18:09:26,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=197550.0, ans=10.0 2023-06-18 18:09:43,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=197610.0, ans=0.02 2023-06-18 18:09:47,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.219e+02 3.807e+02 5.185e+02 8.292e+02, threshold=7.615e+02, percent-clipped=0.0 2023-06-18 18:10:11,965 INFO [train.py:996] (3/4) Epoch 2, batch 2450, loss[loss=0.2817, simple_loss=0.3353, pruned_loss=0.1141, over 21337.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.354, pruned_loss=0.1214, over 4292465.34 frames. ], batch size: 194, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:10:43,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=197730.0, ans=0.125 2023-06-18 18:10:53,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.99 vs. 
limit=22.5 2023-06-18 18:11:13,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=197790.0, ans=0.125 2023-06-18 18:11:40,645 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:12:02,894 INFO [train.py:996] (3/4) Epoch 2, batch 2500, loss[loss=0.2954, simple_loss=0.3695, pruned_loss=0.1107, over 21706.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3515, pruned_loss=0.1211, over 4283562.84 frames. ], batch size: 247, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:12:08,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=197970.0, ans=0.125 2023-06-18 18:12:15,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=197970.0, ans=0.2 2023-06-18 18:12:34,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=198030.0, ans=0.0 2023-06-18 18:12:55,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-18 18:13:46,147 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.97 vs. limit=15.0 2023-06-18 18:13:55,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=198150.0, ans=0.125 2023-06-18 18:14:05,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 3.063e+02 3.765e+02 4.527e+02 8.351e+02, threshold=7.530e+02, percent-clipped=1.0 2023-06-18 18:14:20,533 INFO [train.py:996] (3/4) Epoch 2, batch 2550, loss[loss=0.3005, simple_loss=0.3401, pruned_loss=0.1305, over 21309.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3502, pruned_loss=0.1197, over 4274407.65 frames. ], batch size: 471, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:15:32,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=198390.0, ans=0.2 2023-06-18 18:16:35,994 INFO [train.py:996] (3/4) Epoch 2, batch 2600, loss[loss=0.3306, simple_loss=0.3813, pruned_loss=0.1399, over 21729.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.3509, pruned_loss=0.1212, over 4265605.04 frames. ], batch size: 415, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:17:07,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198630.0, ans=0.1 2023-06-18 18:18:35,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.099e+02 3.578e+02 4.381e+02 8.745e+02, threshold=7.157e+02, percent-clipped=2.0 2023-06-18 18:18:47,910 INFO [train.py:996] (3/4) Epoch 2, batch 2650, loss[loss=0.3208, simple_loss=0.3542, pruned_loss=0.1437, over 19996.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3545, pruned_loss=0.1237, over 4276330.79 frames. ], batch size: 702, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:18:52,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. 
limit=6.0 2023-06-18 18:20:47,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199110.0, ans=0.1 2023-06-18 18:20:56,645 INFO [train.py:996] (3/4) Epoch 2, batch 2700, loss[loss=0.3136, simple_loss=0.4212, pruned_loss=0.103, over 19845.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.353, pruned_loss=0.1209, over 4270736.31 frames. ], batch size: 703, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:21:29,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.89 vs. limit=6.0 2023-06-18 18:21:40,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=199290.0, ans=0.125 2023-06-18 18:22:12,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-18 18:22:21,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199410.0, ans=0.1 2023-06-18 18:22:39,837 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.312e+02 2.957e+02 3.708e+02 4.749e+02 8.354e+02, threshold=7.415e+02, percent-clipped=2.0 2023-06-18 18:22:53,012 INFO [train.py:996] (3/4) Epoch 2, batch 2750, loss[loss=0.3169, simple_loss=0.3577, pruned_loss=0.138, over 21878.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3537, pruned_loss=0.1211, over 4270842.13 frames. ], batch size: 351, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:22:53,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=199470.0, ans=0.0 2023-06-18 18:23:01,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=199470.0, ans=0.0 2023-06-18 18:23:09,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-18 18:23:12,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=199530.0, ans=0.125 2023-06-18 18:23:18,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=199530.0, ans=0.125 2023-06-18 18:23:19,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=199530.0, ans=0.125 2023-06-18 18:24:56,954 INFO [train.py:996] (3/4) Epoch 2, batch 2800, loss[loss=0.3817, simple_loss=0.4437, pruned_loss=0.1598, over 21673.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3577, pruned_loss=0.1213, over 4272991.51 frames. 
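The optim.py `Clipping_scale` lines print five gradient-norm statistics that read as min/25%/median/75%/max over a window of recent gradient norms, and in every entry here the threshold equals Clipping_scale times the median (e.g. 2.0 x 3.708e+02 = 7.416e+02 against the logged 7.415e+02 just above). A sketch of that bookkeeping; the windowing mechanics are assumed for illustration:

# Reproduce the quartile line and the clipping-scale * median threshold.
import torch

def clipping_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # grad_norms: 1-D tensor of recent per-step gradient norms
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]                       # 2.0 * median
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped

# Feeding the five values printed above back in reproduces them exactly,
# and gives threshold = 2.0 * 370.8 = 741.6 (logged as 7.415e+02).
q, thr, pct = clipping_stats(torch.tensor([231.2, 295.7, 370.8, 474.9, 835.4]))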
], batch size: 389, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:26:25,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=199950.0, ans=10.0 2023-06-18 18:26:25,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=199950.0, ans=0.0 2023-06-18 18:27:05,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.554e+02 3.576e+02 4.220e+02 5.445e+02 9.845e+02, threshold=8.440e+02, percent-clipped=5.0 2023-06-18 18:27:13,002 INFO [train.py:996] (3/4) Epoch 2, batch 2850, loss[loss=0.2837, simple_loss=0.3516, pruned_loss=0.1079, over 21652.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3575, pruned_loss=0.121, over 4273267.27 frames. ], batch size: 414, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:28:15,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=200190.0, ans=0.125 2023-06-18 18:28:34,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=200250.0, ans=0.125 2023-06-18 18:29:02,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.78 vs. limit=22.5 2023-06-18 18:29:23,641 INFO [train.py:996] (3/4) Epoch 2, batch 2900, loss[loss=0.3221, simple_loss=0.3657, pruned_loss=0.1393, over 21853.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3551, pruned_loss=0.1204, over 4275888.29 frames. ], batch size: 371, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:30:57,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=200550.0, ans=0.2 2023-06-18 18:31:26,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.217e+02 3.802e+02 4.716e+02 7.016e+02, threshold=7.604e+02, percent-clipped=0.0 2023-06-18 18:31:34,050 INFO [train.py:996] (3/4) Epoch 2, batch 2950, loss[loss=0.2449, simple_loss=0.3052, pruned_loss=0.09233, over 21613.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3539, pruned_loss=0.1198, over 4279300.73 frames. ], batch size: 263, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:32:01,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=200670.0, ans=0.125 2023-06-18 18:32:37,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=200730.0, ans=0.125 2023-06-18 18:33:14,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-18 18:33:33,627 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.71 vs. limit=6.0 2023-06-18 18:34:01,565 INFO [train.py:996] (3/4) Epoch 2, batch 3000, loss[loss=0.2893, simple_loss=0.3634, pruned_loss=0.1076, over 21653.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3586, pruned_loss=0.1214, over 4282508.27 frames. ], batch size: 230, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:34:01,566 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 18:34:49,992 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.277, simple_loss=0.3697, pruned_loss=0.09215, over 1796401.00 frames. 
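The loss fields in the train and validation entries are mutually consistent with the printed `loss` being a weighted sum of `simple_loss` and `pruned_loss`, with a 0.5 weight on the simple term: for the validation line above, 0.5 * 0.3697 + 0.09215 = 0.277. A sketch of that relation, inferred from the printed numbers rather than taken from the code:

# How the logged `loss` appears to relate to its two components. The 0.5
# weight is inferred from the printed values, not from an authoritative source.
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    """Combine the simple (linear-encoder) and pruned transducer losses."""
    return simple_loss_scale * simple_loss + pruned_loss

# Matches the validation entry above to the printed precision.
assert abs(combined_loss(0.3697, 0.09215) - 0.277) < 1e-3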
2023-06-18 18:34:49,994 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-18 18:34:52,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=200970.0, ans=0.05 2023-06-18 18:35:07,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=200970.0, ans=0.125 2023-06-18 18:35:12,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=201030.0, ans=0.125 2023-06-18 18:35:34,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=201090.0, ans=0.125 2023-06-18 18:35:34,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=201090.0, ans=0.125 2023-06-18 18:36:10,648 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:36:37,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=201210.0, ans=0.125 2023-06-18 18:36:50,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.746e+02 3.199e+02 3.819e+02 7.191e+02, threshold=6.398e+02, percent-clipped=0.0 2023-06-18 18:36:52,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=201210.0, ans=0.0 2023-06-18 18:37:01,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=201270.0, ans=0.0 2023-06-18 18:37:02,541 INFO [train.py:996] (3/4) Epoch 2, batch 3050, loss[loss=0.2273, simple_loss=0.314, pruned_loss=0.0703, over 21745.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3584, pruned_loss=0.1206, over 4283511.28 frames. ], batch size: 298, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:37:36,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=201330.0, ans=0.125 2023-06-18 18:39:10,279 INFO [train.py:996] (3/4) Epoch 2, batch 3100, loss[loss=0.277, simple_loss=0.3599, pruned_loss=0.09708, over 21680.00 frames. ], tot_loss[loss=0.2985, simple_loss=0.3581, pruned_loss=0.1195, over 4288057.51 frames. ], batch size: 441, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:39:12,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=201570.0, ans=0.125 2023-06-18 18:39:53,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=201630.0, ans=0.125 2023-06-18 18:40:02,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=201630.0, ans=0.125 2023-06-18 18:40:24,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. 
limit=15.0 2023-06-18 18:40:50,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=201750.0, ans=0.125 2023-06-18 18:41:08,929 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.803e+02 3.511e+02 4.276e+02 6.392e+02, threshold=7.021e+02, percent-clipped=0.0 2023-06-18 18:41:28,770 INFO [train.py:996] (3/4) Epoch 2, batch 3150, loss[loss=0.3245, simple_loss=0.3882, pruned_loss=0.1304, over 21667.00 frames. ], tot_loss[loss=0.2976, simple_loss=0.3579, pruned_loss=0.1186, over 4286194.55 frames. ], batch size: 441, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:42:02,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=201930.0, ans=0.0 2023-06-18 18:43:11,802 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:43:20,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=202110.0, ans=0.0 2023-06-18 18:43:38,663 INFO [train.py:996] (3/4) Epoch 2, batch 3200, loss[loss=0.2588, simple_loss=0.3275, pruned_loss=0.09508, over 21746.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.359, pruned_loss=0.1193, over 4287585.72 frames. ], batch size: 247, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:43:56,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=202170.0, ans=0.2 2023-06-18 18:44:24,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.23 vs. limit=6.0 2023-06-18 18:44:27,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=202230.0, ans=0.025 2023-06-18 18:44:27,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=202230.0, ans=0.125 2023-06-18 18:45:06,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202350.0, ans=0.1 2023-06-18 18:45:50,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 3.485e+02 4.090e+02 5.140e+02 8.255e+02, threshold=8.180e+02, percent-clipped=4.0 2023-06-18 18:45:59,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=202410.0, ans=0.0 2023-06-18 18:46:02,318 INFO [train.py:996] (3/4) Epoch 2, batch 3250, loss[loss=0.3219, simple_loss=0.361, pruned_loss=0.1414, over 21491.00 frames. ], tot_loss[loss=0.3033, simple_loss=0.3622, pruned_loss=0.1221, over 4282273.11 frames. ], batch size: 389, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:46:26,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=202530.0, ans=0.0 2023-06-18 18:47:10,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-06-18 18:48:10,953 INFO [train.py:996] (3/4) Epoch 2, batch 3300, loss[loss=0.3311, simple_loss=0.3529, pruned_loss=0.1547, over 21315.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3581, pruned_loss=0.1224, over 4283377.56 frames. 
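The `Whitening: ... metric=X vs. limit=Y` lines are emitted when a feature-decorrelation statistic exceeds its scheduled limit. One common whitening measure is the dispersion of the channel-covariance eigenvalues, which equals 1.0 for perfectly white (identity-covariance) features; the proxy below is a plausible stand-in, not necessarily the exact statistic scaling.py computes:

# Hedged proxy for the whitening metric: eigenvalue dispersion of the
# per-channel covariance of a batch of feature frames.
import torch

def whitening_metric(feats: torch.Tensor) -> torch.Tensor:
    # feats: (num_frames, num_channels)
    x = feats - feats.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)           # real eigenvalues, ascending
    return (eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20)

x = torch.randn(10000, 256)      # already-white features
print(whitening_metric(x))       # ~= 1.0, well under limits like 15.0 above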
], batch size: 507, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:50:26,641 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.159e+02 3.765e+02 4.410e+02 7.992e+02, threshold=7.530e+02, percent-clipped=0.0 2023-06-18 18:50:32,589 INFO [train.py:996] (3/4) Epoch 2, batch 3350, loss[loss=0.3198, simple_loss=0.3598, pruned_loss=0.1399, over 21583.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3615, pruned_loss=0.1216, over 4276171.32 frames. ], batch size: 548, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:50:37,773 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.32 vs. limit=22.5 2023-06-18 18:50:40,203 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:51:38,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=203190.0, ans=0.0 2023-06-18 18:52:11,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=203250.0, ans=0.125 2023-06-18 18:52:47,707 INFO [train.py:996] (3/4) Epoch 2, batch 3400, loss[loss=0.2778, simple_loss=0.3405, pruned_loss=0.1075, over 21508.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3614, pruned_loss=0.1225, over 4286267.29 frames. ], batch size: 230, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:52:48,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-18 18:54:52,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.921e+02 3.464e+02 4.240e+02 8.326e+02, threshold=6.928e+02, percent-clipped=2.0 2023-06-18 18:54:58,314 INFO [train.py:996] (3/4) Epoch 2, batch 3450, loss[loss=0.4688, simple_loss=0.488, pruned_loss=0.2248, over 21383.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3554, pruned_loss=0.1219, over 4281879.13 frames. ], batch size: 507, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:55:20,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=203670.0, ans=0.2 2023-06-18 18:55:22,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=203670.0, ans=0.1 2023-06-18 18:55:56,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=203730.0, ans=0.0 2023-06-18 18:56:22,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=203790.0, ans=0.125 2023-06-18 18:56:31,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=203790.0, ans=0.2 2023-06-18 18:56:49,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=203850.0, ans=0.0 2023-06-18 18:56:57,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-18 18:57:28,385 INFO [train.py:996] (3/4) Epoch 2, batch 3500, loss[loss=0.4274, simple_loss=0.4616, pruned_loss=0.1966, over 21427.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3644, pruned_loss=0.1268, over 4278977.45 frames. 
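The fractional frame counts in entries like `tot_loss[..., over 4278977.45 frames.]` suggest the running totals are exponentially decayed rather than plain sums, so `tot_loss` behaves as a frame-weighted moving average over recent batches. A minimal sketch under that assumption; the 0.9 decay per logging step is illustrative:

# Decayed frame-weighted running loss, consistent with the non-integer
# frame counts printed in the tot_loss fields.
class RunningLoss:
    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed sum of frames (hence fractional)

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)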
], batch size: 471, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:57:40,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203970.0, ans=0.1 2023-06-18 18:57:42,699 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.47 vs. limit=6.0 2023-06-18 18:58:08,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-18 18:59:25,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.105e+02 3.610e+02 4.528e+02 8.050e+02, threshold=7.220e+02, percent-clipped=2.0 2023-06-18 18:59:46,323 INFO [train.py:996] (3/4) Epoch 2, batch 3550, loss[loss=0.2746, simple_loss=0.3193, pruned_loss=0.1149, over 21615.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3674, pruned_loss=0.1283, over 4283899.74 frames. ], batch size: 298, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:00:08,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2023-06-18 19:00:13,716 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. limit=10.0 2023-06-18 19:00:32,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=22.5 2023-06-18 19:01:36,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.67 vs. limit=10.0 2023-06-18 19:01:43,299 INFO [train.py:996] (3/4) Epoch 2, batch 3600, loss[loss=0.3738, simple_loss=0.402, pruned_loss=0.1728, over 21365.00 frames. ], tot_loss[loss=0.3082, simple_loss=0.3618, pruned_loss=0.1273, over 4280319.11 frames. ], batch size: 471, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:03:29,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=204750.0, ans=0.0 2023-06-18 19:03:58,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.966e+02 3.675e+02 4.582e+02 9.024e+02, threshold=7.350e+02, percent-clipped=3.0 2023-06-18 19:04:11,490 INFO [train.py:996] (3/4) Epoch 2, batch 3650, loss[loss=0.2621, simple_loss=0.3363, pruned_loss=0.09393, over 21772.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3621, pruned_loss=0.1273, over 4281430.23 frames. ], batch size: 247, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:04:12,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-18 19:05:03,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=204930.0, ans=0.0 2023-06-18 19:05:16,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-18 19:05:36,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. 
limit=15.0 2023-06-18 19:05:43,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0 2023-06-18 19:06:10,720 INFO [train.py:996] (3/4) Epoch 2, batch 3700, loss[loss=0.3211, simple_loss=0.3771, pruned_loss=0.1326, over 21783.00 frames. ], tot_loss[loss=0.306, simple_loss=0.36, pruned_loss=0.126, over 4274939.31 frames. ], batch size: 389, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:06:41,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=205230.0, ans=0.125 2023-06-18 19:06:48,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=205230.0, ans=0.2 2023-06-18 19:08:22,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.719e+02 3.274e+02 3.792e+02 6.740e+02, threshold=6.549e+02, percent-clipped=0.0 2023-06-18 19:08:33,927 INFO [train.py:996] (3/4) Epoch 2, batch 3750, loss[loss=0.3148, simple_loss=0.3667, pruned_loss=0.1314, over 21569.00 frames. ], tot_loss[loss=0.3042, simple_loss=0.3582, pruned_loss=0.1251, over 4278136.90 frames. ], batch size: 471, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:09:13,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=205530.0, ans=0.125 2023-06-18 19:09:16,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=205530.0, ans=0.0 2023-06-18 19:09:25,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=205530.0, ans=0.125 2023-06-18 19:09:31,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205590.0, ans=0.1 2023-06-18 19:09:34,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=205590.0, ans=0.2 2023-06-18 19:09:34,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=205590.0, ans=0.125 2023-06-18 19:10:34,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5 2023-06-18 19:10:35,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=205710.0, ans=0.125 2023-06-18 19:11:06,949 INFO [train.py:996] (3/4) Epoch 2, batch 3800, loss[loss=0.3785, simple_loss=0.412, pruned_loss=0.1725, over 21803.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3555, pruned_loss=0.1225, over 4278549.39 frames. 
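The `grad_scale` field in the train entries (32.0 through this stretch, dropping to 16.0 in the batch-3850 entry below before recovering) is consistent with dynamic loss scaling for fp16 training: the scale is halved whenever scaled gradients overflow and grown again while training is stable. A generic torch.cuda.amp sketch; the model, optimizer, and loss_fn names are placeholders, and the run's real bookkeeping may differ:

# Generic mixed-precision step with dynamic loss scaling.
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

def train_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(batch))
    scaler.scale(loss).backward()   # backprop on the scaled loss
    scaler.step(optimizer)          # skips the update if inf/nan grads appear
    scaler.update()                 # halves the scale on overflow (32 -> 16),
                                    # grows it again after stable steps
    return loss.detach()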
], batch size: 441, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:11:50,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=205890.0, ans=0.125 2023-06-18 19:12:30,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=205950.0, ans=0.125 2023-06-18 19:12:49,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 3.000e+02 3.869e+02 4.812e+02 8.338e+02, threshold=7.738e+02, percent-clipped=10.0 2023-06-18 19:12:53,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-18 19:12:59,568 INFO [train.py:996] (3/4) Epoch 2, batch 3850, loss[loss=0.2524, simple_loss=0.2998, pruned_loss=0.1025, over 21566.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3528, pruned_loss=0.1219, over 4276943.54 frames. ], batch size: 231, lr: 1.90e-02, grad_scale: 16.0 2023-06-18 19:13:41,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=206130.0, ans=0.1 2023-06-18 19:13:53,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=206190.0, ans=0.0 2023-06-18 19:14:45,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=206310.0, ans=0.1 2023-06-18 19:14:54,898 INFO [train.py:996] (3/4) Epoch 2, batch 3900, loss[loss=0.2887, simple_loss=0.3397, pruned_loss=0.1189, over 21903.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3468, pruned_loss=0.1206, over 4284267.28 frames. ], batch size: 351, lr: 1.89e-02, grad_scale: 16.0 2023-06-18 19:16:05,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-18 19:16:19,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=206550.0, ans=0.125 2023-06-18 19:16:38,542 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:16:47,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=206550.0, ans=0.0 2023-06-18 19:17:07,612 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.842e+02 3.227e+02 4.001e+02 6.107e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-18 19:17:08,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-18 19:17:23,773 INFO [train.py:996] (3/4) Epoch 2, batch 3950, loss[loss=0.274, simple_loss=0.3461, pruned_loss=0.101, over 21647.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3494, pruned_loss=0.1196, over 4289415.73 frames. ], batch size: 414, lr: 1.89e-02, grad_scale: 16.0 2023-06-18 19:18:39,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=206850.0, ans=0.125 2023-06-18 19:18:47,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.68 vs. 
limit=15.0 2023-06-18 19:19:28,695 INFO [train.py:996] (3/4) Epoch 2, batch 4000, loss[loss=0.2261, simple_loss=0.2806, pruned_loss=0.0858, over 21377.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3455, pruned_loss=0.1166, over 4285168.87 frames. ], batch size: 131, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:19:50,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=206970.0, ans=0.0 2023-06-18 19:20:01,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=207030.0, ans=0.125 2023-06-18 19:20:14,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=207090.0, ans=0.2 2023-06-18 19:20:25,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=22.5 2023-06-18 19:21:10,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=207150.0, ans=0.125 2023-06-18 19:21:38,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.725e+02 3.231e+02 3.855e+02 7.242e+02, threshold=6.463e+02, percent-clipped=2.0 2023-06-18 19:21:42,612 INFO [train.py:996] (3/4) Epoch 2, batch 4050, loss[loss=0.3056, simple_loss=0.3778, pruned_loss=0.1167, over 21514.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3452, pruned_loss=0.115, over 4280183.29 frames. ], batch size: 471, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:21:54,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=207270.0, ans=0.2 2023-06-18 19:21:55,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.63 vs. limit=5.0 2023-06-18 19:22:18,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=207330.0, ans=0.0 2023-06-18 19:23:53,138 INFO [train.py:996] (3/4) Epoch 2, batch 4100, loss[loss=0.322, simple_loss=0.3585, pruned_loss=0.1427, over 20087.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3462, pruned_loss=0.1157, over 4281894.33 frames. ], batch size: 703, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:25:34,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=207810.0, ans=0.0 2023-06-18 19:25:49,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.985e+02 3.648e+02 4.743e+02 7.155e+02, threshold=7.295e+02, percent-clipped=4.0 2023-06-18 19:25:50,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=207810.0, ans=0.125 2023-06-18 19:26:12,885 INFO [train.py:996] (3/4) Epoch 2, batch 4150, loss[loss=0.2697, simple_loss=0.3244, pruned_loss=0.1075, over 16040.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3452, pruned_loss=0.1121, over 4266798.30 frames. ], batch size: 60, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:26:48,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=207990.0, ans=0.125 2023-06-18 19:27:49,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. 
limit=12.0 2023-06-18 19:27:52,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208110.0, ans=0.1 2023-06-18 19:28:14,150 INFO [train.py:996] (3/4) Epoch 2, batch 4200, loss[loss=0.2492, simple_loss=0.3034, pruned_loss=0.09751, over 21265.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3441, pruned_loss=0.1111, over 4270552.11 frames. ], batch size: 176, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:29:19,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=208290.0, ans=0.125 2023-06-18 19:29:27,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=208290.0, ans=0.125 2023-06-18 19:30:21,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-18 19:30:27,999 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 3.201e+02 3.847e+02 4.698e+02 6.915e+02, threshold=7.694e+02, percent-clipped=0.0 2023-06-18 19:30:32,579 INFO [train.py:996] (3/4) Epoch 2, batch 4250, loss[loss=0.3037, simple_loss=0.3665, pruned_loss=0.1205, over 21740.00 frames. ], tot_loss[loss=0.293, simple_loss=0.354, pruned_loss=0.116, over 4270962.43 frames. ], batch size: 247, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:30:54,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=208470.0, ans=0.2 2023-06-18 19:32:19,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=208650.0, ans=0.0 2023-06-18 19:32:24,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=208650.0, ans=0.1 2023-06-18 19:32:40,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=208710.0, ans=0.125 2023-06-18 19:32:47,719 INFO [train.py:996] (3/4) Epoch 2, batch 4300, loss[loss=0.3192, simple_loss=0.3605, pruned_loss=0.1389, over 20094.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3599, pruned_loss=0.1186, over 4273430.70 frames. ], batch size: 702, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:32:55,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-18 19:33:28,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-18 19:33:48,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208830.0, ans=0.1 2023-06-18 19:34:03,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.23 vs. limit=22.5 2023-06-18 19:34:21,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=208890.0, ans=0.125 2023-06-18 19:34:43,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.09 vs. 
limit=10.0 2023-06-18 19:35:14,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 3.172e+02 3.778e+02 4.372e+02 7.557e+02, threshold=7.556e+02, percent-clipped=0.0 2023-06-18 19:35:25,692 INFO [train.py:996] (3/4) Epoch 2, batch 4350, loss[loss=0.3128, simple_loss=0.3738, pruned_loss=0.1259, over 21054.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3563, pruned_loss=0.1165, over 4271399.23 frames. ], batch size: 608, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:35:27,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=209070.0, ans=0.0 2023-06-18 19:35:34,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=209070.0, ans=0.0 2023-06-18 19:36:27,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=209190.0, ans=0.125 2023-06-18 19:36:55,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=209310.0, ans=0.0 2023-06-18 19:37:38,250 INFO [train.py:996] (3/4) Epoch 2, batch 4400, loss[loss=0.3322, simple_loss=0.4029, pruned_loss=0.1307, over 21606.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3522, pruned_loss=0.1155, over 4261987.46 frames. ], batch size: 414, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:38:25,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=209430.0, ans=0.0 2023-06-18 19:38:26,267 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-06-18 19:38:33,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=209430.0, ans=0.125 2023-06-18 19:39:14,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=209550.0, ans=0.125 2023-06-18 19:39:25,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=209550.0, ans=0.1 2023-06-18 19:39:47,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.882e+02 3.350e+02 4.158e+02 7.095e+02, threshold=6.699e+02, percent-clipped=0.0 2023-06-18 19:39:57,952 INFO [train.py:996] (3/4) Epoch 2, batch 4450, loss[loss=0.3003, simple_loss=0.3595, pruned_loss=0.1206, over 21199.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3592, pruned_loss=0.1173, over 4260010.43 frames. 
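The learning rate decays smoothly across this span (1.94e-02 near batch 196k down to 1.88e-02 here), i.e. it is a slowly varying function of both batch and epoch rather than a stepped schedule. One plausible inverse-power family with that shape is sketched below; the constants are assumptions and the formula has not been checked against the printed values:

# Illustrative smooth schedule decaying in both batch and epoch.
def eden_like_lr(base_lr: float, batch: int, epoch: float,
                 lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

Both factors start near 1.0 and shrink slowly, which is why consecutive logging intervals only move the lr in the third significant digit.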
], batch size: 159, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:40:54,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=209790.0, ans=0.04949747468305833 2023-06-18 19:41:38,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=209910.0, ans=0.125 2023-06-18 19:42:11,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=209910.0, ans=0.125 2023-06-18 19:42:12,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=209910.0, ans=0.0 2023-06-18 19:42:14,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=209970.0, ans=0.0 2023-06-18 19:42:15,038 INFO [train.py:996] (3/4) Epoch 2, batch 4500, loss[loss=0.2808, simple_loss=0.3633, pruned_loss=0.09921, over 21412.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3608, pruned_loss=0.1189, over 4264977.28 frames. ], batch size: 211, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:42:20,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=209970.0, ans=0.125 2023-06-18 19:44:03,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-18 19:44:16,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.865e+02 3.524e+02 4.358e+02 7.119e+02, threshold=7.048e+02, percent-clipped=2.0 2023-06-18 19:44:38,592 INFO [train.py:996] (3/4) Epoch 2, batch 4550, loss[loss=0.2897, simple_loss=0.3645, pruned_loss=0.1074, over 21796.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.3648, pruned_loss=0.1193, over 4268021.72 frames. ], batch size: 282, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:45:16,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210330.0, ans=0.1 2023-06-18 19:45:19,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=210390.0, ans=0.125 2023-06-18 19:45:44,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=210390.0, ans=15.0 2023-06-18 19:46:42,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=210510.0, ans=0.0 2023-06-18 19:46:49,079 INFO [train.py:996] (3/4) Epoch 2, batch 4600, loss[loss=0.3035, simple_loss=0.3702, pruned_loss=0.1184, over 21685.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3701, pruned_loss=0.1226, over 4272976.21 frames. ], batch size: 389, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:46:58,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=210570.0, ans=10.0 2023-06-18 19:48:17,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-18 19:48:19,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. 
limit=15.0 2023-06-18 19:48:40,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=210750.0, ans=0.125 2023-06-18 19:48:59,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.115e+02 3.942e+02 4.901e+02 8.215e+02, threshold=7.883e+02, percent-clipped=5.0 2023-06-18 19:49:07,882 INFO [train.py:996] (3/4) Epoch 2, batch 4650, loss[loss=0.3381, simple_loss=0.3803, pruned_loss=0.1479, over 19974.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3644, pruned_loss=0.1193, over 4276045.43 frames. ], batch size: 702, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:49:18,498 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:50:13,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=210990.0, ans=0.0 2023-06-18 19:50:20,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=210990.0, ans=0.125 2023-06-18 19:50:58,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=211110.0, ans=0.07 2023-06-18 19:51:18,614 INFO [train.py:996] (3/4) Epoch 2, batch 4700, loss[loss=0.2652, simple_loss=0.3106, pruned_loss=0.1099, over 21494.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3536, pruned_loss=0.116, over 4269184.94 frames. ], batch size: 230, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 19:51:57,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=211290.0, ans=0.125 2023-06-18 19:52:28,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=211290.0, ans=0.125 2023-06-18 19:52:31,942 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2023-06-18 19:52:44,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=211350.0, ans=0.2 2023-06-18 19:53:12,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.814e+02 3.366e+02 4.246e+02 7.056e+02, threshold=6.733e+02, percent-clipped=0.0 2023-06-18 19:53:13,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=211410.0, ans=0.125 2023-06-18 19:53:17,078 INFO [train.py:996] (3/4) Epoch 2, batch 4750, loss[loss=0.2659, simple_loss=0.3127, pruned_loss=0.1096, over 21862.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3486, pruned_loss=0.1148, over 4266127.01 frames. ], batch size: 373, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 19:53:46,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=22.5 2023-06-18 19:54:57,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. limit=6.0 2023-06-18 19:55:31,825 INFO [train.py:996] (3/4) Epoch 2, batch 4800, loss[loss=0.263, simple_loss=0.3167, pruned_loss=0.1046, over 21646.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3506, pruned_loss=0.1172, over 4271779.20 frames. 
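The `WithLoss: name=...self_attn_weights, loss-sum=0.000e+00` lines read as an auxiliary penalty attached to the attention weights whose accumulated value is logged; with a zero weight, the running sum stays at 0.000e+00 exactly as printed. A toy wrapper with that shape; the wrapped module and the quadratic penalty are assumptions for illustration:

# Toy module wrapper that tracks an auxiliary loss on its output.
import torch

class WithAuxLoss(torch.nn.Module):
    def __init__(self, module: torch.nn.Module, penalty_weight: float = 0.0):
        super().__init__()
        self.module = module
        self.penalty_weight = penalty_weight
        self.loss_sum = 0.0   # accumulated penalty, as logged above

    def forward(self, x):
        y = self.module(x)
        aux = self.penalty_weight * y.pow(2).mean()
        self.loss_sum += float(aux.detach())
        return y, aux

wrapped = WithAuxLoss(torch.nn.Linear(4, 4), penalty_weight=0.0)
y, aux = wrapped(torch.randn(2, 4))
print(wrapped.loss_sum)   # stays 0.0, matching loss-sum=0.000e+00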
], batch size: 263, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 19:55:41,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211770.0, ans=0.1 2023-06-18 19:55:59,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=211830.0, ans=0.125 2023-06-18 19:56:30,002 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:56:46,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=211950.0, ans=0.125 2023-06-18 19:57:24,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.885e+02 3.329e+02 4.193e+02 7.094e+02, threshold=6.657e+02, percent-clipped=1.0 2023-06-18 19:57:28,257 INFO [train.py:996] (3/4) Epoch 2, batch 4850, loss[loss=0.2609, simple_loss=0.3171, pruned_loss=0.1024, over 21469.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3498, pruned_loss=0.1166, over 4274439.22 frames. ], batch size: 212, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 19:57:30,716 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-18 19:58:11,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=212130.0, ans=0.0 2023-06-18 19:58:34,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=212190.0, ans=0.125 2023-06-18 19:58:44,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=212250.0, ans=0.05 2023-06-18 19:59:11,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=212310.0, ans=0.0 2023-06-18 19:59:18,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=212310.0, ans=0.125 2023-06-18 19:59:36,625 INFO [train.py:996] (3/4) Epoch 2, batch 4900, loss[loss=0.3046, simple_loss=0.3818, pruned_loss=0.1137, over 21803.00 frames. ], tot_loss[loss=0.2938, simple_loss=0.3527, pruned_loss=0.1175, over 4286238.39 frames. ], batch size: 282, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 20:00:36,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=212490.0, ans=0.0 2023-06-18 20:00:37,892 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-18 20:01:48,727 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.116e+02 3.732e+02 4.478e+02 7.040e+02, threshold=7.463e+02, percent-clipped=1.0 2023-06-18 20:01:53,123 INFO [train.py:996] (3/4) Epoch 2, batch 4950, loss[loss=0.2461, simple_loss=0.3382, pruned_loss=0.07697, over 21592.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3545, pruned_loss=0.1154, over 4283854.58 frames. 
], batch size: 230, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 20:01:54,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=212670.0, ans=0.125 2023-06-18 20:02:22,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=212730.0, ans=0.125 2023-06-18 20:04:13,526 INFO [train.py:996] (3/4) Epoch 2, batch 5000, loss[loss=0.2402, simple_loss=0.3156, pruned_loss=0.08245, over 21218.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3505, pruned_loss=0.1104, over 4282666.91 frames. ], batch size: 176, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 20:04:36,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=213030.0, ans=0.035 2023-06-18 20:05:14,092 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:05:27,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=213150.0, ans=0.0 2023-06-18 20:05:28,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=22.5 2023-06-18 20:05:55,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=213210.0, ans=0.125 2023-06-18 20:06:07,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=213210.0, ans=0.025 2023-06-18 20:06:09,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.689e+02 3.169e+02 3.752e+02 6.029e+02, threshold=6.337e+02, percent-clipped=0.0 2023-06-18 20:06:19,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213270.0, ans=0.1 2023-06-18 20:06:20,640 INFO [train.py:996] (3/4) Epoch 2, batch 5050, loss[loss=0.2717, simple_loss=0.3295, pruned_loss=0.1069, over 21363.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3512, pruned_loss=0.1138, over 4282993.21 frames. ], batch size: 143, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 20:06:26,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213270.0, ans=0.1 2023-06-18 20:06:35,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-18 20:07:14,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-18 20:07:30,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=213390.0, ans=0.2 2023-06-18 20:08:31,988 INFO [train.py:996] (3/4) Epoch 2, batch 5100, loss[loss=0.2983, simple_loss=0.3504, pruned_loss=0.1231, over 21742.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3492, pruned_loss=0.115, over 4291280.17 frames. 
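The recurring `balancer*.prob`, `*.min_positive`, and `*.max_abs` entries track constraints on activation statistics: a per-channel minimum fraction of positive values and a cap on activation magnitudes, enforced only with probability `prob` on any given batch. A toy version of the checks such a balancer might apply; the gradient-nudging machinery is omitted, and the thresholds below are the ones printed in these entries (0.05 and 10.0):

# Toy activation-balancer checks over a (frames, channels) activation tensor.
import torch

def balancer_violations(x: torch.Tensor,
                        min_positive: float = 0.05,
                        max_abs: float = 10.0):
    pos_frac = (x > 0).float().mean(dim=0)      # per-channel positive rate
    too_negative = pos_frac < min_positive      # channels almost never positive
    too_large = x.abs().amax(dim=0) > max_abs   # channels with huge magnitudes
    return too_negative, too_large

The `prob` values (0.125 in most entries here) would then gate how often these checks, and the corresponding gradient corrections, are actually applied.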
], batch size: 441, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:10:34,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.968e+02 3.545e+02 4.685e+02 7.236e+02, threshold=7.090e+02, percent-clipped=4.0 2023-06-18 20:10:35,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=22.5 2023-06-18 20:10:43,522 INFO [train.py:996] (3/4) Epoch 2, batch 5150, loss[loss=0.3988, simple_loss=0.4322, pruned_loss=0.1827, over 21625.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3487, pruned_loss=0.1164, over 4297437.43 frames. ], batch size: 508, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:11:37,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=213990.0, ans=0.125 2023-06-18 20:11:43,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-06-18 20:11:57,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=213990.0, ans=0.125 2023-06-18 20:12:18,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-18 20:12:36,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=214050.0, ans=0.125 2023-06-18 20:13:01,829 INFO [train.py:996] (3/4) Epoch 2, batch 5200, loss[loss=0.3464, simple_loss=0.4232, pruned_loss=0.1348, over 21692.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3471, pruned_loss=0.1154, over 4293117.72 frames. ], batch size: 414, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:13:14,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=214170.0, ans=0.0 2023-06-18 20:13:42,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=214230.0, ans=0.125 2023-06-18 20:14:44,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-06-18 20:14:59,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.964e+02 3.541e+02 4.269e+02 6.155e+02, threshold=7.082e+02, percent-clipped=0.0 2023-06-18 20:15:04,631 INFO [train.py:996] (3/4) Epoch 2, batch 5250, loss[loss=0.2461, simple_loss=0.3247, pruned_loss=0.08373, over 21495.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3483, pruned_loss=0.1126, over 4284758.34 frames. 
], batch size: 195, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:15:13,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=214470.0, ans=0.125 2023-06-18 20:15:20,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214470.0, ans=0.1 2023-06-18 20:16:48,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214650.0, ans=0.1 2023-06-18 20:17:04,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=214710.0, ans=0.125 2023-06-18 20:17:16,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214710.0, ans=0.1 2023-06-18 20:17:18,956 INFO [train.py:996] (3/4) Epoch 2, batch 5300, loss[loss=0.3487, simple_loss=0.3804, pruned_loss=0.1585, over 21828.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3491, pruned_loss=0.1149, over 4290053.10 frames. ], batch size: 441, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:17:47,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=214770.0, ans=0.0 2023-06-18 20:18:16,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-18 20:18:20,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.39 vs. limit=22.5 2023-06-18 20:18:24,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=214890.0, ans=0.1 2023-06-18 20:18:26,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.12 vs. limit=6.0 2023-06-18 20:19:17,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.854e+02 3.243e+02 3.685e+02 6.916e+02, threshold=6.486e+02, percent-clipped=0.0 2023-06-18 20:19:21,305 INFO [train.py:996] (3/4) Epoch 2, batch 5350, loss[loss=0.3482, simple_loss=0.377, pruned_loss=0.1597, over 21637.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3493, pruned_loss=0.1171, over 4298416.18 frames. ], batch size: 471, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:19:45,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=215070.0, ans=0.2 2023-06-18 20:20:40,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=215190.0, ans=0.0 2023-06-18 20:21:03,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=215250.0, ans=0.125 2023-06-18 20:21:05,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=215250.0, ans=0.125 2023-06-18 20:21:41,617 INFO [train.py:996] (3/4) Epoch 2, batch 5400, loss[loss=0.3459, simple_loss=0.4361, pruned_loss=0.1279, over 19856.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3485, pruned_loss=0.1185, over 4304948.07 frames. 
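Batch sizes in the loss entries swing widely (from 124 up to 702 cuts) while the per-batch frame counts stay in the same ~20k range, which is the signature of duration-capped batching: each batch is filled up to a fixed total duration, so batches of short cuts contain many more of them. A minimal sketch of that policy; `max_frames` is illustrative:

# Fill batches up to a fixed total frame budget, whatever the cut count.
def duration_capped_batches(cut_frames, max_frames: int = 22000):
    # cut_frames: iterable of (cut_id, num_frames) pairs
    batch, total = [], 0
    for cut_id, n in cut_frames:
        if total + n > max_frames and batch:
            yield batch
            batch, total = [], 0
        batch.append(cut_id)
        total += n
    if batch:
        yield batch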
2023-06-18 20:21:41,617 INFO [train.py:996] (3/4) Epoch 2, batch 5400, loss[loss=0.3459, simple_loss=0.4361, pruned_loss=0.1279, over 19856.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3485, pruned_loss=0.1185, over 4304948.07 frames. ], batch size: 702, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 20:21:51,979 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 20:22:03,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=215370.0, ans=0.0
2023-06-18 20:23:23,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=215550.0, ans=0.125
2023-06-18 20:23:23,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=215550.0, ans=0.125
2023-06-18 20:23:40,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.91 vs. limit=22.5
2023-06-18 20:23:43,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 2.870e+02 3.526e+02 4.433e+02 7.880e+02, threshold=7.051e+02, percent-clipped=2.0
2023-06-18 20:24:07,770 INFO [train.py:996] (3/4) Epoch 2, batch 5450, loss[loss=0.3077, simple_loss=0.3942, pruned_loss=0.1106, over 20757.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3495, pruned_loss=0.1151, over 4298568.12 frames. ], batch size: 607, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 20:24:14,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=215670.0, ans=0.125
2023-06-18 20:25:16,683 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 20:25:29,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215850.0, ans=0.1
2023-06-18 20:25:55,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=215850.0, ans=0.125
2023-06-18 20:25:56,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.99 vs. limit=10.0
2023-06-18 20:25:56,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=215850.0, ans=0.125
2023-06-18 20:26:26,990 INFO [train.py:996] (3/4) Epoch 2, batch 5500, loss[loss=0.3864, simple_loss=0.4371, pruned_loss=0.1678, over 21453.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3559, pruned_loss=0.1116, over 4285341.41 frames. ], batch size: 507, lr: 1.85e-02, grad_scale: 32.0
2023-06-18 20:27:05,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=216030.0, ans=0.125
2023-06-18 20:27:14,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=216030.0, ans=0.0
2023-06-18 20:27:15,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=216030.0, ans=0.125
2023-06-18 20:27:18,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=216090.0, ans=0.0
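The frame counts inside tot_loss[...] are not integers (e.g. "over 4298568.12 frames." at batch 5450 above), which indicates the running statistics are geometrically decayed sums rather than plain totals: each batch, the accumulated sums shrink by a constant factor before the new batch's sums are added, and every displayed value is a ratio of two such sums. A sketch, with the decay constant chosen for illustration:

```python
def update_running_stats(tot: dict, batch: dict,
                         reset_interval: int = 200) -> dict:
    """Decay old sums by (1 - 1/reset_interval), then add this batch's sums.
    The decay is what makes the logged frame counts fractional; displayed
    losses are ratios such as tot["loss"] / tot["frames"]."""
    decay = 1.0 - 1.0 / reset_interval
    return {k: tot.get(k, 0.0) * decay + v for k, v in batch.items()}
```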
2023-06-18 20:27:20,909 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0
2023-06-18 20:27:46,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=216150.0, ans=0.0
2023-06-18 20:27:54,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=216150.0, ans=0.5
2023-06-18 20:28:51,967 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.788e+02 3.266e+02 4.093e+02 7.420e+02, threshold=6.532e+02, percent-clipped=2.0
2023-06-18 20:29:02,102 INFO [train.py:996] (3/4) Epoch 2, batch 5550, loss[loss=0.2049, simple_loss=0.2783, pruned_loss=0.0657, over 21174.00 frames. ], tot_loss[loss=0.281, simple_loss=0.3497, pruned_loss=0.1062, over 4279063.17 frames. ], batch size: 159, lr: 1.85e-02, grad_scale: 32.0
2023-06-18 20:29:22,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=216330.0, ans=0.0
2023-06-18 20:30:49,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=216510.0, ans=0.0
2023-06-18 20:31:13,791 INFO [train.py:996] (3/4) Epoch 2, batch 5600, loss[loss=0.334, simple_loss=0.4169, pruned_loss=0.1255, over 21651.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3453, pruned_loss=0.1026, over 4284369.29 frames. ], batch size: 414, lr: 1.85e-02, grad_scale: 32.0
2023-06-18 20:31:43,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216630.0, ans=0.1
2023-06-18 20:32:34,969 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.00 vs. limit=10.0
2023-06-18 20:32:37,503 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 20:33:03,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=216810.0, ans=10.0
2023-06-18 20:33:05,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.046e+02 3.574e+02 4.352e+02 8.183e+02, threshold=7.147e+02, percent-clipped=1.0
2023-06-18 20:33:18,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=216810.0, ans=0.0
2023-06-18 20:33:20,542 INFO [train.py:996] (3/4) Epoch 2, batch 5650, loss[loss=0.3046, simple_loss=0.3549, pruned_loss=0.1271, over 21867.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3516, pruned_loss=0.1056, over 4282770.19 frames. ], batch size: 124, lr: 1.85e-02, grad_scale: 32.0
2023-06-18 20:33:50,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0
2023-06-18 20:34:02,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=216990.0, ans=0.125
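Every [scaling.py:182] line reports one scheduled hyper-parameter: its dotted name inside the model, the current batch_count, and its present value ans (dropout probabilities settling at 0.1, skip rates at 0.0, balancer probabilities at 0.125, and so on). A plausible minimal reimplementation is piecewise-linear interpolation over (batch_count, value) breakpoints; the breakpoints in the example are invented for illustration:

```python
def scheduled_float(batch_count: float, points) -> float:
    """points: ascending (batch_count, value) breakpoints; interpolate
    linearly between them and clamp outside the first/last breakpoint."""
    if batch_count <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if batch_count <= x1:
            return y0 + (batch_count - x0) * (y1 - y0) / (x1 - x0)
    return points[-1][1]

# A skip-rate that has long since decayed to its final value:
scheduled_float(216330.0, [(0.0, 0.5), (4000.0, 0.05), (16000.0, 0.0)])  # -> 0.0
```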
2023-06-18 20:35:31,731 INFO [train.py:996] (3/4) Epoch 2, batch 5700, loss[loss=0.2755, simple_loss=0.3592, pruned_loss=0.0959, over 21782.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3527, pruned_loss=0.1083, over 4286136.25 frames. ], batch size: 371, lr: 1.85e-02, grad_scale: 32.0
2023-06-18 20:35:33,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=217170.0, ans=0.04949747468305833
2023-06-18 20:35:59,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=217170.0, ans=0.125
2023-06-18 20:36:04,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=217170.0, ans=0.05
2023-06-18 20:36:36,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=217230.0, ans=0.125
2023-06-18 20:37:51,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 3.053e+02 4.055e+02 5.336e+02 8.968e+02, threshold=8.109e+02, percent-clipped=6.0
2023-06-18 20:37:55,691 INFO [train.py:996] (3/4) Epoch 2, batch 5750, loss[loss=0.396, simple_loss=0.4947, pruned_loss=0.1487, over 19739.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3512, pruned_loss=0.1057, over 4275374.87 frames. ], batch size: 702, lr: 1.85e-02, grad_scale: 32.0
2023-06-18 20:39:24,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=217590.0, ans=0.025
2023-06-18 20:40:27,367 INFO [train.py:996] (3/4) Epoch 2, batch 5800, loss[loss=0.3275, simple_loss=0.4107, pruned_loss=0.1221, over 21656.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.35, pruned_loss=0.1043, over 4267856.77 frames. ], batch size: 414, lr: 1.85e-02, grad_scale: 32.0
2023-06-18 20:40:42,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=217770.0, ans=0.125
2023-06-18 20:40:42,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=217770.0, ans=0.125
2023-06-18 20:42:37,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 2.376e+02 3.080e+02 4.315e+02 9.402e+02, threshold=6.161e+02, percent-clipped=2.0
2023-06-18 20:42:38,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=218010.0, ans=0.125
2023-06-18 20:42:42,062 INFO [train.py:996] (3/4) Epoch 2, batch 5850, loss[loss=0.2148, simple_loss=0.3076, pruned_loss=0.06096, over 21732.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3445, pruned_loss=0.09833, over 4268599.79 frames. ], batch size: 298, lr: 1.85e-02, grad_scale: 64.0
2023-06-18 20:44:17,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=218250.0, ans=0.0
2023-06-18 20:44:52,892 INFO [train.py:996] (3/4) Epoch 2, batch 5900, loss[loss=0.2928, simple_loss=0.3493, pruned_loss=0.1181, over 21878.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3345, pruned_loss=0.09097, over 4272191.82 frames. ], batch size: 371, lr: 1.84e-02, grad_scale: 32.0
2023-06-18 20:45:40,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=218430.0, ans=0.125
2023-06-18 20:46:01,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=218490.0, ans=0.125
2023-06-18 20:46:08,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=218550.0, ans=0.0
2023-06-18 20:46:10,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=218550.0, ans=0.125
2023-06-18 20:46:23,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=218610.0, ans=0.125
2023-06-18 20:46:52,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 2.321e+02 3.028e+02 3.868e+02 8.968e+02, threshold=6.057e+02, percent-clipped=3.0
2023-06-18 20:46:57,509 INFO [train.py:996] (3/4) Epoch 2, batch 5950, loss[loss=0.2751, simple_loss=0.3225, pruned_loss=0.1139, over 21845.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3356, pruned_loss=0.09553, over 4280080.64 frames. ], batch size: 351, lr: 1.84e-02, grad_scale: 32.0
2023-06-18 20:47:48,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=218790.0, ans=0.125
2023-06-18 20:47:50,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=218790.0, ans=0.2
2023-06-18 20:48:52,201 INFO [train.py:996] (3/4) Epoch 2, batch 6000, loss[loss=0.2579, simple_loss=0.3107, pruned_loss=0.1026, over 21788.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3332, pruned_loss=0.1002, over 4282635.57 frames. ], batch size: 102, lr: 1.84e-02, grad_scale: 32.0
2023-06-18 20:48:52,202 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-18 20:49:44,683 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.9964, 2.1299, 2.2061, 2.7828, 1.3954, 2.6721, 2.4631, 1.6599], device='cuda:3')
2023-06-18 20:49:47,780 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2855, simple_loss=0.3796, pruned_loss=0.09574, over 1796401.00 frames.
2023-06-18 20:49:47,782 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-18 20:49:56,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=218970.0, ans=0.1
2023-06-18 20:50:00,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.38 vs. limit=10.0
2023-06-18 20:51:07,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=219210.0, ans=0.125
2023-06-18 20:51:24,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=219210.0, ans=0.0
2023-06-18 20:51:42,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.988e+02 3.378e+02 4.160e+02 8.273e+02, threshold=6.755e+02, percent-clipped=6.0
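Batch 6000 above interleaves a validation pass with training: train.py:1019 announces it, zipformer.py:1728 dumps one attention-entropy value per head as a diagnostic (the per-head layout is inferred from the 8-element tensor), and train.py:1029 reports the peak CUDA memory. A sketch of both pieces; compute_loss and the dataloader are placeholders:

```python
import logging
import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    """attn: (num_heads, batch, num_queries, num_keys) with rows summing
    to 1; returns the mean attention entropy per head."""
    p = attn.clamp(min=1.0e-20)
    ent = -(p * p.log()).sum(dim=-1)        # (heads, batch, queries)
    return ent.flatten(start_dim=1).mean(dim=1)

def compute_validation_loss(model, valid_dl, device, compute_loss):
    model.eval()
    tot = {}
    with torch.no_grad():
        for batch in valid_dl:
            for k, v in compute_loss(model, batch).items():
                tot[k] = tot.get(k, 0.0) + v
    model.train()
    mem_mb = torch.cuda.max_memory_allocated(device) // (2 ** 20)
    logging.info(f"Maximum memory allocated so far is {mem_mb}MB")
    return tot
```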
2023-06-18 20:51:45,881 INFO [train.py:996] (3/4) Epoch 2, batch 6050, loss[loss=0.2258, simple_loss=0.2822, pruned_loss=0.08473, over 21459.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3288, pruned_loss=0.1026, over 4276435.08 frames. ], batch size: 132, lr: 1.84e-02, grad_scale: 32.0
2023-06-18 20:52:24,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=219330.0, ans=0.2
2023-06-18 20:52:49,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=219390.0, ans=0.125
2023-06-18 20:53:09,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=219450.0, ans=0.2
2023-06-18 20:53:43,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=219510.0, ans=0.025
2023-06-18 20:53:45,889 INFO [train.py:996] (3/4) Epoch 2, batch 6100, loss[loss=0.22, simple_loss=0.2965, pruned_loss=0.07173, over 21456.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3281, pruned_loss=0.1014, over 4278769.54 frames. ], batch size: 194, lr: 1.84e-02, grad_scale: 16.0
2023-06-18 20:53:46,909 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.35 vs. limit=10.0
2023-06-18 20:54:12,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=219630.0, ans=0.0
2023-06-18 20:54:24,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=219630.0, ans=0.125
2023-06-18 20:54:28,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=219630.0, ans=0.125
2023-06-18 20:55:54,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 3.086e+02 4.066e+02 5.102e+02 8.027e+02, threshold=8.133e+02, percent-clipped=8.0
2023-06-18 20:55:56,244 INFO [train.py:996] (3/4) Epoch 2, batch 6150, loss[loss=0.2393, simple_loss=0.3028, pruned_loss=0.08792, over 21519.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3324, pruned_loss=0.1057, over 4275512.38 frames. ], batch size: 212, lr: 1.84e-02, grad_scale: 16.0
2023-06-18 20:56:09,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=219870.0, ans=0.05
2023-06-18 20:56:48,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=219990.0, ans=0.0
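grad_scale in the train.py lines is the dynamic fp16 loss scale: it doubled to 64.0 at batch 5850, was halved back to 32.0 by batch 5900, and sits at 16.0 from batch 6100 on, the signature of halve-on-overflow, double-after-a-quiet-run loss scaling. A sketch of that update rule; the growth interval is an assumed value, not taken from this run:

```python
def step_grad_scale(scale: float, found_inf: bool, good_steps: int,
                    growth_interval: int = 2000):
    """Returns (new_scale, new_good_steps). Halve the scale (and skip the
    optimizer step) when the scaled grads overflow; double it again after
    growth_interval consecutive finite steps."""
    if found_inf:
        return scale * 0.5, 0
    good_steps += 1
    if good_steps >= growth_interval:
        return scale * 2.0, 0
    return scale, good_steps
```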
2023-06-18 20:58:07,514 INFO [train.py:996] (3/4) Epoch 2, batch 6200, loss[loss=0.3402, simple_loss=0.4003, pruned_loss=0.1401, over 21894.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.336, pruned_loss=0.1068, over 4273834.66 frames. ], batch size: 416, lr: 1.84e-02, grad_scale: 16.0
2023-06-18 20:58:07,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=220170.0, ans=0.0
2023-06-18 20:58:19,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=220170.0, ans=0.0
2023-06-18 20:58:38,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=220230.0, ans=0.0
2023-06-18 20:58:52,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=220290.0, ans=0.0
2023-06-18 20:59:43,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=22.5
2023-06-18 20:59:59,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220350.0, ans=0.1
2023-06-18 21:00:12,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.39 vs. limit=10.0
2023-06-18 21:00:27,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.590e+02 3.175e+02 3.801e+02 7.451e+02, threshold=6.350e+02, percent-clipped=0.0
2023-06-18 21:00:28,869 INFO [train.py:996] (3/4) Epoch 2, batch 6250, loss[loss=0.2634, simple_loss=0.3556, pruned_loss=0.0856, over 21632.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3399, pruned_loss=0.1058, over 4272561.50 frames. ], batch size: 230, lr: 1.84e-02, grad_scale: 16.0
2023-06-18 21:01:17,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=220590.0, ans=0.125
2023-06-18 21:02:16,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220710.0, ans=0.1
2023-06-18 21:02:38,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220710.0, ans=0.1
2023-06-18 21:02:42,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=220770.0, ans=0.0
2023-06-18 21:02:43,591 INFO [train.py:996] (3/4) Epoch 2, batch 6300, loss[loss=0.2781, simple_loss=0.3338, pruned_loss=0.1112, over 21852.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3439, pruned_loss=0.1055, over 4277547.40 frames. ], batch size: 298, lr: 1.83e-02, grad_scale: 16.0
2023-06-18 21:02:44,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220770.0, ans=0.1
2023-06-18 21:03:02,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=220830.0, ans=0.0
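The learning rate decays only gently within the epoch (1.86e-02 around batch 5200 down to 1.83e-02 by batch 6300) while also stepping down across epochs, which matches an Eden-style schedule with separate batch and epoch factors. A sketch; the constants below (lr_batches, lr_epochs, the -0.25 exponents) are assumed, not read from this log:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    """lr shrinks smoothly in both the batch index and the epoch index;
    early on both factors are ~1, and each decays like x**-0.5 for large x."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
```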
2023-06-18 21:03:32,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.35 vs. limit=22.5
2023-06-18 21:03:55,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=220950.0, ans=0.125
2023-06-18 21:04:17,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=220950.0, ans=0.0
2023-06-18 21:04:18,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=220950.0, ans=0.125
2023-06-18 21:04:31,649 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0
2023-06-18 21:04:32,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=221010.0, ans=0.07
2023-06-18 21:04:39,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.847e+02 3.430e+02 4.458e+02 8.729e+02, threshold=6.860e+02, percent-clipped=4.0
2023-06-18 21:04:40,736 INFO [train.py:996] (3/4) Epoch 2, batch 6350, loss[loss=0.32, simple_loss=0.3725, pruned_loss=0.1338, over 21813.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3483, pruned_loss=0.1102, over 4277897.04 frames. ], batch size: 282, lr: 1.83e-02, grad_scale: 16.0
2023-06-18 21:04:43,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=22.5
2023-06-18 21:05:05,452 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0
2023-06-18 21:05:56,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=221190.0, ans=0.125
2023-06-18 21:05:57,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=221190.0, ans=0.0
2023-06-18 21:06:29,979 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 21:06:31,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=221250.0, ans=0.2
2023-06-18 21:06:34,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=221250.0, ans=0.125
2023-06-18 21:06:46,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=221310.0, ans=0.125
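Each [scaling.py:962] Whitening line compares a per-module metric against a scheduled limit. A plausible reconstruction of the metric: per channel group, the mean diagonal of C @ C divided by the squared mean diagonal of C, where C is the covariance of the activations; this equals 1.0 exactly when the covariance is isotropic (fully "white") and grows as the eigenvalue spectrum spreads. The real scaling.py may differ in details:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """x: (..., num_channels) activations. Returns a scalar >= 1 that is
    1.0 iff each group's covariance is a multiple of the identity."""
    x = x.reshape(-1, x.shape[-1])
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    cpg = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)   # (g, f, c)
    covar = torch.matmul(x.transpose(1, 2), x) / num_frames      # (g, c, c)
    mean_diag = covar.diagonal(dim1=-2, dim2=-1).mean()
    mean_diag_sq = torch.matmul(covar, covar).diagonal(dim1=-2, dim2=-1).mean()
    return mean_diag_sq / (mean_diag ** 2 + 1.0e-20)
```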
2023-06-18 21:07:00,533 INFO [train.py:996] (3/4) Epoch 2, batch 6400, loss[loss=0.3689, simple_loss=0.406, pruned_loss=0.1659, over 21784.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3556, pruned_loss=0.1164, over 4278976.20 frames. ], batch size: 441, lr: 1.83e-02, grad_scale: 32.0
2023-06-18 21:07:04,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=221370.0, ans=0.125
2023-06-18 21:08:12,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=221490.0, ans=0.0
2023-06-18 21:08:21,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=221490.0, ans=0.125
2023-06-18 21:08:23,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=221490.0, ans=0.1
2023-06-18 21:08:25,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0
2023-06-18 21:08:46,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=221550.0, ans=0.125
2023-06-18 21:09:08,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=221610.0, ans=0.125
2023-06-18 21:09:11,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=221610.0, ans=0.125
2023-06-18 21:09:14,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=221610.0, ans=0.04949747468305833
2023-06-18 21:09:22,170 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.900e+02 3.291e+02 3.845e+02 6.146e+02, threshold=6.583e+02, percent-clipped=0.0
2023-06-18 21:09:22,193 INFO [train.py:996] (3/4) Epoch 2, batch 6450, loss[loss=0.2721, simple_loss=0.3352, pruned_loss=0.1045, over 21260.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3563, pruned_loss=0.1147, over 4282734.30 frames. ], batch size: 548, lr: 1.83e-02, grad_scale: 16.0
2023-06-18 21:11:12,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=221910.0, ans=0.05
2023-06-18 21:11:21,742 INFO [train.py:996] (3/4) Epoch 2, batch 6500, loss[loss=0.2398, simple_loss=0.2935, pruned_loss=0.09301, over 21564.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3463, pruned_loss=0.1122, over 4280890.53 frames. ], batch size: 263, lr: 1.83e-02, grad_scale: 16.0
2023-06-18 21:11:48,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=222030.0, ans=0.125
2023-06-18 21:12:01,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=222030.0, ans=0.125
2023-06-18 21:12:46,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=222150.0, ans=0.0
2023-06-18 21:13:21,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222210.0, ans=0.1
2023-06-18 21:13:40,650 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.666e+02 3.386e+02 4.253e+02 7.478e+02, threshold=6.772e+02, percent-clipped=2.0
2023-06-18 21:13:40,674 INFO [train.py:996] (3/4) Epoch 2, batch 6550, loss[loss=0.3218, simple_loss=0.3761, pruned_loss=0.1337, over 21712.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3443, pruned_loss=0.1108, over 4289061.18 frames. ], batch size: 441, lr: 1.83e-02, grad_scale: 16.0
2023-06-18 21:13:47,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=222270.0, ans=0.125
2023-06-18 21:14:14,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0
2023-06-18 21:14:46,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0
2023-06-18 21:14:50,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=222450.0, ans=0.125
2023-06-18 21:14:53,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=222450.0, ans=0.0
2023-06-18 21:15:40,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=222510.0, ans=0.125
2023-06-18 21:15:44,878 INFO [train.py:996] (3/4) Epoch 2, batch 6600, loss[loss=0.2472, simple_loss=0.2884, pruned_loss=0.1031, over 21522.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3419, pruned_loss=0.1119, over 4278433.18 frames. ], batch size: 230, lr: 1.83e-02, grad_scale: 16.0
2023-06-18 21:15:56,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=222570.0, ans=0.2
2023-06-18 21:16:05,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=222570.0, ans=0.95
2023-06-18 21:16:14,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=222630.0, ans=0.125
2023-06-18 21:16:24,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=222630.0, ans=0.125
2023-06-18 21:17:19,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=222750.0, ans=0.0
2023-06-18 21:17:42,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.521e+02 3.069e+02 3.605e+02 6.340e+02, threshold=6.138e+02, percent-clipped=0.0
2023-06-18 21:17:42,557 INFO [train.py:996] (3/4) Epoch 2, batch 6650, loss[loss=0.247, simple_loss=0.3045, pruned_loss=0.0948, over 21974.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3341, pruned_loss=0.1084, over 4271260.25 frames. ], batch size: 103, lr: 1.83e-02, grad_scale: 16.0
2023-06-18 21:18:44,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=222990.0, ans=0.0
2023-06-18 21:19:51,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=223110.0, ans=0.125
2023-06-18 21:19:53,887 INFO [train.py:996] (3/4) Epoch 2, batch 6700, loss[loss=0.318, simple_loss=0.3621, pruned_loss=0.1369, over 21540.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3292, pruned_loss=0.1075, over 4278514.95 frames. ], batch size: 442, lr: 1.82e-02, grad_scale: 16.0
2023-06-18 21:20:40,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=223290.0, ans=0.0
2023-06-18 21:21:59,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.168e+02 3.639e+02 4.397e+02 6.633e+02, threshold=7.278e+02, percent-clipped=4.0
2023-06-18 21:21:59,681 INFO [train.py:996] (3/4) Epoch 2, batch 6750, loss[loss=0.3176, simple_loss=0.3445, pruned_loss=0.1453, over 21346.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.328, pruned_loss=0.1076, over 4277776.93 frames. ], batch size: 473, lr: 1.82e-02, grad_scale: 16.0
2023-06-18 21:22:57,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=223590.0, ans=0.0
2023-06-18 21:24:14,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=223770.0, ans=0.2
2023-06-18 21:24:15,909 INFO [train.py:996] (3/4) Epoch 2, batch 6800, loss[loss=0.2793, simple_loss=0.3273, pruned_loss=0.1157, over 21790.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.331, pruned_loss=0.1099, over 4273420.66 frames. ], batch size: 300, lr: 1.82e-02, grad_scale: 32.0
2023-06-18 21:25:11,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=223890.0, ans=0.015
2023-06-18 21:25:20,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=12.0
2023-06-18 21:25:23,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=223950.0, ans=0.5
2023-06-18 21:25:29,074 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 21:26:06,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.622e+02 3.115e+02 3.857e+02 6.286e+02, threshold=6.230e+02, percent-clipped=0.0
2023-06-18 21:26:06,268 INFO [train.py:996] (3/4) Epoch 2, batch 6850, loss[loss=0.2899, simple_loss=0.3242, pruned_loss=0.1278, over 21276.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.329, pruned_loss=0.1114, over 4280655.05 frames. ], batch size: 176, lr: 1.82e-02, grad_scale: 32.0
2023-06-18 21:26:53,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0
2023-06-18 21:27:46,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=224310.0, ans=0.0
2023-06-18 21:27:59,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=12.0
2023-06-18 21:28:11,350 INFO [train.py:996] (3/4) Epoch 2, batch 6900, loss[loss=0.2564, simple_loss=0.3092, pruned_loss=0.1018, over 21428.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3309, pruned_loss=0.1124, over 4289309.21 frames. ], batch size: 194, lr: 1.82e-02, grad_scale: 32.0
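The balancer fields logged throughout (prob, min_positive, max_positive, min_abs, max_abs) describe per-channel activation constraints: the fraction of positive values and the mean absolute value of each channel are nudged back into a target range by an occasional gradient penalty, applied on a given step with probability prob (0.125 in most entries here). A sketch of the quantity being constrained, with default bounds taken from ans values in this log; the penalty mechanics themselves are omitted:

```python
import torch

def balancer_violation(x: torch.Tensor,
                       min_positive: float = 0.05, max_positive: float = 0.95,
                       min_abs: float = 0.5, max_abs: float = 10.0) -> torch.Tensor:
    """x: (num_frames, num_channels). Per-channel distance outside the
    allowed ranges for the positive fraction and the mean |x|; zero when
    the channel already satisfies both constraints."""
    frac_pos = (x > 0).float().mean(dim=0)
    mean_abs = x.abs().mean(dim=0)
    return ((min_positive - frac_pos).clamp(min=0.0)
            + (frac_pos - max_positive).clamp(min=0.0)
            + (min_abs - mean_abs).clamp(min=0.0)
            + (mean_abs - max_abs).clamp(min=0.0))
```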
2023-06-18 21:28:30,572 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0
2023-06-18 21:29:29,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=224490.0, ans=15.0
2023-06-18 21:30:02,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=224550.0, ans=15.0
2023-06-18 21:30:47,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.868e+02 2.531e+02 2.951e+02 3.498e+02 5.383e+02, threshold=5.901e+02, percent-clipped=0.0
2023-06-18 21:30:47,644 INFO [train.py:996] (3/4) Epoch 2, batch 6950, loss[loss=0.3042, simple_loss=0.3633, pruned_loss=0.1225, over 21691.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3327, pruned_loss=0.1092, over 4292544.96 frames. ], batch size: 298, lr: 1.82e-02, grad_scale: 32.0
2023-06-18 21:30:48,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=224670.0, ans=0.0
2023-06-18 21:30:48,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=224670.0, ans=0.2
2023-06-18 21:31:33,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=224790.0, ans=0.125
2023-06-18 21:32:17,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=224850.0, ans=0.125
2023-06-18 21:32:19,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=224850.0, ans=0.0
2023-06-18 21:32:49,128 INFO [train.py:996] (3/4) Epoch 2, batch 7000, loss[loss=0.3027, simple_loss=0.3405, pruned_loss=0.1325, over 21459.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.3376, pruned_loss=0.1128, over 4291327.44 frames. ], batch size: 389, lr: 1.82e-02, grad_scale: 32.0
2023-06-18 21:32:54,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=224970.0, ans=0.015
2023-06-18 21:32:57,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224970.0, ans=0.1
2023-06-18 21:33:00,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=224970.0, ans=0.125
2023-06-18 21:33:06,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=225030.0, ans=0.125
2023-06-18 21:33:19,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=225030.0, ans=0.125
2023-06-18 21:33:51,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=225090.0, ans=0.2
2023-06-18 21:34:56,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.32 vs. limit=10.0
2023-06-18 21:34:58,012 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 21:35:00,447 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.871e+02 3.500e+02 4.345e+02 7.832e+02, threshold=7.000e+02, percent-clipped=4.0
2023-06-18 21:35:00,471 INFO [train.py:996] (3/4) Epoch 2, batch 7050, loss[loss=0.3454, simple_loss=0.4492, pruned_loss=0.1208, over 19726.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3346, pruned_loss=0.1101, over 4281073.63 frames. ], batch size: 702, lr: 1.82e-02, grad_scale: 32.0
2023-06-18 21:35:35,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=225330.0, ans=0.125
2023-06-18 21:35:41,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=225330.0, ans=0.0
2023-06-18 21:35:42,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=225330.0, ans=0.125
2023-06-18 21:37:14,080 INFO [train.py:996] (3/4) Epoch 2, batch 7100, loss[loss=0.3008, simple_loss=0.3624, pruned_loss=0.1196, over 21814.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3402, pruned_loss=0.112, over 4284758.19 frames. ], batch size: 118, lr: 1.82e-02, grad_scale: 32.0
2023-06-18 21:37:37,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=225630.0, ans=0.2
2023-06-18 21:37:39,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=225630.0, ans=0.07
2023-06-18 21:37:40,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=225630.0, ans=0.1
2023-06-18 21:38:16,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=225690.0, ans=0.125
2023-06-18 21:39:17,172 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.803e+02 3.580e+02 4.596e+02 1.041e+03, threshold=7.161e+02, percent-clipped=7.0
2023-06-18 21:39:17,196 INFO [train.py:996] (3/4) Epoch 2, batch 7150, loss[loss=0.3267, simple_loss=0.3805, pruned_loss=0.1364, over 21594.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3368, pruned_loss=0.109, over 4275585.24 frames. ], batch size: 389, lr: 1.81e-02, grad_scale: 32.0
2023-06-18 21:40:27,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=225990.0, ans=0.125
2023-06-18 21:41:23,731 INFO [train.py:996] (3/4) Epoch 2, batch 7200, loss[loss=0.2709, simple_loss=0.3139, pruned_loss=0.1139, over 21628.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3392, pruned_loss=0.1118, over 4283110.14 frames. ], batch size: 247, lr: 1.81e-02, grad_scale: 32.0
2023-06-18 21:41:24,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=226170.0, ans=0.1
2023-06-18 21:41:25,691 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 21:41:37,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=226170.0, ans=0.125
2023-06-18 21:42:33,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=226350.0, ans=0.025
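The per-batch "batch size" above swings between roughly 100 and 700 sentences (118 vs. 702 within a few batches) because batches are packed by total audio duration rather than by count: a bucketing sampler groups cuts of similar length, so the sentence count scales inversely with utterance duration. Illustration only; the duration budget below is a placeholder:

```python
def expected_batch_size(max_duration_s: float, mean_cut_duration_s: float) -> int:
    """Sentences per batch when packing cuts of similar length up to a
    fixed total-duration budget."""
    return max(1, int(max_duration_s / mean_cut_duration_s))

expected_batch_size(900.0, 1.3)   # ~692 short cuts per batch
expected_batch_size(900.0, 9.0)   # 100 long cuts per batch
```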
2023-06-18 21:43:03,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0
2023-06-18 21:43:32,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.074e+02 3.664e+02 4.641e+02 7.244e+02, threshold=7.329e+02, percent-clipped=2.0
2023-06-18 21:43:32,309 INFO [train.py:996] (3/4) Epoch 2, batch 7250, loss[loss=0.3014, simple_loss=0.3312, pruned_loss=0.1357, over 21226.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3342, pruned_loss=0.1119, over 4285302.73 frames. ], batch size: 471, lr: 1.81e-02, grad_scale: 32.0
2023-06-18 21:44:40,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0
2023-06-18 21:44:57,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=226650.0, ans=0.04949747468305833
2023-06-18 21:45:00,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=226650.0, ans=0.1
2023-06-18 21:45:14,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=226710.0, ans=0.1
2023-06-18 21:45:26,282 INFO [train.py:996] (3/4) Epoch 2, batch 7300, loss[loss=0.2275, simple_loss=0.2834, pruned_loss=0.08581, over 21764.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3273, pruned_loss=0.1103, over 4287020.56 frames. ], batch size: 317, lr: 1.81e-02, grad_scale: 32.0
2023-06-18 21:46:06,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=226830.0, ans=0.125
2023-06-18 21:46:25,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=226890.0, ans=0.125
2023-06-18 21:46:29,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=12.0
2023-06-18 21:46:51,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0
2023-06-18 21:47:06,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0
2023-06-18 21:47:36,826 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.896e+02 3.515e+02 4.271e+02 6.710e+02, threshold=7.030e+02, percent-clipped=0.0
2023-06-18 21:47:36,850 INFO [train.py:996] (3/4) Epoch 2, batch 7350, loss[loss=0.2755, simple_loss=0.3102, pruned_loss=0.1203, over 21243.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3261, pruned_loss=0.1105, over 4286198.25 frames. ], batch size: 608, lr: 1.81e-02, grad_scale: 32.0
2023-06-18 21:48:14,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=227130.0, ans=10.0
2023-06-18 21:48:21,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=227130.0, ans=0.125
2023-06-18 21:48:39,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227190.0, ans=0.1
2023-06-18 21:49:49,001 INFO [train.py:996] (3/4) Epoch 2, batch 7400, loss[loss=0.3572, simple_loss=0.4197, pruned_loss=0.1473, over 21515.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3343, pruned_loss=0.1128, over 4283482.02 frames. ], batch size: 509, lr: 1.81e-02, grad_scale: 32.0
2023-06-18 21:50:57,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227490.0, ans=0.1
2023-06-18 21:51:04,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227490.0, ans=0.1
2023-06-18 21:51:16,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=227550.0, ans=0.0
2023-06-18 21:51:24,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=227550.0, ans=0.0
2023-06-18 21:51:28,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.35 vs. limit=10.0
2023-06-18 21:52:10,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.891e+02 3.433e+02 4.412e+02 7.257e+02, threshold=6.867e+02, percent-clipped=1.0
2023-06-18 21:52:10,505 INFO [train.py:996] (3/4) Epoch 2, batch 7450, loss[loss=0.2749, simple_loss=0.3233, pruned_loss=0.1133, over 21609.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3339, pruned_loss=0.111, over 4278135.66 frames. ], batch size: 231, lr: 1.81e-02, grad_scale: 32.0
2023-06-18 21:53:00,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0
2023-06-18 21:53:53,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=227910.0, ans=0.0
2023-06-18 21:54:12,222 INFO [train.py:996] (3/4) Epoch 2, batch 7500, loss[loss=0.2985, simple_loss=0.3867, pruned_loss=0.1051, over 21605.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3401, pruned_loss=0.1124, over 4279450.78 frames. ], batch size: 230, lr: 1.81e-02, grad_scale: 32.0
2023-06-18 21:54:22,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227970.0, ans=0.1
2023-06-18 21:55:15,364 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5
2023-06-18 21:55:17,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=228090.0, ans=0.09899494936611666
2023-06-18 21:56:33,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.064e+02 3.636e+02 4.254e+02 7.923e+02, threshold=7.272e+02, percent-clipped=2.0
2023-06-18 21:56:33,971 INFO [train.py:996] (3/4) Epoch 2, batch 7550, loss[loss=0.2771, simple_loss=0.3051, pruned_loss=0.1245, over 20247.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3466, pruned_loss=0.1105, over 4278784.70 frames. ], batch size: 703, lr: 1.81e-02, grad_scale: 32.0
2023-06-18 21:56:37,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=228270.0, ans=0.0
2023-06-18 21:56:39,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=228270.0, ans=6.0
2023-06-18 21:57:18,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=228330.0, ans=0.0
2023-06-18 21:57:22,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=228390.0, ans=0.0
2023-06-18 21:57:28,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=228390.0, ans=0.125
2023-06-18 21:57:29,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=228390.0, ans=0.125
2023-06-18 21:57:31,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=228390.0, ans=0.0
2023-06-18 21:57:31,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=228390.0, ans=0.0
2023-06-18 21:58:47,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=228510.0, ans=0.125
2023-06-18 21:58:51,174 INFO [train.py:996] (3/4) Epoch 2, batch 7600, loss[loss=0.2726, simple_loss=0.3299, pruned_loss=0.1076, over 21926.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.344, pruned_loss=0.1096, over 4276363.91 frames. ], batch size: 316, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 21:59:17,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=228630.0, ans=0.0
2023-06-18 22:00:45,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=228810.0, ans=0.0
2023-06-18 22:01:06,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.201e+02 4.085e+02 5.282e+02 1.069e+03, threshold=8.169e+02, percent-clipped=6.0
2023-06-18 22:01:06,636 INFO [train.py:996] (3/4) Epoch 2, batch 7650, loss[loss=0.2777, simple_loss=0.3311, pruned_loss=0.1121, over 21595.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3435, pruned_loss=0.1113, over 4286904.03 frames. ], batch size: 195, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 22:01:20,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=228870.0, ans=0.2
2023-06-18 22:01:59,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=228990.0, ans=0.125
2023-06-18 22:03:00,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=229110.0, ans=0.0
2023-06-18 22:03:00,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=229110.0, ans=0.0
2023-06-18 22:03:23,246 INFO [train.py:996] (3/4) Epoch 2, batch 7700, loss[loss=0.3129, simple_loss=0.3683, pruned_loss=0.1288, over 21763.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.347, pruned_loss=0.116, over 4288381.05 frames. ], batch size: 332, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 22:03:31,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=229170.0, ans=0.125
2023-06-18 22:03:34,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=229170.0, ans=0.0
2023-06-18 22:04:08,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=229230.0, ans=0.1
2023-06-18 22:04:19,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=229290.0, ans=0.0
2023-06-18 22:04:21,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0
2023-06-18 22:04:55,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=229350.0, ans=0.125
2023-06-18 22:05:39,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.910e+02 3.418e+02 4.204e+02 6.924e+02, threshold=6.836e+02, percent-clipped=0.0
2023-06-18 22:05:39,918 INFO [train.py:996] (3/4) Epoch 2, batch 7750, loss[loss=0.3315, simple_loss=0.4118, pruned_loss=0.1256, over 21690.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3536, pruned_loss=0.117, over 4285659.05 frames. ], batch size: 298, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 22:05:47,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=229470.0, ans=0.125
2023-06-18 22:06:28,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=229530.0, ans=0.2
2023-06-18 22:06:41,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=229590.0, ans=0.0
2023-06-18 22:07:01,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.61 vs. limit=15.0
2023-06-18 22:07:13,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0
2023-06-18 22:07:13,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=229650.0, ans=0.1
2023-06-18 22:07:34,736 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=15.0
2023-06-18 22:07:38,200 INFO [train.py:996] (3/4) Epoch 2, batch 7800, loss[loss=0.2689, simple_loss=0.3282, pruned_loss=0.1048, over 21815.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3527, pruned_loss=0.1157, over 4277758.06 frames. ], batch size: 317, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 22:08:26,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=229830.0, ans=0.0
2023-06-18 22:08:49,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0
2023-06-18 22:09:40,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=230010.0, ans=0.1
2023-06-18 22:09:46,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.917e+02 3.531e+02 4.288e+02 7.064e+02, threshold=7.063e+02, percent-clipped=1.0
2023-06-18 22:09:46,331 INFO [train.py:996] (3/4) Epoch 2, batch 7850, loss[loss=0.3009, simple_loss=0.3322, pruned_loss=0.1348, over 21373.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3454, pruned_loss=0.1141, over 4267591.80 frames. ], batch size: 473, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 22:09:51,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=230070.0, ans=0.07
2023-06-18 22:10:24,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=230130.0, ans=0.2
2023-06-18 22:10:33,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=12.0
2023-06-18 22:11:15,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=230250.0, ans=0.125
2023-06-18 22:12:06,704 INFO [train.py:996] (3/4) Epoch 2, batch 7900, loss[loss=0.2387, simple_loss=0.2967, pruned_loss=0.09037, over 21171.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3405, pruned_loss=0.112, over 4267582.86 frames. ], batch size: 159, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 22:13:54,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=230550.0, ans=0.2
2023-06-18 22:14:14,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=230610.0, ans=0.0
2023-06-18 22:14:33,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.975e+02 3.296e+02 3.925e+02 7.503e+02, threshold=6.593e+02, percent-clipped=2.0
2023-06-18 22:14:33,535 INFO [train.py:996] (3/4) Epoch 2, batch 7950, loss[loss=0.3123, simple_loss=0.3854, pruned_loss=0.1196, over 21374.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3454, pruned_loss=0.1123, over 4268902.46 frames. ], batch size: 548, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 22:15:50,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=230790.0, ans=0.125
2023-06-18 22:16:29,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=230910.0, ans=0.07
2023-06-18 22:16:37,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0
2023-06-18 22:16:56,561 INFO [train.py:996] (3/4) Epoch 2, batch 8000, loss[loss=0.2622, simple_loss=0.3259, pruned_loss=0.0992, over 21593.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3529, pruned_loss=0.1163, over 4268739.16 frames. ], batch size: 112, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 22:18:00,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.68 vs. limit=22.5
2023-06-18 22:19:34,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.162e+02 3.821e+02 4.807e+02 7.380e+02, threshold=7.642e+02, percent-clipped=7.0
2023-06-18 22:19:34,481 INFO [train.py:996] (3/4) Epoch 2, batch 8050, loss[loss=0.2439, simple_loss=0.2961, pruned_loss=0.09588, over 21255.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.356, pruned_loss=0.1157, over 4259095.33 frames. ], batch size: 159, lr: 1.79e-02, grad_scale: 32.0
2023-06-18 22:19:38,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.64 vs. limit=22.5
2023-06-18 22:19:44,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=231270.0, ans=0.025
2023-06-18 22:20:11,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=231330.0, ans=0.0
2023-06-18 22:20:35,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231390.0, ans=0.1
2023-06-18 22:21:49,801 INFO [train.py:996] (3/4) Epoch 2, batch 8100, loss[loss=0.2971, simple_loss=0.3472, pruned_loss=0.1234, over 21308.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3539, pruned_loss=0.1166, over 4265101.85 frames. ], batch size: 143, lr: 1.79e-02, grad_scale: 32.0
2023-06-18 22:23:34,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=231690.0, ans=0.125
2023-06-18 22:23:35,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.09 vs. limit=15.0
2023-06-18 22:24:01,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=231750.0, ans=0.0
2023-06-18 22:24:02,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=231750.0, ans=0.125
2023-06-18 22:24:25,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0
], batch size: 414, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:24:34,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.988e+02 3.821e+02 5.220e+02 8.604e+02, threshold=7.643e+02, percent-clipped=3.0 2023-06-18 22:24:52,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=231870.0, ans=0.125 2023-06-18 22:25:56,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=232050.0, ans=0.2 2023-06-18 22:26:02,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=232050.0, ans=0.0 2023-06-18 22:26:36,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=232110.0, ans=0.0 2023-06-18 22:26:37,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-18 22:26:44,094 INFO [train.py:996] (3/4) Epoch 2, batch 8200, loss[loss=0.2463, simple_loss=0.3007, pruned_loss=0.0959, over 21629.00 frames. ], tot_loss[loss=0.2922, simple_loss=0.3539, pruned_loss=0.1152, over 4265120.11 frames. ], batch size: 298, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:27:21,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=232230.0, ans=0.0 2023-06-18 22:27:47,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=232290.0, ans=0.07 2023-06-18 22:28:06,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=232350.0, ans=0.125 2023-06-18 22:28:14,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=232350.0, ans=0.125 2023-06-18 22:28:26,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5 2023-06-18 22:28:32,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232410.0, ans=0.1 2023-06-18 22:28:43,250 INFO [train.py:996] (3/4) Epoch 2, batch 8250, loss[loss=0.2279, simple_loss=0.3063, pruned_loss=0.07475, over 21321.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3537, pruned_loss=0.1158, over 4267879.64 frames. ], batch size: 131, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:28:44,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.255e+02 3.894e+02 4.730e+02 9.572e+02, threshold=7.788e+02, percent-clipped=3.0 2023-06-18 22:28:46,759 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:29:33,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=232530.0, ans=0.0 2023-06-18 22:30:13,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=232650.0, ans=0.0 2023-06-18 22:30:21,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.62 vs. 
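
The three loss fields logged per batch are related linearly: throughout this section, loss = 0.5 * simple_loss + pruned_loss to within rounding, e.g. 0.5 x 0.3454 + 0.1141 = 0.2868 for the batch-7850 tot_loss above. The 0.5 weight on the simple (linear-joiner) term is inferred from the logged numbers, not stated anywhere in this log; the check below uses triples copied verbatim from the entries above.

    # (loss, simple_loss, pruned_loss) copied from the tot_loss entries above
    samples = [
        (0.2868, 0.3454, 0.1141),  # epoch 2, batch 7850
        (0.2822, 0.3405, 0.1120),  # epoch 2, batch 7900
        (0.2850, 0.3454, 0.1123),  # epoch 2, batch 7950
    ]
    for loss, simple, pruned in samples:
        assert abs(loss - (0.5 * simple + pruned)) < 5e-4, (loss, simple, pruned)
    print("loss = 0.5 * simple_loss + pruned_loss holds for every sample")
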
limit=12.0 2023-06-18 22:30:33,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=232710.0, ans=0.2 2023-06-18 22:31:07,388 INFO [train.py:996] (3/4) Epoch 2, batch 8300, loss[loss=0.2389, simple_loss=0.3198, pruned_loss=0.079, over 21682.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3499, pruned_loss=0.1121, over 4266701.47 frames. ], batch size: 263, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:31:58,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232830.0, ans=0.1 2023-06-18 22:32:12,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=232890.0, ans=0.0 2023-06-18 22:33:23,373 INFO [train.py:996] (3/4) Epoch 2, batch 8350, loss[loss=0.2398, simple_loss=0.3034, pruned_loss=0.08817, over 21479.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3476, pruned_loss=0.1082, over 4257846.14 frames. ], batch size: 195, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:33:30,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.718e+02 3.193e+02 3.748e+02 6.520e+02, threshold=6.386e+02, percent-clipped=0.0 2023-06-18 22:33:31,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=233070.0, ans=10.0 2023-06-18 22:33:46,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=233070.0, ans=0.1 2023-06-18 22:34:12,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-18 22:34:30,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=233190.0, ans=0.125 2023-06-18 22:35:46,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=233370.0, ans=0.0 2023-06-18 22:35:52,760 INFO [train.py:996] (3/4) Epoch 2, batch 8400, loss[loss=0.213, simple_loss=0.2636, pruned_loss=0.08123, over 21844.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3452, pruned_loss=0.1041, over 4257420.30 frames. ], batch size: 107, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 22:37:17,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=233550.0, ans=0.125 2023-06-18 22:38:00,539 INFO [train.py:996] (3/4) Epoch 2, batch 8450, loss[loss=0.2709, simple_loss=0.326, pruned_loss=0.1079, over 21768.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3425, pruned_loss=0.1047, over 4255382.85 frames. ], batch size: 124, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:38:02,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.629e+02 3.179e+02 4.022e+02 7.095e+02, threshold=6.359e+02, percent-clipped=3.0 2023-06-18 22:38:43,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-18 22:38:50,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=233790.0, ans=0.125 2023-06-18 22:39:14,969 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.05 vs. 
limit=6.0 2023-06-18 22:39:21,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-18 22:39:55,620 INFO [train.py:996] (3/4) Epoch 2, batch 8500, loss[loss=0.2744, simple_loss=0.3219, pruned_loss=0.1135, over 21828.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3399, pruned_loss=0.1078, over 4255873.47 frames. ], batch size: 98, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:40:02,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=233970.0, ans=0.1 2023-06-18 22:40:53,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=234030.0, ans=0.125 2023-06-18 22:40:56,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=234090.0, ans=0.125 2023-06-18 22:41:12,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=234150.0, ans=0.125 2023-06-18 22:42:27,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=234270.0, ans=0.0 2023-06-18 22:42:28,182 INFO [train.py:996] (3/4) Epoch 2, batch 8550, loss[loss=0.3779, simple_loss=0.4647, pruned_loss=0.1455, over 20715.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3471, pruned_loss=0.1118, over 4265854.25 frames. ], batch size: 607, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:42:29,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.313e+02 3.994e+02 5.022e+02 7.456e+02, threshold=7.988e+02, percent-clipped=4.0 2023-06-18 22:43:40,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=234450.0, ans=0.125 2023-06-18 22:44:17,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=234510.0, ans=0.125 2023-06-18 22:44:41,773 INFO [train.py:996] (3/4) Epoch 2, batch 8600, loss[loss=0.3237, simple_loss=0.3746, pruned_loss=0.1364, over 21786.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3523, pruned_loss=0.1147, over 4268698.83 frames. ], batch size: 332, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:45:04,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=234570.0, ans=0.0 2023-06-18 22:45:04,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-18 22:45:20,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=234630.0, ans=0.125 2023-06-18 22:45:37,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-18 22:46:31,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=234750.0, ans=0.125 2023-06-18 22:47:07,831 INFO [train.py:996] (3/4) Epoch 2, batch 8650, loss[loss=0.3209, simple_loss=0.3848, pruned_loss=0.1286, over 21469.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3603, pruned_loss=0.1159, over 4272747.68 frames. 
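
The frequent "ScheduledFloat: name=..., batch_count=..., ans=..." entries show that many regularization hyperparameters (dropout probabilities, skip rates, balancer probabilities, bypass scale floors) are functions of the global batch count rather than constants; each entry prints the value ("ans") in effect at that point. A plausible minimal reading is piecewise-linear interpolation between (batch_count, value) breakpoints, sketched below; the breakpoints are made up for illustration, and the real scaling.py may differ in detail.

    class ScheduledFloat:
        """A float hyperparameter that is piecewise-linear in the global
        batch count (a sketch, not the actual scaling.py implementation)."""

        def __init__(self, *points):
            self.points = sorted(points)  # ((batch_count, value), ...)

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    frac = (batch_count - x0) / (x1 - x0)
                    return y0 + frac * (y1 - y0)

    # hypothetical schedule: a skip rate that decays from 0.1 to 0.0
    skip_rate = ScheduledFloat((0.0, 0.1), (20000.0, 0.025), (50000.0, 0.0))
    print(skip_rate.value(230000.0))  # past the last breakpoint -> 0.0
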
], batch size: 211, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:47:14,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 3.220e+02 3.691e+02 4.621e+02 7.023e+02, threshold=7.382e+02, percent-clipped=0.0 2023-06-18 22:47:16,914 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:48:49,041 INFO [train.py:996] (3/4) Epoch 2, batch 8700, loss[loss=0.224, simple_loss=0.2777, pruned_loss=0.0852, over 21485.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3517, pruned_loss=0.1111, over 4266480.64 frames. ], batch size: 230, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:51:10,604 INFO [train.py:996] (3/4) Epoch 2, batch 8750, loss[loss=0.2911, simple_loss=0.3472, pruned_loss=0.1175, over 21827.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3482, pruned_loss=0.112, over 4274034.00 frames. ], batch size: 298, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:51:12,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.975e+02 3.407e+02 4.208e+02 7.792e+02, threshold=6.814e+02, percent-clipped=1.0 2023-06-18 22:52:21,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=235590.0, ans=0.125 2023-06-18 22:53:34,727 INFO [train.py:996] (3/4) Epoch 2, batch 8800, loss[loss=0.3215, simple_loss=0.4002, pruned_loss=0.1215, over 21767.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3585, pruned_loss=0.1169, over 4280724.74 frames. ], batch size: 332, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:53:43,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=235770.0, ans=0.0 2023-06-18 22:54:27,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-18 22:55:03,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-18 22:55:52,235 INFO [train.py:996] (3/4) Epoch 2, batch 8850, loss[loss=0.255, simple_loss=0.3337, pruned_loss=0.0881, over 21407.00 frames. ], tot_loss[loss=0.3006, simple_loss=0.3644, pruned_loss=0.1184, over 4277645.66 frames. ], batch size: 194, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:55:53,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.077e+02 3.659e+02 4.478e+02 7.244e+02, threshold=7.318e+02, percent-clipped=2.0 2023-06-18 22:56:27,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.44 vs. limit=22.5 2023-06-18 22:56:31,072 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:56:35,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2023-06-18 22:57:40,700 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:57:53,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=236310.0, ans=0.0 2023-06-18 22:57:57,913 INFO [train.py:996] (3/4) Epoch 2, batch 8900, loss[loss=0.3005, simple_loss=0.3676, pruned_loss=0.1167, over 21578.00 frames. 
], tot_loss[loss=0.2966, simple_loss=0.3576, pruned_loss=0.1178, over 4272608.42 frames. ], batch size: 441, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:57:58,365 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:58:20,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236430.0, ans=0.1 2023-06-18 22:59:59,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=236610.0, ans=0.125 2023-06-18 23:00:24,375 INFO [train.py:996] (3/4) Epoch 2, batch 8950, loss[loss=0.2644, simple_loss=0.3301, pruned_loss=0.09932, over 21780.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3564, pruned_loss=0.1159, over 4275078.60 frames. ], batch size: 282, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:00:24,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=236670.0, ans=0.0 2023-06-18 23:00:31,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.983e+02 3.644e+02 4.576e+02 8.464e+02, threshold=7.288e+02, percent-clipped=6.0 2023-06-18 23:01:10,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=236790.0, ans=0.125 2023-06-18 23:01:44,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=236850.0, ans=0.07 2023-06-18 23:02:26,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=236910.0, ans=0.125 2023-06-18 23:02:28,409 INFO [train.py:996] (3/4) Epoch 2, batch 9000, loss[loss=0.2582, simple_loss=0.3209, pruned_loss=0.09775, over 21685.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3492, pruned_loss=0.115, over 4270215.16 frames. ], batch size: 333, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:02:28,410 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 23:03:37,454 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2827, simple_loss=0.3814, pruned_loss=0.09199, over 1796401.00 frames. 2023-06-18 23:03:37,456 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-18 23:03:53,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=236970.0, ans=0.125 2023-06-18 23:04:41,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=237150.0, ans=0.1 2023-06-18 23:05:34,997 INFO [train.py:996] (3/4) Epoch 2, batch 9050, loss[loss=0.3029, simple_loss=0.3625, pruned_loss=0.1216, over 21719.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3441, pruned_loss=0.1103, over 4278638.98 frames. ], batch size: 332, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:05:36,490 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.756e+02 3.216e+02 4.039e+02 6.653e+02, threshold=6.431e+02, percent-clipped=0.0 2023-06-18 23:06:57,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=237450.0, ans=0.0 2023-06-18 23:07:48,317 INFO [train.py:996] (3/4) Epoch 2, batch 9100, loss[loss=0.2603, simple_loss=0.3393, pruned_loss=0.09067, over 21271.00 frames. 
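
At batch 9000 above, training pauses, the full dev set is evaluated (the same 1796401.00 frames as every validation pass), and the frame-weighted validation loss (0.2827) is logged together with the peak-memory high-water mark before training resumes. In outline that is an ordinary no-grad evaluation loop; compute_loss and the loader below are placeholder names, not the actual train.py API.

    import torch

    def validate(model, dev_loader, compute_loss):
        """Frame-weighted average loss over the dev set (a sketch)."""
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                # placeholder API: summed loss over the batch + frame count
                loss_sum, num_frames = compute_loss(model, batch)
                tot_loss += float(loss_sum)
                tot_frames += num_frames
        model.train()
        return tot_loss / tot_frames
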
], tot_loss[loss=0.2907, simple_loss=0.352, pruned_loss=0.1147, over 4280741.67 frames. ], batch size: 159, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:08:08,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=237570.0, ans=0.125 2023-06-18 23:08:22,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=10.0 2023-06-18 23:08:46,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=237690.0, ans=0.125 2023-06-18 23:08:54,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=237690.0, ans=0.1 2023-06-18 23:10:01,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=237810.0, ans=0.125 2023-06-18 23:10:08,430 INFO [train.py:996] (3/4) Epoch 2, batch 9150, loss[loss=0.2744, simple_loss=0.3555, pruned_loss=0.09665, over 21740.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3562, pruned_loss=0.1116, over 4280905.44 frames. ], batch size: 298, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:10:18,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.926e+02 3.668e+02 4.524e+02 1.019e+03, threshold=7.337e+02, percent-clipped=2.0 2023-06-18 23:10:57,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=237930.0, ans=0.125 2023-06-18 23:10:59,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=237930.0, ans=0.0 2023-06-18 23:11:51,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=238050.0, ans=0.0 2023-06-18 23:12:11,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238110.0, ans=0.1 2023-06-18 23:12:29,968 INFO [train.py:996] (3/4) Epoch 2, batch 9200, loss[loss=0.3302, simple_loss=0.3923, pruned_loss=0.134, over 21630.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3579, pruned_loss=0.1102, over 4274842.48 frames. ], batch size: 389, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:12:31,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=238170.0, ans=0.2 2023-06-18 23:12:51,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=238170.0, ans=0.125 2023-06-18 23:14:11,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=238350.0, ans=0.0 2023-06-18 23:14:18,153 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-18 23:14:22,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238410.0, ans=0.1 2023-06-18 23:14:40,576 INFO [train.py:996] (3/4) Epoch 2, batch 9250, loss[loss=0.3325, simple_loss=0.3554, pruned_loss=0.1548, over 21257.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3586, pruned_loss=0.1146, over 4276405.03 frames. 
], batch size: 471, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:14:41,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.826e+02 3.250e+02 3.675e+02 5.270e+02, threshold=6.500e+02, percent-clipped=0.0 2023-06-18 23:16:23,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.55 vs. limit=22.5 2023-06-18 23:16:56,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=238770.0, ans=0.125 2023-06-18 23:17:00,736 INFO [train.py:996] (3/4) Epoch 2, batch 9300, loss[loss=0.2815, simple_loss=0.3561, pruned_loss=0.1034, over 21557.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3529, pruned_loss=0.1139, over 4274499.14 frames. ], batch size: 230, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:17:06,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.51 vs. limit=15.0 2023-06-18 23:17:07,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=238770.0, ans=0.0 2023-06-18 23:18:19,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=238890.0, ans=0.125 2023-06-18 23:18:41,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=238950.0, ans=0.125 2023-06-18 23:19:04,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=239010.0, ans=0.125 2023-06-18 23:19:09,905 INFO [train.py:996] (3/4) Epoch 2, batch 9350, loss[loss=0.3851, simple_loss=0.4301, pruned_loss=0.1701, over 21483.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3599, pruned_loss=0.1154, over 4276840.40 frames. ], batch size: 471, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:19:11,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 3.045e+02 3.493e+02 4.279e+02 7.856e+02, threshold=6.986e+02, percent-clipped=1.0 2023-06-18 23:19:19,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=239070.0, ans=0.125 2023-06-18 23:20:22,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=239190.0, ans=0.0 2023-06-18 23:21:28,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=239370.0, ans=0.0 2023-06-18 23:21:29,693 INFO [train.py:996] (3/4) Epoch 2, batch 9400, loss[loss=0.2512, simple_loss=0.3075, pruned_loss=0.09751, over 21634.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3609, pruned_loss=0.1164, over 4283298.92 frames. ], batch size: 298, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:21:41,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=239370.0, ans=0.125 2023-06-18 23:22:15,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.87 vs. 
limit=15.0 2023-06-18 23:22:42,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=239490.0, ans=0.0 2023-06-18 23:23:15,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-18 23:23:16,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=239550.0, ans=0.125 2023-06-18 23:23:34,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239670.0, ans=0.1 2023-06-18 23:23:40,462 INFO [train.py:996] (3/4) Epoch 2, batch 9450, loss[loss=0.2396, simple_loss=0.2884, pruned_loss=0.0954, over 21484.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.351, pruned_loss=0.1147, over 4277864.19 frames. ], batch size: 195, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:23:41,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.730e+02 3.234e+02 3.774e+02 7.408e+02, threshold=6.469e+02, percent-clipped=1.0 2023-06-18 23:24:03,301 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-18 23:24:08,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=239730.0, ans=0.0 2023-06-18 23:24:22,065 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-18 23:25:06,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=239850.0, ans=0.2 2023-06-18 23:25:15,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=239850.0, ans=0.05 2023-06-18 23:25:31,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=239910.0, ans=0.125 2023-06-18 23:26:05,440 INFO [train.py:996] (3/4) Epoch 2, batch 9500, loss[loss=0.2687, simple_loss=0.3361, pruned_loss=0.1007, over 21842.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3454, pruned_loss=0.1129, over 4267505.02 frames. ], batch size: 316, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:26:10,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=239970.0, ans=0.0 2023-06-18 23:26:22,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.69 vs. limit=22.5 2023-06-18 23:26:55,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=22.5 2023-06-18 23:27:03,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=240090.0, ans=0.0 2023-06-18 23:27:04,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=240090.0, ans=0.125 2023-06-18 23:28:19,627 INFO [train.py:996] (3/4) Epoch 2, batch 9550, loss[loss=0.3293, simple_loss=0.3794, pruned_loss=0.1396, over 21379.00 frames. 
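
The "Whitening: name=..., metric=... vs. limit=..." entries are diagnostics that fire only when a feature-whiteness statistic exceeds its limit. One statistic with the right behavior, and consistent with the logged values all being at least 1, is the sum of squared entries of the channel covariance divided by num_channels times the squared mean of its diagonal: it equals 1.0 exactly when the covariance is a multiple of the identity (perfectly "white" features) and grows as channels become correlated or unevenly scaled. This formula is an assumption about what scaling.py computes, not something the log states; with num_groups > 1 the statistic would presumably be computed per channel group.

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        """x: (num_frames, num_channels). Returns 1.0 when the channel
        covariance is a multiple of the identity, larger otherwise.
        (Assumed form of the metric; the real one may differ.)"""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]          # (C, C) channel covariance
        num_channels = cov.shape[0]
        mean_diag = cov.diagonal().mean()
        return float((cov ** 2).sum() / (num_channels * mean_diag ** 2))

    white = torch.randn(10000, 256)
    mixed = white @ torch.randn(256, 256)     # correlates the channels
    print(whitening_metric(white))            # close to 1.0
    print(whitening_metric(mixed))            # noticeably larger
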
], tot_loss[loss=0.2909, simple_loss=0.3505, pruned_loss=0.1156, over 4264733.47 frames. ], batch size: 131, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:28:21,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.886e+02 3.595e+02 5.008e+02 8.398e+02, threshold=7.190e+02, percent-clipped=11.0 2023-06-18 23:28:38,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=240270.0, ans=0.0 2023-06-18 23:28:38,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=240270.0, ans=0.07 2023-06-18 23:30:36,134 INFO [train.py:996] (3/4) Epoch 2, batch 9600, loss[loss=0.3113, simple_loss=0.361, pruned_loss=0.1308, over 21754.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3521, pruned_loss=0.1169, over 4276259.53 frames. ], batch size: 441, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:30:39,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=240570.0, ans=0.0 2023-06-18 23:30:50,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=240570.0, ans=0.125 2023-06-18 23:31:49,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=240690.0, ans=0.125 2023-06-18 23:31:59,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-18 23:32:09,740 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.37 vs. limit=5.0 2023-06-18 23:32:45,694 INFO [train.py:996] (3/4) Epoch 2, batch 9650, loss[loss=0.3282, simple_loss=0.3841, pruned_loss=0.1362, over 21466.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.3523, pruned_loss=0.1173, over 4277896.51 frames. ], batch size: 131, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:33:00,242 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.899e+02 3.277e+02 4.400e+02 6.787e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-18 23:34:37,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=241050.0, ans=0.125 2023-06-18 23:35:17,952 INFO [train.py:996] (3/4) Epoch 2, batch 9700, loss[loss=0.2784, simple_loss=0.3462, pruned_loss=0.1053, over 21731.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3551, pruned_loss=0.1167, over 4267592.62 frames. ], batch size: 414, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:35:34,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=241170.0, ans=0.125 2023-06-18 23:35:58,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=241230.0, ans=0.125 2023-06-18 23:36:06,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=241290.0, ans=0.0 2023-06-18 23:36:57,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=241410.0, ans=0.125 2023-06-18 23:37:09,756 INFO [train.py:996] (3/4) Epoch 2, batch 9750, loss[loss=0.2772, simple_loss=0.3221, pruned_loss=0.1161, over 21863.00 frames. 
], tot_loss[loss=0.2891, simple_loss=0.3488, pruned_loss=0.1147, over 4261035.33 frames. ], batch size: 98, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:37:11,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.764e+02 3.183e+02 3.672e+02 6.481e+02, threshold=6.367e+02, percent-clipped=0.0 2023-06-18 23:37:15,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=241470.0, ans=0.09899494936611666 2023-06-18 23:37:49,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-18 23:38:00,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=241590.0, ans=0.0 2023-06-18 23:38:02,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=241590.0, ans=0.125 2023-06-18 23:38:10,169 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.58 vs. limit=15.0 2023-06-18 23:39:09,307 INFO [train.py:996] (3/4) Epoch 2, batch 9800, loss[loss=0.2618, simple_loss=0.3312, pruned_loss=0.09619, over 21867.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3464, pruned_loss=0.114, over 4253633.91 frames. ], batch size: 107, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:39:54,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.03 vs. limit=15.0 2023-06-18 23:40:00,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=241890.0, ans=0.125 2023-06-18 23:40:10,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=241890.0, ans=0.0 2023-06-18 23:40:18,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=241890.0, ans=0.1 2023-06-18 23:40:31,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=241950.0, ans=15.0 2023-06-18 23:40:53,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=242010.0, ans=0.0 2023-06-18 23:41:00,963 INFO [train.py:996] (3/4) Epoch 2, batch 9850, loss[loss=0.273, simple_loss=0.3171, pruned_loss=0.1144, over 21758.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3426, pruned_loss=0.1138, over 4252912.42 frames. 
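
Entries such as "...balancer1.prob, ..., ans=0.125" and "...balancer1.min_positive, ..., ans=0.025" belong to activation balancers: modules that, with the scheduled probability, nudge gradients so that each channel keeps its fraction of positive activations and its magnitude within configured bounds. The sketch below only checks those statistics on the forward side, to make the bounds concrete; the real balancer applies its corrections in the backward pass, and the default bounds here are illustrative.

    import torch

    def balancer_violations(x, min_positive=0.05, max_positive=0.95,
                            min_abs=0.2, max_abs=10.0):
        """x: (num_frames, num_channels). Count channels whose statistics
        fall outside the configured bounds (conceptual check only)."""
        frac_positive = (x > 0).float().mean(dim=0)
        rms = (x ** 2).mean(dim=0).sqrt()
        return {
            "too_few_positive": int((frac_positive < min_positive).sum()),
            "too_many_positive": int((frac_positive > max_positive).sum()),
            "rms_too_small": int((rms < min_abs).sum()),
            "rms_too_large": int((rms > max_abs).sum()),
        }

    print(balancer_violations(torch.randn(1000, 256)))  # roughly all zeros
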
], batch size: 415, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:41:02,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 2.994e+02 3.598e+02 4.495e+02 7.134e+02, threshold=7.196e+02, percent-clipped=3.0 2023-06-18 23:42:20,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=242250.0, ans=15.0 2023-06-18 23:42:30,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=242250.0, ans=0.125 2023-06-18 23:42:32,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=242310.0, ans=0.125 2023-06-18 23:42:51,329 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-18 23:43:09,481 INFO [train.py:996] (3/4) Epoch 2, batch 9900, loss[loss=0.3672, simple_loss=0.4591, pruned_loss=0.1377, over 19707.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3402, pruned_loss=0.1133, over 4257070.71 frames. ], batch size: 702, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:44:36,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=242550.0, ans=0.125 2023-06-18 23:45:09,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=242610.0, ans=0.0 2023-06-18 23:45:24,184 INFO [train.py:996] (3/4) Epoch 2, batch 9950, loss[loss=0.2576, simple_loss=0.3063, pruned_loss=0.1044, over 21380.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3425, pruned_loss=0.1159, over 4256517.55 frames. ], batch size: 194, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:45:25,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.032e+02 3.636e+02 4.504e+02 8.608e+02, threshold=7.273e+02, percent-clipped=3.0 2023-06-18 23:45:30,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.92 vs. limit=10.0 2023-06-18 23:45:54,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=242730.0, ans=0.0 2023-06-18 23:46:12,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=242790.0, ans=0.025 2023-06-18 23:46:22,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=242790.0, ans=0.125 2023-06-18 23:47:25,795 INFO [train.py:996] (3/4) Epoch 2, batch 10000, loss[loss=0.32, simple_loss=0.3758, pruned_loss=0.1321, over 21632.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3394, pruned_loss=0.114, over 4251026.70 frames. ], batch size: 415, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:48:03,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=243030.0, ans=0.125 2023-06-18 23:48:12,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=243090.0, ans=0.125 2023-06-18 23:49:46,146 INFO [train.py:996] (3/4) Epoch 2, batch 10050, loss[loss=0.2891, simple_loss=0.3426, pruned_loss=0.1178, over 21617.00 frames. 
], tot_loss[loss=0.2854, simple_loss=0.3416, pruned_loss=0.1146, over 4254610.59 frames. ], batch size: 391, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:49:47,543 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.747e+02 3.338e+02 4.174e+02 6.778e+02, threshold=6.677e+02, percent-clipped=0.0 2023-06-18 23:49:56,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=243270.0, ans=0.0 2023-06-18 23:50:10,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=243270.0, ans=0.125 2023-06-18 23:50:33,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=243390.0, ans=0.5 2023-06-18 23:50:39,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=243390.0, ans=0.125 2023-06-18 23:50:39,578 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-18 23:50:46,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=243390.0, ans=0.1 2023-06-18 23:50:58,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-18 23:51:48,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=243510.0, ans=0.0 2023-06-18 23:51:58,744 INFO [train.py:996] (3/4) Epoch 2, batch 10100, loss[loss=0.2539, simple_loss=0.3163, pruned_loss=0.09574, over 21472.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3355, pruned_loss=0.1106, over 4261358.07 frames. ], batch size: 211, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:51:59,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=243570.0, ans=0.125 2023-06-18 23:52:09,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243570.0, ans=0.1 2023-06-18 23:54:11,448 INFO [train.py:996] (3/4) Epoch 2, batch 10150, loss[loss=0.2992, simple_loss=0.3373, pruned_loss=0.1305, over 21813.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3435, pruned_loss=0.1145, over 4266325.90 frames. ], batch size: 98, lr: 1.75e-02, grad_scale: 64.0 2023-06-18 23:54:12,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-06-18 23:54:12,896 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.705e+02 3.448e+02 4.239e+02 1.060e+03, threshold=6.897e+02, percent-clipped=5.0 2023-06-18 23:54:19,395 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:55:34,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=244050.0, ans=0.0 2023-06-18 23:56:09,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.62 vs. 
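
The grad_scale field at the end of each training entry comes from fp16 mixed-precision training: the loss is multiplied by a dynamic scale before backward so that half-precision gradients do not underflow; the scale is halved whenever non-finite gradients appear and periodically doubled after a stretch of stable steps, which is why it wanders between 16.0, 32.0 and 64.0 across this section. A sketch of that standard policy, with illustrative constants:

    class DynamicLossScale:
        """Halve the scale on overflow; double it after growth_interval
        consecutive clean steps (the usual fp16 loss-scaling policy)."""

        def __init__(self, scale: float = 32.0, growth_interval: int = 2000):
            self.scale = scale
            self.growth_interval = growth_interval
            self.good_steps = 0

        def update(self, found_inf: bool) -> None:
            if found_inf:
                self.scale /= 2.0
                self.good_steps = 0
            else:
                self.good_steps += 1
                if self.good_steps >= self.growth_interval:
                    self.scale *= 2.0
                    self.good_steps = 0
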
limit=15.0 2023-06-18 23:56:12,735 INFO [train.py:996] (3/4) Epoch 2, batch 10200, loss[loss=0.2631, simple_loss=0.3358, pruned_loss=0.09519, over 21560.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3427, pruned_loss=0.1128, over 4268056.92 frames. ], batch size: 389, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:56:23,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=12.0 2023-06-18 23:58:27,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=244410.0, ans=0.2 2023-06-18 23:58:30,104 INFO [train.py:996] (3/4) Epoch 2, batch 10250, loss[loss=0.1994, simple_loss=0.2864, pruned_loss=0.05617, over 21501.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3365, pruned_loss=0.106, over 4263025.73 frames. ], batch size: 195, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:58:31,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=244470.0, ans=0.125 2023-06-18 23:58:32,919 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.447e+02 2.831e+02 3.418e+02 6.746e+02, threshold=5.661e+02, percent-clipped=0.0 2023-06-18 23:58:58,863 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:59:30,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-06-19 00:00:06,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=244710.0, ans=0.125 2023-06-19 00:00:20,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.00 vs. limit=10.0 2023-06-19 00:00:24,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=244710.0, ans=0.2 2023-06-19 00:00:26,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=244770.0, ans=0.0 2023-06-19 00:00:27,730 INFO [train.py:996] (3/4) Epoch 2, batch 10300, loss[loss=0.2991, simple_loss=0.3746, pruned_loss=0.1118, over 21812.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3407, pruned_loss=0.1083, over 4270771.00 frames. ], batch size: 282, lr: 1.75e-02, grad_scale: 32.0 2023-06-19 00:00:31,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=244770.0, ans=0.0 2023-06-19 00:00:35,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=244770.0, ans=0.0 2023-06-19 00:00:51,689 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:01:59,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=244950.0, ans=0.2 2023-06-19 00:02:25,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245010.0, ans=0.1 2023-06-19 00:02:37,392 INFO [train.py:996] (3/4) Epoch 2, batch 10350, loss[loss=0.3215, simple_loss=0.3796, pruned_loss=0.1317, over 21453.00 frames. 
], tot_loss[loss=0.2792, simple_loss=0.3418, pruned_loss=0.1083, over 4274120.39 frames. ], batch size: 471, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:02:45,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 2.780e+02 3.474e+02 4.368e+02 7.573e+02, threshold=6.948e+02, percent-clipped=10.0 2023-06-19 00:03:26,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0 2023-06-19 00:04:03,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=245250.0, ans=0.0 2023-06-19 00:04:04,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=245250.0, ans=0.125 2023-06-19 00:04:36,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=245310.0, ans=0.125 2023-06-19 00:04:49,422 INFO [train.py:996] (3/4) Epoch 2, batch 10400, loss[loss=0.3005, simple_loss=0.3636, pruned_loss=0.1187, over 20727.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3322, pruned_loss=0.1046, over 4271011.82 frames. ], batch size: 607, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:05:01,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=245370.0, ans=0.125 2023-06-19 00:05:19,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=245430.0, ans=0.125 2023-06-19 00:06:46,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2023-06-19 00:06:53,246 INFO [train.py:996] (3/4) Epoch 2, batch 10450, loss[loss=0.2961, simple_loss=0.3621, pruned_loss=0.1151, over 21640.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3373, pruned_loss=0.1084, over 4274934.96 frames. ], batch size: 263, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:07:07,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 3.343e+02 4.131e+02 5.392e+02 9.378e+02, threshold=8.262e+02, percent-clipped=3.0 2023-06-19 00:07:07,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=245670.0, ans=0.125 2023-06-19 00:07:24,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-19 00:09:11,777 INFO [train.py:996] (3/4) Epoch 2, batch 10500, loss[loss=0.2755, simple_loss=0.3244, pruned_loss=0.1133, over 21563.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3376, pruned_loss=0.1077, over 4268113.57 frames. ], batch size: 414, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:09:15,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=245970.0, ans=0.125 2023-06-19 00:09:16,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. 
limit=15.0 2023-06-19 00:09:17,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245970.0, ans=0.1 2023-06-19 00:10:33,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=246150.0, ans=0.0 2023-06-19 00:10:36,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=246150.0, ans=0.5 2023-06-19 00:11:20,154 INFO [train.py:996] (3/4) Epoch 2, batch 10550, loss[loss=0.2543, simple_loss=0.3022, pruned_loss=0.1032, over 21659.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3319, pruned_loss=0.1069, over 4271207.44 frames. ], batch size: 282, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:11:23,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.402e+02 2.798e+02 3.186e+02 5.857e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-19 00:12:35,367 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:12:36,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=246390.0, ans=0.125 2023-06-19 00:12:39,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=246390.0, ans=0.2 2023-06-19 00:13:18,270 INFO [train.py:996] (3/4) Epoch 2, batch 10600, loss[loss=0.2599, simple_loss=0.3456, pruned_loss=0.08711, over 21617.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3273, pruned_loss=0.104, over 4277938.88 frames. ], batch size: 389, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:13:18,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=246570.0, ans=0.125 2023-06-19 00:14:29,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=246690.0, ans=0.1 2023-06-19 00:15:09,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=246750.0, ans=0.2 2023-06-19 00:15:36,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246810.0, ans=0.1 2023-06-19 00:15:45,049 INFO [train.py:996] (3/4) Epoch 2, batch 10650, loss[loss=0.3075, simple_loss=0.3794, pruned_loss=0.1178, over 21647.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3299, pruned_loss=0.1021, over 4278035.73 frames. ], batch size: 414, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:15:55,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.840e+02 3.386e+02 4.217e+02 7.399e+02, threshold=6.773e+02, percent-clipped=7.0 2023-06-19 00:16:52,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246990.0, ans=0.1 2023-06-19 00:17:25,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2023-06-19 00:17:40,514 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:18:08,604 INFO [train.py:996] (3/4) Epoch 2, batch 10700, loss[loss=0.2837, simple_loss=0.3365, pruned_loss=0.1155, over 21409.00 frames. 
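
The many *_skip_rate parameters in the ScheduledFloat entries (attention_skip_rate, conv_skip_rate, ff2_skip_rate, ff3_skip_rate, bypass.skip_rate, ...) read as scheduled probabilities of skipping the corresponding sub-module during training, a stochastic-depth-style regularizer; by this stage of the run most of them have decayed to between 0.0 and roughly 0.1. Below is one way such a scheduled skip could be applied; this is an interpretation of the parameter names, not the recipe's actual code.

    import torch

    def maybe_skip(module, x, skip_rate: float, training: bool = True):
        """With probability skip_rate, bypass `module` during training
        (stochastic-depth-style regularization sketch)."""
        if training and torch.rand(()).item() < skip_rate:
            return x                 # identity: sub-module skipped this batch
        return x + module(x)         # normal residual path

    # e.g. y = maybe_skip(self.feed_forward2, x, ff2_skip_rate) inside a layer
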
], tot_loss[loss=0.2674, simple_loss=0.3294, pruned_loss=0.1027, over 4255397.36 frames. ], batch size: 211, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:18:22,348 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:18:28,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=247170.0, ans=0.1 2023-06-19 00:18:38,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=247230.0, ans=0.125 2023-06-19 00:18:49,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=247230.0, ans=0.0 2023-06-19 00:19:14,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=247290.0, ans=0.125 2023-06-19 00:20:13,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-19 00:20:30,657 INFO [train.py:996] (3/4) Epoch 2, batch 10750, loss[loss=0.3215, simple_loss=0.4062, pruned_loss=0.1184, over 21710.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.3436, pruned_loss=0.1099, over 4261840.04 frames. ], batch size: 414, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:20:33,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.858e+02 3.262e+02 4.061e+02 7.112e+02, threshold=6.525e+02, percent-clipped=1.0 2023-06-19 00:20:56,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=247530.0, ans=0.0 2023-06-19 00:20:59,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=247530.0, ans=0.125 2023-06-19 00:21:11,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=247530.0, ans=0.125 2023-06-19 00:21:13,506 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:21:36,516 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-19 00:22:56,670 INFO [train.py:996] (3/4) Epoch 2, batch 10800, loss[loss=0.3184, simple_loss=0.3735, pruned_loss=0.1316, over 21721.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.348, pruned_loss=0.1098, over 4259283.70 frames. ], batch size: 332, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:22:57,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=22.5 2023-06-19 00:23:03,740 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=15.0 2023-06-19 00:24:21,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=247950.0, ans=0.125 2023-06-19 00:24:22,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=247950.0, ans=0.2 2023-06-19 00:25:00,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=248010.0, ans=0.0 2023-06-19 00:25:17,897 INFO [train.py:996] (3/4) Epoch 2, batch 10850, loss[loss=0.2355, simple_loss=0.3005, pruned_loss=0.08524, over 21573.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3485, pruned_loss=0.1105, over 4251956.50 frames. ], batch size: 263, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:25:21,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.702e+02 3.119e+02 3.804e+02 6.070e+02, threshold=6.238e+02, percent-clipped=0.0 2023-06-19 00:25:22,114 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-19 00:26:20,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248250.0, ans=0.1 2023-06-19 00:26:47,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=248310.0, ans=0.125 2023-06-19 00:27:21,651 INFO [train.py:996] (3/4) Epoch 2, batch 10900, loss[loss=0.2588, simple_loss=0.3097, pruned_loss=0.104, over 20753.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3422, pruned_loss=0.1077, over 4247110.68 frames. ], batch size: 607, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:28:18,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=248490.0, ans=0.2 2023-06-19 00:28:24,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=248490.0, ans=0.125 2023-06-19 00:28:52,051 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:29:18,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=248610.0, ans=0.025 2023-06-19 00:29:25,755 INFO [train.py:996] (3/4) Epoch 2, batch 10950, loss[loss=0.2755, simple_loss=0.3234, pruned_loss=0.1138, over 21993.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3361, pruned_loss=0.1051, over 4252681.88 frames. ], batch size: 103, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:29:28,627 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.875e+02 3.449e+02 4.246e+02 8.484e+02, threshold=6.899e+02, percent-clipped=4.0 2023-06-19 00:29:34,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=248670.0, ans=0.125 2023-06-19 00:30:14,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=248730.0, ans=0.5 2023-06-19 00:30:59,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.23 vs. 
2023-06-19 00:31:43,323 INFO [train.py:996] (3/4) Epoch 2, batch 11000, loss[loss=0.2368, simple_loss=0.3026, pruned_loss=0.08552, over 21835.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3359, pruned_loss=0.1072, over 4258415.62 frames. ], batch size: 98, lr: 1.73e-02, grad_scale: 32.0
2023-06-19 00:33:11,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=22.5
2023-06-19 00:33:11,964 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 00:33:47,309 INFO [train.py:996] (3/4) Epoch 2, batch 11050, loss[loss=0.2378, simple_loss=0.2895, pruned_loss=0.09299, over 21488.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3344, pruned_loss=0.1081, over 4259828.73 frames. ], batch size: 212, lr: 1.73e-02, grad_scale: 32.0
2023-06-19 00:33:50,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.804e+02 3.313e+02 4.113e+02 7.332e+02, threshold=6.626e+02, percent-clipped=1.0
2023-06-19 00:33:58,896 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0
2023-06-19 00:34:27,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=249330.0, ans=0.2
2023-06-19 00:34:28,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=249330.0, ans=0.125
2023-06-19 00:34:42,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=249390.0, ans=0.125
2023-06-19 00:35:08,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=249450.0, ans=0.0
2023-06-19 00:35:41,939 INFO [train.py:996] (3/4) Epoch 2, batch 11100, loss[loss=0.2736, simple_loss=0.3284, pruned_loss=0.1094, over 21726.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3332, pruned_loss=0.1093, over 4249636.71 frames. ], batch size: 351, lr: 1.73e-02, grad_scale: 32.0
2023-06-19 00:35:44,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=22.5
2023-06-19 00:35:53,699 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0
2023-06-19 00:36:08,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=249630.0, ans=0.2
2023-06-19 00:36:31,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=249690.0, ans=0.125
2023-06-19 00:36:56,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.02 vs. limit=10.0
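The [scaling.py:962] Whitening entries compare a per-module whitening metric against a limit; the metric measures how far the covariance of the module's output channels is from being proportional to the identity. One plausible metric with the logged behaviour is the ratio below, which equals 1.0 for a perfectly white covariance and grows with the eigenvalue spread; treat it as an assumed stand-in rather than the exact formula in scaling.py.

import torch

def whitening_metric(x, num_groups):
    # x: (num_frames, num_channels). Returns mean(diag(C @ C)) / mean(diag(C))**2
    # over channel groups, where C is the channel covariance; 1.0 when C is white.
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0)                       # zero-mean per channel
    covar = torch.einsum("ngc,ngd->gcd", x, x) / num_frames
    mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
    mean_diag_of_sq = (covar @ covar).diagonal(dim1=1, dim2=2).mean()
    return float(mean_diag_of_sq / mean_diag**2)

# close to 1 for white noise, larger for correlated channels:
print(whitening_metric(torch.randn(1000, 256), num_groups=1))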
2023-06-19 00:37:06,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=249750.0, ans=0.125
2023-06-19 00:37:19,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=249810.0, ans=0.125
2023-06-19 00:37:27,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=249810.0, ans=0.5
2023-06-19 00:37:45,292 INFO [train.py:996] (3/4) Epoch 2, batch 11150, loss[loss=0.3042, simple_loss=0.3573, pruned_loss=0.1255, over 21672.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3323, pruned_loss=0.109, over 4248276.29 frames. ], batch size: 332, lr: 1.73e-02, grad_scale: 16.0
2023-06-19 00:37:49,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.743e+02 3.116e+02 3.616e+02 5.732e+02, threshold=6.232e+02, percent-clipped=1.0
2023-06-19 00:38:26,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=249930.0, ans=0.025
2023-06-19 00:38:30,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=249930.0, ans=0.0
2023-06-19 00:38:57,401 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0
2023-06-19 00:39:28,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=250110.0, ans=0.0
2023-06-19 00:39:38,500 INFO [train.py:996] (3/4) Epoch 2, batch 11200, loss[loss=0.2775, simple_loss=0.3161, pruned_loss=0.1195, over 21550.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3299, pruned_loss=0.1084, over 4257850.49 frames. ], batch size: 443, lr: 1.73e-02, grad_scale: 32.0
2023-06-19 00:39:54,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=250170.0, ans=0.0
2023-06-19 00:41:05,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=250350.0, ans=0.125
2023-06-19 00:41:12,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=250410.0, ans=0.07
2023-06-19 00:41:27,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250410.0, ans=0.1
2023-06-19 00:41:31,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=250410.0, ans=0.2
2023-06-19 00:41:52,952 INFO [train.py:996] (3/4) Epoch 2, batch 11250, loss[loss=0.2731, simple_loss=0.3259, pruned_loss=0.1102, over 21176.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3302, pruned_loss=0.1078, over 4251005.04 frames. ], batch size: 176, lr: 1.73e-02, grad_scale: 32.0
2023-06-19 00:41:57,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.808e+02 3.237e+02 3.673e+02 7.595e+02, threshold=6.473e+02, percent-clipped=2.0
2023-06-19 00:42:43,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=250530.0, ans=0.0
2023-06-19 00:42:56,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.76 vs. limit=22.5
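The three numbers in every [train.py:996] entry are related: throughout this log, loss = 0.5 * simple_loss + pruned_loss (for batch 11200 above, 0.5 * 0.3299 + 0.1084 = 0.2733), i.e. the simple linear-joiner transducer loss is down-weighted against the pruned full-joiner loss. A sketch of the combination as the log reports it; whether the 0.5 weight is fixed or scheduled is not visible here, so treat that as an assumption.

def combine_transducer_losses(simple_loss, pruned_loss, simple_loss_scale=0.5):
    # Combination satisfied by the logged triples:
    # loss = simple_loss_scale * simple_loss + pruned_loss
    return simple_loss_scale * simple_loss + pruned_loss

# batch 11200 above: tot_loss 0.2733 from simple 0.3299 and pruned 0.1084
assert abs(combine_transducer_losses(0.3299, 0.1084) - 0.2733) < 5e-4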
2023-06-19 00:43:20,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0
2023-06-19 00:44:02,658 INFO [train.py:996] (3/4) Epoch 2, batch 11300, loss[loss=0.2504, simple_loss=0.3111, pruned_loss=0.09485, over 21870.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3313, pruned_loss=0.1077, over 4256472.29 frames. ], batch size: 118, lr: 1.73e-02, grad_scale: 32.0
2023-06-19 00:44:58,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=250830.0, ans=0.125
2023-06-19 00:45:13,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=250890.0, ans=0.125
2023-06-19 00:45:53,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=250950.0, ans=0.0
2023-06-19 00:46:16,911 INFO [train.py:996] (3/4) Epoch 2, batch 11350, loss[loss=0.2777, simple_loss=0.3456, pruned_loss=0.1049, over 21768.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3307, pruned_loss=0.1061, over 4262590.15 frames. ], batch size: 124, lr: 1.72e-02, grad_scale: 32.0
2023-06-19 00:46:23,546 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.745e+02 3.314e+02 3.869e+02 6.937e+02, threshold=6.629e+02, percent-clipped=3.0
2023-06-19 00:46:50,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=251070.0, ans=0.0
2023-06-19 00:47:03,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=251130.0, ans=0.95
2023-06-19 00:47:50,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=251190.0, ans=0.125
2023-06-19 00:48:05,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251250.0, ans=0.1
2023-06-19 00:48:12,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251310.0, ans=0.1
2023-06-19 00:48:31,027 INFO [train.py:996] (3/4) Epoch 2, batch 11400, loss[loss=0.3345, simple_loss=0.3876, pruned_loss=0.1408, over 21752.00 frames. ], tot_loss[loss=0.281, simple_loss=0.3401, pruned_loss=0.111, over 4264998.81 frames. ], batch size: 441, lr: 1.72e-02, grad_scale: 32.0
2023-06-19 00:50:59,196 INFO [train.py:996] (3/4) Epoch 2, batch 11450, loss[loss=0.2394, simple_loss=0.3129, pruned_loss=0.08301, over 21435.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3414, pruned_loss=0.1106, over 4267172.51 frames. ], batch size: 211, lr: 1.72e-02, grad_scale: 32.0
2023-06-19 00:51:17,500 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.931e+02 3.430e+02 4.239e+02 8.118e+02, threshold=6.860e+02, percent-clipped=4.0
2023-06-19 00:51:50,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0
2023-06-19 00:51:59,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5
2023-06-19 00:52:09,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=251790.0, ans=10.0
2023-06-19 00:52:31,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=251850.0, ans=0.125
2023-06-19 00:53:14,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=251910.0, ans=0.125
2023-06-19 00:53:23,627 INFO [train.py:996] (3/4) Epoch 2, batch 11500, loss[loss=0.2445, simple_loss=0.2781, pruned_loss=0.1054, over 20721.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.343, pruned_loss=0.1109, over 4261732.33 frames. ], batch size: 608, lr: 1.72e-02, grad_scale: 32.0
2023-06-19 00:53:35,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=251970.0, ans=0.125
2023-06-19 00:53:42,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=252030.0, ans=0.95
2023-06-19 00:54:09,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=252030.0, ans=0.0
2023-06-19 00:54:10,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=252030.0, ans=0.04949747468305833
2023-06-19 00:54:23,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=252090.0, ans=0.125
2023-06-19 00:54:47,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=252150.0, ans=0.125
2023-06-19 00:55:52,490 INFO [train.py:996] (3/4) Epoch 2, batch 11550, loss[loss=0.2923, simple_loss=0.3439, pruned_loss=0.1203, over 21438.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3489, pruned_loss=0.1115, over 4257399.19 frames. ], batch size: 131, lr: 1.72e-02, grad_scale: 32.0
2023-06-19 00:56:10,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.930e+02 3.008e+02 3.655e+02 4.469e+02 7.997e+02, threshold=7.310e+02, percent-clipped=1.0
2023-06-19 00:58:11,601 INFO [train.py:996] (3/4) Epoch 2, batch 11600, loss[loss=0.2548, simple_loss=0.3362, pruned_loss=0.08665, over 21866.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3635, pruned_loss=0.1127, over 4263016.07 frames. ], batch size: 107, lr: 1.72e-02, grad_scale: 32.0
2023-06-19 00:58:12,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=252570.0, ans=0.0
2023-06-19 00:58:58,207 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.01 vs. limit=15.0
2023-06-19 00:58:59,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=252630.0, ans=15.0
2023-06-19 00:59:33,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=252750.0, ans=0.0
2023-06-19 01:00:26,315 INFO [train.py:996] (3/4) Epoch 2, batch 11650, loss[loss=0.2836, simple_loss=0.3734, pruned_loss=0.09689, over 21426.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3695, pruned_loss=0.1123, over 4255854.61 frames. ], batch size: 194, lr: 1.72e-02, grad_scale: 32.0
2023-06-19 01:00:30,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.719e+02 3.568e+02 4.744e+02 9.384e+02, threshold=7.136e+02, percent-clipped=4.0
2023-06-19 01:01:15,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0
2023-06-19 01:02:25,802 INFO [train.py:996] (3/4) Epoch 2, batch 11700, loss[loss=0.2441, simple_loss=0.2974, pruned_loss=0.0954, over 21630.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3606, pruned_loss=0.1117, over 4257719.36 frames. ], batch size: 282, lr: 1.72e-02, grad_scale: 16.0
2023-06-19 01:03:05,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=253230.0, ans=0.0
2023-06-19 01:03:08,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=253230.0, ans=0.125
2023-06-19 01:03:35,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=253350.0, ans=0.125
2023-06-19 01:03:53,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0
2023-06-19 01:04:04,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.36 vs. limit=10.0
2023-06-19 01:04:23,353 INFO [train.py:996] (3/4) Epoch 2, batch 11750, loss[loss=0.253, simple_loss=0.3021, pruned_loss=0.1019, over 21988.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3515, pruned_loss=0.1116, over 4266664.14 frames. ], batch size: 103, lr: 1.72e-02, grad_scale: 16.0
2023-06-19 01:04:42,020 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 3.061e+02 3.639e+02 4.435e+02 7.294e+02, threshold=7.278e+02, percent-clipped=2.0
2023-06-19 01:05:01,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253530.0, ans=0.1
2023-06-19 01:05:01,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=253530.0, ans=0.125
2023-06-19 01:05:46,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=22.5
2023-06-19 01:06:16,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=253710.0, ans=0.2
2023-06-19 01:06:43,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0
2023-06-19 01:06:45,927 INFO [train.py:996] (3/4) Epoch 2, batch 11800, loss[loss=0.2998, simple_loss=0.3891, pruned_loss=0.1052, over 21878.00 frames. ], tot_loss[loss=0.2938, simple_loss=0.3561, pruned_loss=0.1157, over 4264865.59 frames. ], batch size: 372, lr: 1.72e-02, grad_scale: 16.0
2023-06-19 01:07:03,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=253770.0, ans=0.0
2023-06-19 01:07:44,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=253890.0, ans=0.125
2023-06-19 01:08:33,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=6.0
2023-06-19 01:08:47,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=254070.0, ans=0.2
2023-06-19 01:08:48,977 INFO [train.py:996] (3/4) Epoch 2, batch 11850, loss[loss=0.2458, simple_loss=0.332, pruned_loss=0.07982, over 21732.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3576, pruned_loss=0.1151, over 4267936.33 frames. ], batch size: 298, lr: 1.71e-02, grad_scale: 16.0
2023-06-19 01:09:00,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.827e+02 3.352e+02 4.207e+02 5.671e+02, threshold=6.705e+02, percent-clipped=0.0
2023-06-19 01:09:01,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=254070.0, ans=0.2
2023-06-19 01:09:59,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=254190.0, ans=0.2
2023-06-19 01:10:13,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0
2023-06-19 01:10:31,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0
2023-06-19 01:10:45,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5
2023-06-19 01:11:01,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=254370.0, ans=0.0
2023-06-19 01:11:08,824 INFO [train.py:996] (3/4) Epoch 2, batch 11900, loss[loss=0.2521, simple_loss=0.3219, pruned_loss=0.09115, over 21741.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3551, pruned_loss=0.1112, over 4271073.46 frames. ], batch size: 282, lr: 1.71e-02, grad_scale: 16.0
2023-06-19 01:11:17,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5
2023-06-19 01:11:27,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=254370.0, ans=0.125
2023-06-19 01:11:45,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=254430.0, ans=0.1
2023-06-19 01:11:45,427 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 01:11:58,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=254430.0, ans=0.95
2023-06-19 01:13:24,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=254610.0, ans=0.07
2023-06-19 01:13:30,195 INFO [train.py:996] (3/4) Epoch 2, batch 11950, loss[loss=0.2143, simple_loss=0.3026, pruned_loss=0.06301, over 21770.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3541, pruned_loss=0.1078, over 4265020.15 frames. ], batch size: 316, lr: 1.71e-02, grad_scale: 16.0
2023-06-19 01:13:35,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.552e+02 2.960e+02 3.499e+02 5.476e+02, threshold=5.920e+02, percent-clipped=0.0
2023-06-19 01:14:58,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=254850.0, ans=0.125
2023-06-19 01:15:47,314 INFO [train.py:996] (3/4) Epoch 2, batch 12000, loss[loss=0.2701, simple_loss=0.3296, pruned_loss=0.1053, over 22007.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3473, pruned_loss=0.1044, over 4259905.87 frames. ], batch size: 103, lr: 1.71e-02, grad_scale: 32.0
2023-06-19 01:15:47,315 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-19 01:16:33,945 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7938, 4.2137, 3.9145, 4.0439], device='cuda:3')
2023-06-19 01:16:40,668 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2909, simple_loss=0.3809, pruned_loss=0.1004, over 1796401.00 frames.
2023-06-19 01:16:40,669 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-19 01:17:05,799 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0
2023-06-19 01:17:11,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=255030.0, ans=0.125
2023-06-19 01:17:46,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=255090.0, ans=0.1
2023-06-19 01:18:19,058 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0
2023-06-19 01:18:42,666 INFO [train.py:996] (3/4) Epoch 2, batch 12050, loss[loss=0.337, simple_loss=0.3709, pruned_loss=0.1516, over 21635.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3444, pruned_loss=0.1076, over 4262662.40 frames. ], batch size: 471, lr: 1.71e-02, grad_scale: 32.0
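During the batch-12000 validation pass above, zipformer.py:1728 prints the mean entropy of one module's attention distributions, one value per head (tensor([1.7938, 4.2137, 3.9145, 4.0439])). Entropy near log(src_len) means nearly uniform attention; a low value such as 1.79 means that head attends much more selectively. A sketch of the computation follows, with the exact reduction over batch and positions assumed:

import torch

def attn_weights_entropy(attn_weights):
    # attn_weights: (num_heads, batch, tgt_len, src_len), rows sum to 1.
    # Returns the mean entropy in nats, one value per head.
    eps = 1e-20
    entropy = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)
    return entropy.mean(dim=(1, 2))

# uniform attention over 64 source positions -> entropy log(64) ~= 4.16,
# comparable to the three heads near 4 in the logged tensor:
w = torch.full((4, 2, 10, 64), 1.0 / 64)
print(attn_weights_entropy(w))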
2023-06-19 01:18:57,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.833e+02 3.629e+02 4.945e+02 8.634e+02, threshold=7.258e+02, percent-clipped=13.0
2023-06-19 01:19:26,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=255330.0, ans=0.125
2023-06-19 01:19:39,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=255390.0, ans=0.0
2023-06-19 01:20:08,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.84 vs. limit=12.0
2023-06-19 01:20:56,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=255570.0, ans=0.125
2023-06-19 01:20:56,934 INFO [train.py:996] (3/4) Epoch 2, batch 12100, loss[loss=0.3263, simple_loss=0.3747, pruned_loss=0.139, over 21367.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3495, pruned_loss=0.1127, over 4275377.99 frames. ], batch size: 176, lr: 1.71e-02, grad_scale: 32.0
2023-06-19 01:22:02,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=255690.0, ans=0.125
2023-06-19 01:22:21,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=255750.0, ans=10.0
2023-06-19 01:22:27,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=255750.0, ans=0.1
2023-06-19 01:22:27,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=255750.0, ans=0.0
2023-06-19 01:23:41,035 INFO [train.py:996] (3/4) Epoch 2, batch 12150, loss[loss=0.2428, simple_loss=0.3127, pruned_loss=0.08645, over 21230.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3508, pruned_loss=0.1124, over 4274498.81 frames. ], batch size: 159, lr: 1.71e-02, grad_scale: 32.0
2023-06-19 01:23:45,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=255870.0, ans=0.0
2023-06-19 01:23:50,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.161e+02 4.078e+02 5.260e+02 8.280e+02, threshold=8.155e+02, percent-clipped=4.0
2023-06-19 01:25:40,339 INFO [train.py:996] (3/4) Epoch 2, batch 12200, loss[loss=0.2746, simple_loss=0.3184, pruned_loss=0.1154, over 21542.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3469, pruned_loss=0.1119, over 4273394.45 frames. ], batch size: 414, lr: 1.71e-02, grad_scale: 32.0
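The grad_scale field in the [train.py:996] entries is the dynamic fp16 loss scale; it sits at 32.0 here and elsewhere in this log drops to 16.0 and later recovers, consistent with the usual dynamic loss-scaling scheme. A minimal sketch of that mechanism with PyTorch's GradScaler; the model, data and loss below are placeholders, not the recipe's:

import torch

model = torch.nn.Linear(80, 512).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

def training_step(features, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(features), targets)
    scaler.scale(loss).backward()  # backprop on the scaled loss
    scaler.step(optimizer)         # unscales; skips the step on inf/nan grads
    scaler.update()                # halves the scale on overflow, else grows it
    return loss.item(), scaler.get_scale()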
2023-06-19 01:26:09,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=256230.0, ans=0.125
2023-06-19 01:26:40,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=256290.0, ans=0.125
2023-06-19 01:27:44,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=256410.0, ans=0.2
2023-06-19 01:27:48,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=256470.0, ans=0.125
2023-06-19 01:27:49,614 INFO [train.py:996] (3/4) Epoch 2, batch 12250, loss[loss=0.2319, simple_loss=0.3003, pruned_loss=0.08178, over 21788.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3389, pruned_loss=0.1071, over 4262007.27 frames. ], batch size: 352, lr: 1.71e-02, grad_scale: 16.0
2023-06-19 01:28:02,134 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.815e+02 3.229e+02 3.820e+02 6.594e+02, threshold=6.459e+02, percent-clipped=0.0
2023-06-19 01:28:06,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=256470.0, ans=0.125
2023-06-19 01:28:50,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256590.0, ans=0.1
2023-06-19 01:29:57,193 INFO [train.py:996] (3/4) Epoch 2, batch 12300, loss[loss=0.1702, simple_loss=0.2366, pruned_loss=0.05194, over 21252.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.331, pruned_loss=0.09989, over 4260840.89 frames. ], batch size: 159, lr: 1.71e-02, grad_scale: 16.0
2023-06-19 01:32:07,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=257010.0, ans=0.125
2023-06-19 01:32:13,371 INFO [train.py:996] (3/4) Epoch 2, batch 12350, loss[loss=0.2896, simple_loss=0.34, pruned_loss=0.1196, over 21857.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3342, pruned_loss=0.09977, over 4269849.91 frames. ], batch size: 124, lr: 1.70e-02, grad_scale: 16.0
2023-06-19 01:32:25,737 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.742e+02 3.231e+02 4.296e+02 8.197e+02, threshold=6.463e+02, percent-clipped=4.0
2023-06-19 01:32:49,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=257130.0, ans=0.125
2023-06-19 01:33:07,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=257190.0, ans=0.125
2023-06-19 01:33:49,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=257250.0, ans=0.2
2023-06-19 01:34:00,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=257250.0, ans=0.125
2023-06-19 01:34:09,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=257310.0, ans=0.125
2023-06-19 01:34:17,933 INFO [train.py:996] (3/4) Epoch 2, batch 12400, loss[loss=0.2899, simple_loss=0.3463, pruned_loss=0.1168, over 21258.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3375, pruned_loss=0.1041, over 4272174.35 frames. ], batch size: 176, lr: 1.70e-02, grad_scale: 32.0
2023-06-19 01:34:44,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=257370.0, ans=0.2
2023-06-19 01:36:05,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=257550.0, ans=0.125
2023-06-19 01:36:10,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0
2023-06-19 01:36:36,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257610.0, ans=0.1
2023-06-19 01:37:04,444 INFO [train.py:996] (3/4) Epoch 2, batch 12450, loss[loss=0.279, simple_loss=0.346, pruned_loss=0.106, over 21685.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3417, pruned_loss=0.1087, over 4270791.57 frames. ], batch size: 389, lr: 1.70e-02, grad_scale: 32.0
2023-06-19 01:37:17,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.991e+02 3.683e+02 4.445e+02 7.854e+02, threshold=7.366e+02, percent-clipped=4.0
2023-06-19 01:37:28,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0
2023-06-19 01:39:14,094 INFO [train.py:996] (3/4) Epoch 2, batch 12500, loss[loss=0.3481, simple_loss=0.4178, pruned_loss=0.1391, over 21412.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3547, pruned_loss=0.1142, over 4276144.91 frames. ], batch size: 131, lr: 1.70e-02, grad_scale: 32.0
2023-06-19 01:39:15,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.63 vs. limit=6.0
2023-06-19 01:39:16,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257970.0, ans=0.1
2023-06-19 01:41:32,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=258270.0, ans=0.2
2023-06-19 01:41:33,987 INFO [train.py:996] (3/4) Epoch 2, batch 12550, loss[loss=0.2976, simple_loss=0.407, pruned_loss=0.09411, over 20731.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3619, pruned_loss=0.1172, over 4277907.67 frames. ], batch size: 608, lr: 1.70e-02, grad_scale: 32.0
2023-06-19 01:41:47,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.352e+02 3.726e+02 4.299e+02 8.195e+02, threshold=7.451e+02, percent-clipped=1.0
2023-06-19 01:41:59,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=258330.0, ans=0.04949747468305833
2023-06-19 01:44:05,953 INFO [train.py:996] (3/4) Epoch 2, batch 12600, loss[loss=0.2221, simple_loss=0.3022, pruned_loss=0.07095, over 21472.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3584, pruned_loss=0.1136, over 4274532.46 frames. ], batch size: 212, lr: 1.70e-02, grad_scale: 32.0
2023-06-19 01:44:09,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=258570.0, ans=0.0
2023-06-19 01:46:08,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=258870.0, ans=0.125
2023-06-19 01:46:09,509 INFO [train.py:996] (3/4) Epoch 2, batch 12650, loss[loss=0.271, simple_loss=0.3269, pruned_loss=0.1076, over 21754.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3486, pruned_loss=0.1092, over 4271448.67 frames. ], batch size: 247, lr: 1.70e-02, grad_scale: 32.0
2023-06-19 01:46:28,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.520e+02 3.258e+02 4.283e+02 8.969e+02, threshold=6.516e+02, percent-clipped=1.0
2023-06-19 01:46:45,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5
2023-06-19 01:46:49,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=258930.0, ans=0.125
2023-06-19 01:46:55,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=258930.0, ans=0.125
2023-06-19 01:48:22,474 INFO [train.py:996] (3/4) Epoch 2, batch 12700, loss[loss=0.3763, simple_loss=0.4116, pruned_loss=0.1705, over 21442.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.35, pruned_loss=0.1123, over 4269824.27 frames. ], batch size: 471, lr: 1.70e-02, grad_scale: 32.0
2023-06-19 01:48:34,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=259170.0, ans=10.0
2023-06-19 01:49:14,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=259230.0, ans=0.015
2023-06-19 01:49:34,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=259290.0, ans=0.1
2023-06-19 01:49:36,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=259290.0, ans=0.2
2023-06-19 01:49:49,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=259350.0, ans=0.0
2023-06-19 01:50:30,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=259410.0, ans=0.0
2023-06-19 01:50:32,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=259470.0, ans=0.2
2023-06-19 01:50:33,913 INFO [train.py:996] (3/4) Epoch 2, batch 12750, loss[loss=0.2949, simple_loss=0.3505, pruned_loss=0.1197, over 21896.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3532, pruned_loss=0.1131, over 4274136.54 frames. ], batch size: 118, lr: 1.70e-02, grad_scale: 32.0
2023-06-19 01:50:59,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.841e+02 3.473e+02 4.427e+02 7.212e+02, threshold=6.945e+02, percent-clipped=3.0
2023-06-19 01:51:40,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=259590.0, ans=0.125
2023-06-19 01:52:26,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=259710.0, ans=0.2
2023-06-19 01:52:41,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=259710.0, ans=0.5
2023-06-19 01:52:59,511 INFO [train.py:996] (3/4) Epoch 2, batch 12800, loss[loss=0.3068, simple_loss=0.3568, pruned_loss=0.1284, over 21871.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3513, pruned_loss=0.1133, over 4280972.80 frames. ], batch size: 414, lr: 1.70e-02, grad_scale: 32.0
2023-06-19 01:53:00,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=259770.0, ans=0.0
2023-06-19 01:53:01,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=259770.0, ans=0.125
2023-06-19 01:53:57,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0
2023-06-19 01:53:58,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=259890.0, ans=0.95
2023-06-19 01:54:19,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=259950.0, ans=0.1
2023-06-19 01:55:03,239 INFO [train.py:996] (3/4) Epoch 2, batch 12850, loss[loss=0.3135, simple_loss=0.3723, pruned_loss=0.1274, over 21748.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3529, pruned_loss=0.1143, over 4281584.86 frames. ], batch size: 441, lr: 1.69e-02, grad_scale: 32.0
2023-06-19 01:55:12,226 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.824e+02 3.270e+02 4.199e+02 6.279e+02, threshold=6.541e+02, percent-clipped=0.0
2023-06-19 01:56:04,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=260130.0, ans=0.07
2023-06-19 01:57:04,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=260310.0, ans=0.125
2023-06-19 01:57:28,199 INFO [train.py:996] (3/4) Epoch 2, batch 12900, loss[loss=0.2266, simple_loss=0.3092, pruned_loss=0.07203, over 21619.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3504, pruned_loss=0.1092, over 4278948.12 frames. ], batch size: 263, lr: 1.69e-02, grad_scale: 32.0
2023-06-19 01:58:26,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=260490.0, ans=0.125
2023-06-19 01:58:34,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0
2023-06-19 01:59:14,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=260550.0, ans=0.125
2023-06-19 01:59:34,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=260610.0, ans=0.125
2023-06-19 01:59:54,193 INFO [train.py:996] (3/4) Epoch 2, batch 12950, loss[loss=0.3125, simple_loss=0.3647, pruned_loss=0.1302, over 21495.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3496, pruned_loss=0.1065, over 4272492.35 frames. ], batch size: 194, lr: 1.69e-02, grad_scale: 32.0
2023-06-19 01:59:55,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.34 vs. limit=10.0
2023-06-19 02:00:01,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.844e+02 3.449e+02 4.156e+02 6.439e+02, threshold=6.898e+02, percent-clipped=0.0
2023-06-19 02:01:07,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5
2023-06-19 02:01:16,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=260850.0, ans=0.125
2023-06-19 02:01:50,609 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:01:54,510 INFO [train.py:996] (3/4) Epoch 2, batch 13000, loss[loss=0.3393, simple_loss=0.3972, pruned_loss=0.1407, over 21395.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3512, pruned_loss=0.1085, over 4277483.27 frames. ], batch size: 549, lr: 1.69e-02, grad_scale: 32.0
2023-06-19 02:02:02,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=260970.0, ans=0.0
2023-06-19 02:02:11,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=261030.0, ans=0.015
2023-06-19 02:02:28,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.88 vs. limit=15.0
2023-06-19 02:02:48,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=261090.0, ans=0.125
2023-06-19 02:03:05,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=261150.0, ans=0.2
2023-06-19 02:03:21,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=261150.0, ans=6.0
2023-06-19 02:03:40,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0
2023-06-19 02:03:51,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=261210.0, ans=0.0
2023-06-19 02:04:04,224 INFO [train.py:996] (3/4) Epoch 2, batch 13050, loss[loss=0.2941, simple_loss=0.3467, pruned_loss=0.1207, over 21951.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3487, pruned_loss=0.1064, over 4273351.43 frames. ], batch size: 333, lr: 1.69e-02, grad_scale: 32.0
2023-06-19 02:04:11,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.31 vs. limit=10.0
2023-06-19 02:04:11,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.722e+02 3.295e+02 4.146e+02 8.681e+02, threshold=6.589e+02, percent-clipped=5.0
2023-06-19 02:04:18,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=261330.0, ans=0.0
2023-06-19 02:04:41,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=261390.0, ans=0.125
2023-06-19 02:04:43,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=15.0
2023-06-19 02:05:59,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=261510.0, ans=0.125
2023-06-19 02:06:09,596 INFO [train.py:996] (3/4) Epoch 2, batch 13100, loss[loss=0.3329, simple_loss=0.387, pruned_loss=0.1394, over 21244.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3482, pruned_loss=0.1067, over 4274229.94 frames. ], batch size: 143, lr: 1.69e-02, grad_scale: 32.0
2023-06-19 02:06:23,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=261570.0, ans=0.125
2023-06-19 02:06:53,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=261630.0, ans=0.125
2023-06-19 02:07:27,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=261690.0, ans=0.125
2023-06-19 02:07:32,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=261690.0, ans=0.125
2023-06-19 02:07:38,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=261690.0, ans=0.125
2023-06-19 02:08:19,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=261810.0, ans=0.2
2023-06-19 02:08:34,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5
2023-06-19 02:08:39,337 INFO [train.py:996] (3/4) Epoch 2, batch 13150, loss[loss=0.289, simple_loss=0.346, pruned_loss=0.116, over 20936.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3496, pruned_loss=0.1099, over 4274712.50 frames. ], batch size: 608, lr: 1.69e-02, grad_scale: 32.0
2023-06-19 02:08:45,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=261870.0, ans=0.125
2023-06-19 02:08:46,862 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.844e+02 3.521e+02 4.321e+02 7.421e+02, threshold=7.042e+02, percent-clipped=2.0
2023-06-19 02:10:44,895 INFO [train.py:996] (3/4) Epoch 2, batch 13200, loss[loss=0.3225, simple_loss=0.3748, pruned_loss=0.1351, over 21936.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3461, pruned_loss=0.1085, over 4270031.90 frames. ], batch size: 372, lr: 1.69e-02, grad_scale: 32.0
2023-06-19 02:11:27,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262230.0, ans=0.1
2023-06-19 02:12:06,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=262290.0, ans=10.0
2023-06-19 02:12:43,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=262410.0, ans=0.04949747468305833
2023-06-19 02:12:59,890 INFO [train.py:996] (3/4) Epoch 2, batch 13250, loss[loss=0.2653, simple_loss=0.3314, pruned_loss=0.09961, over 21789.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3468, pruned_loss=0.1115, over 4277025.98 frames. ], batch size: 124, lr: 1.69e-02, grad_scale: 32.0
2023-06-19 02:13:09,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.771e+02 3.301e+02 4.204e+02 7.419e+02, threshold=6.603e+02, percent-clipped=1.0
2023-06-19 02:13:21,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. limit=6.0
2023-06-19 02:13:34,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=262530.0, ans=0.2
2023-06-19 02:13:58,199 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:14:47,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=262650.0, ans=0.0
2023-06-19 02:15:30,032 INFO [train.py:996] (3/4) Epoch 2, batch 13300, loss[loss=0.3078, simple_loss=0.3772, pruned_loss=0.1192, over 21755.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3509, pruned_loss=0.1118, over 4271423.74 frames. ], batch size: 332, lr: 1.69e-02, grad_scale: 32.0
2023-06-19 02:15:30,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=262770.0, ans=0.125
2023-06-19 02:16:04,619 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:17:27,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=263010.0, ans=0.07
2023-06-19 02:17:43,400 INFO [train.py:996] (3/4) Epoch 2, batch 13350, loss[loss=0.2873, simple_loss=0.361, pruned_loss=0.1068, over 21798.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3544, pruned_loss=0.1151, over 4275534.40 frames. ], batch size: 282, lr: 1.69e-02, grad_scale: 32.0
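The lr field decays slowly across this stretch, from 1.74e-02 near batch 10700 to 1.67e-02 by batch 14000, while batch_count in the scaling.py entries advances from about 247k to 267k. A smooth joint decay in batch index and epoch, in the style of icefall's Eden schedule, has this shape; the sketch below assumes that form and uses placeholder constants, so it will not reproduce the logged values exactly.

def eden_lr(base_lr, batch, epoch, lr_batches=5000.0, lr_epochs=2.0):
    # Assumed Eden-style schedule: power-law decay in both the global
    # batch index and the (fractional) epoch. Constants are placeholders.
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# the decay is gentle at this depth into training, as in the log:
for batch in (250_000, 255_000, 260_000, 265_000):
    print(batch, round(eden_lr(0.05, batch, epoch=2.0), 6))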
2023-06-19 02:18:00,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.775e+02 3.449e+02 3.875e+02 6.020e+02, threshold=6.898e+02, percent-clipped=0.0
2023-06-19 02:18:24,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=263130.0, ans=0.125
2023-06-19 02:19:16,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=263250.0, ans=0.125
2023-06-19 02:19:21,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=263250.0, ans=0.02
2023-06-19 02:19:44,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.29 vs. limit=15.0
2023-06-19 02:20:06,313 INFO [train.py:996] (3/4) Epoch 2, batch 13400, loss[loss=0.29, simple_loss=0.3473, pruned_loss=0.1164, over 21667.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3561, pruned_loss=0.1174, over 4284335.16 frames. ], batch size: 263, lr: 1.68e-02, grad_scale: 32.0
2023-06-19 02:20:16,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=263370.0, ans=0.125
2023-06-19 02:20:22,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=263370.0, ans=0.0
2023-06-19 02:21:11,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=263490.0, ans=0.2
2023-06-19 02:22:32,121 INFO [train.py:996] (3/4) Epoch 2, batch 13450, loss[loss=0.2568, simple_loss=0.3212, pruned_loss=0.09619, over 21668.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3598, pruned_loss=0.1209, over 4286285.68 frames. ], batch size: 298, lr: 1.68e-02, grad_scale: 32.0
2023-06-19 02:22:39,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.392e+02 3.379e+02 3.829e+02 4.469e+02 7.112e+02, threshold=7.658e+02, percent-clipped=1.0
2023-06-19 02:22:50,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=263730.0, ans=0.125
2023-06-19 02:23:21,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=263790.0, ans=0.125
2023-06-19 02:23:44,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=263790.0, ans=0.125
2023-06-19 02:23:49,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=263850.0, ans=0.125
2023-06-19 02:24:10,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=263850.0, ans=0.125
2023-06-19 02:24:30,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=263910.0, ans=0.125
2023-06-19 02:24:37,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=263910.0, ans=0.2
2023-06-19 02:24:40,494 INFO [train.py:996] (3/4) Epoch 2, batch 13500, loss[loss=0.2247, simple_loss=0.279, pruned_loss=0.08519, over 21429.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3473, pruned_loss=0.1151, over 4277780.01 frames. ], batch size: 211, lr: 1.68e-02, grad_scale: 32.0
2023-06-19 02:24:54,003 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:25:42,554 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:25:42,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=264090.0, ans=0.125
2023-06-19 02:26:50,169 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.51 vs. limit=10.0
2023-06-19 02:26:50,569 INFO [train.py:996] (3/4) Epoch 2, batch 13550, loss[loss=0.2545, simple_loss=0.3427, pruned_loss=0.08314, over 21453.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3522, pruned_loss=0.1147, over 4274621.05 frames. ], batch size: 194, lr: 1.68e-02, grad_scale: 32.0
2023-06-19 02:27:04,664 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.944e+02 3.381e+02 4.408e+02 7.046e+02, threshold=6.762e+02, percent-clipped=0.0
2023-06-19 02:27:29,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=264330.0, ans=0.125
2023-06-19 02:27:35,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=264330.0, ans=0.035
2023-06-19 02:28:12,919 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:28:15,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=264450.0, ans=0.125
2023-06-19 02:28:19,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=264450.0, ans=0.0
2023-06-19 02:29:05,567 INFO [train.py:996] (3/4) Epoch 2, batch 13600, loss[loss=0.2505, simple_loss=0.3183, pruned_loss=0.09137, over 21750.00 frames. ], tot_loss[loss=0.2913, simple_loss=0.3533, pruned_loss=0.1146, over 4273176.82 frames. ], batch size: 247, lr: 1.68e-02, grad_scale: 32.0
2023-06-19 02:30:22,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=264690.0, ans=0.125
2023-06-19 02:30:33,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=264690.0, ans=0.125
2023-06-19 02:30:45,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=264750.0, ans=0.125
2023-06-19 02:30:45,669 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0
2023-06-19 02:30:47,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264750.0, ans=0.1
2023-06-19 02:30:58,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=264750.0, ans=10.0
2023-06-19 02:30:59,430 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:31:26,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=264810.0, ans=0.0
2023-06-19 02:31:29,055 INFO [train.py:996] (3/4) Epoch 2, batch 13650, loss[loss=0.2548, simple_loss=0.31, pruned_loss=0.09981, over 21876.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3478, pruned_loss=0.111, over 4274070.02 frames. ], batch size: 118, lr: 1.68e-02, grad_scale: 32.0
2023-06-19 02:31:43,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.727e+02 3.110e+02 3.674e+02 7.098e+02, threshold=6.220e+02, percent-clipped=1.0
2023-06-19 02:32:02,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=22.5
2023-06-19 02:32:25,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264990.0, ans=0.1
2023-06-19 02:33:10,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265050.0, ans=0.1
2023-06-19 02:33:32,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=265110.0, ans=0.125
2023-06-19 02:33:35,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0
2023-06-19 02:33:48,006 INFO [train.py:996] (3/4) Epoch 2, batch 13700, loss[loss=0.2852, simple_loss=0.3205, pruned_loss=0.1249, over 20239.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3414, pruned_loss=0.1107, over 4277535.89 frames. ], batch size: 703, lr: 1.68e-02, grad_scale: 32.0
2023-06-19 02:34:23,480 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0
2023-06-19 02:34:33,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=265230.0, ans=0.035
2023-06-19 02:34:38,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=265230.0, ans=22.5
2023-06-19 02:34:46,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=265290.0, ans=0.125
2023-06-19 02:34:59,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=265290.0, ans=0.125
2023-06-19 02:35:03,847 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:35:47,712 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:36:07,353 INFO [train.py:996] (3/4) Epoch 2, batch 13750, loss[loss=0.2382, simple_loss=0.2912, pruned_loss=0.09265, over 21170.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3378, pruned_loss=0.1094, over 4266405.06 frames. ], batch size: 143, lr: 1.68e-02, grad_scale: 32.0
2023-06-19 02:36:07,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=265470.0, ans=0.0
2023-06-19 02:36:26,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.182e+02 3.904e+02 5.015e+02 8.772e+02, threshold=7.809e+02, percent-clipped=9.0
2023-06-19 02:37:24,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=265590.0, ans=0.125
2023-06-19 02:37:26,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0
2023-06-19 02:38:42,202 INFO [train.py:996] (3/4) Epoch 2, batch 13800, loss[loss=0.2485, simple_loss=0.3206, pruned_loss=0.0882, over 21092.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3463, pruned_loss=0.1099, over 4256151.20 frames. ], batch size: 143, lr: 1.68e-02, grad_scale: 32.0
2023-06-19 02:38:58,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=265770.0, ans=12.0
2023-06-19 02:39:00,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265770.0, ans=0.1
2023-06-19 02:39:02,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=22.5
2023-06-19 02:40:24,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=265950.0, ans=0.125
2023-06-19 02:40:58,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=266070.0, ans=0.0
2023-06-19 02:40:59,930 INFO [train.py:996] (3/4) Epoch 2, batch 13850, loss[loss=0.3285, simple_loss=0.3891, pruned_loss=0.134, over 21741.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3518, pruned_loss=0.1106, over 4265588.54 frames. ], batch size: 332, lr: 1.68e-02, grad_scale: 32.0
], batch size: 332, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:41:38,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.862e+02 3.371e+02 4.001e+02 6.906e+02, threshold=6.742e+02, percent-clipped=0.0 2023-06-19 02:41:50,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-19 02:41:53,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266130.0, ans=0.1 2023-06-19 02:43:32,054 INFO [train.py:996] (3/4) Epoch 2, batch 13900, loss[loss=0.3043, simple_loss=0.3544, pruned_loss=0.1271, over 21816.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3569, pruned_loss=0.1157, over 4267689.93 frames. ], batch size: 414, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:43:49,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266370.0, ans=0.1 2023-06-19 02:43:53,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266370.0, ans=0.1 2023-06-19 02:44:06,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-19 02:44:13,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=266430.0, ans=0.125 2023-06-19 02:44:55,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=266550.0, ans=0.125 2023-06-19 02:45:07,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-19 02:45:14,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.31 vs. limit=22.5 2023-06-19 02:45:45,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=266610.0, ans=0.125 2023-06-19 02:45:48,057 INFO [train.py:996] (3/4) Epoch 2, batch 13950, loss[loss=0.2944, simple_loss=0.3517, pruned_loss=0.1185, over 21866.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3584, pruned_loss=0.118, over 4278561.07 frames. ], batch size: 351, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:46:01,266 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 3.266e+02 3.802e+02 5.286e+02 1.041e+03, threshold=7.604e+02, percent-clipped=11.0 2023-06-19 02:46:47,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266790.0, ans=0.1 2023-06-19 02:47:42,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=12.0 2023-06-19 02:48:10,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=266910.0, ans=0.0 2023-06-19 02:48:21,250 INFO [train.py:996] (3/4) Epoch 2, batch 14000, loss[loss=0.2556, simple_loss=0.3375, pruned_loss=0.0868, over 21672.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3522, pruned_loss=0.1138, over 4278096.95 frames. 
], batch size: 247, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:48:57,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267090.0, ans=0.1 2023-06-19 02:49:17,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=267090.0, ans=0.125 2023-06-19 02:49:37,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=267150.0, ans=0.0 2023-06-19 02:50:17,480 INFO [train.py:996] (3/4) Epoch 2, batch 14050, loss[loss=0.2546, simple_loss=0.3055, pruned_loss=0.1018, over 21651.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3455, pruned_loss=0.1092, over 4288234.37 frames. ], batch size: 298, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:50:24,948 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 2.535e+02 3.106e+02 3.697e+02 6.124e+02, threshold=6.211e+02, percent-clipped=0.0 2023-06-19 02:51:01,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267330.0, ans=0.1 2023-06-19 02:51:15,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=267390.0, ans=0.2 2023-06-19 02:51:20,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=267390.0, ans=0.95 2023-06-19 02:51:23,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=267390.0, ans=0.2 2023-06-19 02:51:39,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=267450.0, ans=0.0 2023-06-19 02:52:10,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=267510.0, ans=0.0 2023-06-19 02:52:24,970 INFO [train.py:996] (3/4) Epoch 2, batch 14100, loss[loss=0.2418, simple_loss=0.299, pruned_loss=0.09227, over 21338.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3392, pruned_loss=0.1091, over 4284138.78 frames. ], batch size: 131, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:52:34,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267570.0, ans=0.1 2023-06-19 02:52:41,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=267630.0, ans=0.125 2023-06-19 02:54:11,523 INFO [train.py:996] (3/4) Epoch 2, batch 14150, loss[loss=0.2557, simple_loss=0.3391, pruned_loss=0.08611, over 21753.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3441, pruned_loss=0.1109, over 4277437.35 frames. 
], batch size: 332, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:54:23,830 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.842e+02 3.231e+02 3.958e+02 7.482e+02, threshold=6.462e+02, percent-clipped=1.0 2023-06-19 02:55:00,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=267990.0, ans=0.125 2023-06-19 02:55:25,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=268050.0, ans=0.0 2023-06-19 02:56:02,057 INFO [train.py:996] (3/4) Epoch 2, batch 14200, loss[loss=0.2804, simple_loss=0.3376, pruned_loss=0.1115, over 21862.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3406, pruned_loss=0.1073, over 4280917.88 frames. ], batch size: 118, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:56:36,793 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:57:01,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=268290.0, ans=0.0 2023-06-19 02:57:33,350 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.47 vs. limit=15.0 2023-06-19 02:57:55,374 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:58:11,568 INFO [train.py:996] (3/4) Epoch 2, batch 14250, loss[loss=0.2579, simple_loss=0.3008, pruned_loss=0.1075, over 21474.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3346, pruned_loss=0.1063, over 4272791.27 frames. ], batch size: 230, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:58:25,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 2.584e+02 3.251e+02 4.167e+02 7.132e+02, threshold=6.503e+02, percent-clipped=3.0 2023-06-19 02:59:25,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=268650.0, ans=0.0 2023-06-19 03:00:06,460 INFO [train.py:996] (3/4) Epoch 2, batch 14300, loss[loss=0.3242, simple_loss=0.3918, pruned_loss=0.1283, over 21400.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.341, pruned_loss=0.1072, over 4282090.68 frames. ], batch size: 211, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:00:10,594 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.14 vs. limit=10.0 2023-06-19 03:00:53,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=268830.0, ans=0.125 2023-06-19 03:00:59,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=268830.0, ans=0.0 2023-06-19 03:01:29,203 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-06-19 03:02:34,012 INFO [train.py:996] (3/4) Epoch 2, batch 14350, loss[loss=0.2984, simple_loss=0.3551, pruned_loss=0.1209, over 21712.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3476, pruned_loss=0.1087, over 4258653.40 frames. 
], batch size: 389, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:02:54,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.713e+02 3.120e+02 4.327e+02 9.558e+02, threshold=6.239e+02, percent-clipped=7.0 2023-06-19 03:02:57,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=269070.0, ans=0.125 2023-06-19 03:03:12,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=269130.0, ans=0.07 2023-06-19 03:04:36,236 INFO [train.py:996] (3/4) Epoch 2, batch 14400, loss[loss=0.2814, simple_loss=0.3308, pruned_loss=0.116, over 21816.00 frames. ], tot_loss[loss=0.283, simple_loss=0.345, pruned_loss=0.1105, over 4265537.32 frames. ], batch size: 351, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:06:32,267 INFO [train.py:996] (3/4) Epoch 2, batch 14450, loss[loss=0.2333, simple_loss=0.2885, pruned_loss=0.08903, over 21576.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.339, pruned_loss=0.1102, over 4269493.05 frames. ], batch size: 231, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:06:46,537 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.837e+02 3.414e+02 4.190e+02 7.999e+02, threshold=6.829e+02, percent-clipped=4.0 2023-06-19 03:07:08,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=269730.0, ans=0.125 2023-06-19 03:07:24,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=269790.0, ans=0.0 2023-06-19 03:08:32,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=269910.0, ans=0.125 2023-06-19 03:08:35,928 INFO [train.py:996] (3/4) Epoch 2, batch 14500, loss[loss=0.2884, simple_loss=0.3557, pruned_loss=0.1105, over 21828.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3357, pruned_loss=0.1092, over 4270646.37 frames. ], batch size: 371, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:08:58,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-19 03:09:14,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=270030.0, ans=0.0 2023-06-19 03:09:43,001 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=12.0 2023-06-19 03:09:52,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=270150.0, ans=0.125 2023-06-19 03:10:34,022 INFO [train.py:996] (3/4) Epoch 2, batch 14550, loss[loss=0.3457, simple_loss=0.4034, pruned_loss=0.144, over 21364.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3405, pruned_loss=0.1113, over 4267572.28 frames. 
], batch size: 176, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:10:55,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.690e+02 3.116e+02 3.745e+02 6.340e+02, threshold=6.231e+02, percent-clipped=0.0 2023-06-19 03:11:10,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=270330.0, ans=0.05 2023-06-19 03:12:08,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=270450.0, ans=0.0 2023-06-19 03:13:00,815 INFO [train.py:996] (3/4) Epoch 2, batch 14600, loss[loss=0.2999, simple_loss=0.3791, pruned_loss=0.1103, over 21792.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.3492, pruned_loss=0.1172, over 4262971.07 frames. ], batch size: 282, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:13:23,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-19 03:15:08,361 INFO [train.py:996] (3/4) Epoch 2, batch 14650, loss[loss=0.2291, simple_loss=0.313, pruned_loss=0.07265, over 21799.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3483, pruned_loss=0.1137, over 4258601.53 frames. ], batch size: 371, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:15:28,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.832e+02 3.402e+02 3.897e+02 6.395e+02, threshold=6.804e+02, percent-clipped=2.0 2023-06-19 03:15:37,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=270930.0, ans=0.0 2023-06-19 03:16:52,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=271050.0, ans=0.2 2023-06-19 03:17:07,956 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-19 03:17:32,068 INFO [train.py:996] (3/4) Epoch 2, batch 14700, loss[loss=0.2479, simple_loss=0.3152, pruned_loss=0.0903, over 21798.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3443, pruned_loss=0.1081, over 4261676.52 frames. ], batch size: 124, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:18:12,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=271230.0, ans=0.0 2023-06-19 03:19:09,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=271350.0, ans=0.2 2023-06-19 03:19:38,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=271470.0, ans=0.0 2023-06-19 03:19:41,506 INFO [train.py:996] (3/4) Epoch 2, batch 14750, loss[loss=0.5185, simple_loss=0.5458, pruned_loss=0.2457, over 21430.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3512, pruned_loss=0.111, over 4265318.11 frames. 
], batch size: 507, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:20:21,830 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 2.730e+02 3.490e+02 4.322e+02 1.005e+03, threshold=6.981e+02, percent-clipped=5.0 2023-06-19 03:20:31,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271530.0, ans=0.1 2023-06-19 03:21:08,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=271650.0, ans=0.125 2023-06-19 03:21:21,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=271650.0, ans=0.0 2023-06-19 03:22:05,207 INFO [train.py:996] (3/4) Epoch 2, batch 14800, loss[loss=0.2644, simple_loss=0.3252, pruned_loss=0.1018, over 21734.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3607, pruned_loss=0.1169, over 4263319.77 frames. ], batch size: 124, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:22:14,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=271770.0, ans=0.2 2023-06-19 03:23:34,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=271950.0, ans=0.0 2023-06-19 03:24:08,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=272010.0, ans=0.125 2023-06-19 03:24:36,335 INFO [train.py:996] (3/4) Epoch 2, batch 14850, loss[loss=0.2784, simple_loss=0.3436, pruned_loss=0.1066, over 20798.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3539, pruned_loss=0.1162, over 4259911.31 frames. ], batch size: 608, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:24:40,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272070.0, ans=0.1 2023-06-19 03:24:45,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.062e+02 3.650e+02 4.413e+02 7.562e+02, threshold=7.301e+02, percent-clipped=4.0 2023-06-19 03:25:03,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=272130.0, ans=0.125 2023-06-19 03:25:39,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0 2023-06-19 03:26:48,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=272370.0, ans=0.2 2023-06-19 03:26:49,815 INFO [train.py:996] (3/4) Epoch 2, batch 14900, loss[loss=0.289, simple_loss=0.3604, pruned_loss=0.1088, over 21775.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3553, pruned_loss=0.1177, over 4257573.40 frames. ], batch size: 124, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:27:25,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.52 vs. 
limit=15.0 2023-06-19 03:27:30,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=272430.0, ans=0.125 2023-06-19 03:27:47,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=272490.0, ans=0.125 2023-06-19 03:27:52,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=272490.0, ans=0.125 2023-06-19 03:28:35,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=272610.0, ans=0.125 2023-06-19 03:29:10,899 INFO [train.py:996] (3/4) Epoch 2, batch 14950, loss[loss=0.2871, simple_loss=0.3576, pruned_loss=0.1082, over 21613.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3547, pruned_loss=0.1166, over 4257516.92 frames. ], batch size: 263, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:29:25,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.792e+02 3.355e+02 4.143e+02 6.575e+02, threshold=6.711e+02, percent-clipped=0.0 2023-06-19 03:31:17,962 INFO [train.py:996] (3/4) Epoch 2, batch 15000, loss[loss=0.3292, simple_loss=0.3769, pruned_loss=0.1408, over 21775.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3572, pruned_loss=0.1177, over 4264101.93 frames. ], batch size: 441, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:31:17,963 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 03:32:09,030 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.272, simple_loss=0.3679, pruned_loss=0.08803, over 1796401.00 frames. 2023-06-19 03:32:09,030 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-19 03:32:57,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=273090.0, ans=0.125 2023-06-19 03:32:58,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=273090.0, ans=0.125 2023-06-19 03:32:58,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=273090.0, ans=0.0 2023-06-19 03:33:08,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273150.0, ans=0.1 2023-06-19 03:34:11,070 INFO [train.py:996] (3/4) Epoch 2, batch 15050, loss[loss=0.2742, simple_loss=0.3618, pruned_loss=0.09335, over 21857.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3585, pruned_loss=0.1184, over 4261009.53 frames. ], batch size: 316, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:34:21,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=273270.0, ans=0.125 2023-06-19 03:34:21,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=273270.0, ans=0.125 2023-06-19 03:34:28,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.105e+02 3.767e+02 5.029e+02 7.583e+02, threshold=7.535e+02, percent-clipped=4.0 2023-06-19 03:35:49,974 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.75 vs. 
limit=15.0 2023-06-19 03:36:22,093 INFO [train.py:996] (3/4) Epoch 2, batch 15100, loss[loss=0.372, simple_loss=0.4156, pruned_loss=0.1642, over 21778.00 frames. ], tot_loss[loss=0.2985, simple_loss=0.3611, pruned_loss=0.1179, over 4266370.13 frames. ], batch size: 441, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:37:19,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=273690.0, ans=0.0 2023-06-19 03:37:55,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.92 vs. limit=10.0 2023-06-19 03:38:07,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=273810.0, ans=0.125 2023-06-19 03:38:41,650 INFO [train.py:996] (3/4) Epoch 2, batch 15150, loss[loss=0.2637, simple_loss=0.3078, pruned_loss=0.1098, over 21659.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3568, pruned_loss=0.1181, over 4266542.33 frames. ], batch size: 231, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:38:42,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=273870.0, ans=0.125 2023-06-19 03:38:56,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.943e+02 3.299e+02 3.908e+02 6.468e+02, threshold=6.598e+02, percent-clipped=0.0 2023-06-19 03:39:04,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=273930.0, ans=0.125 2023-06-19 03:39:07,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.76 vs. limit=15.0 2023-06-19 03:39:54,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=274050.0, ans=0.125 2023-06-19 03:39:56,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=274050.0, ans=0.125 2023-06-19 03:40:28,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=274110.0, ans=0.07 2023-06-19 03:40:37,432 INFO [train.py:996] (3/4) Epoch 2, batch 15200, loss[loss=0.2567, simple_loss=0.321, pruned_loss=0.09621, over 21826.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.347, pruned_loss=0.1137, over 4265887.40 frames. ], batch size: 317, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:42:28,463 INFO [train.py:996] (3/4) Epoch 2, batch 15250, loss[loss=0.2909, simple_loss=0.3428, pruned_loss=0.1195, over 21863.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3413, pruned_loss=0.1123, over 4273892.01 frames. 
], batch size: 98, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:42:42,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.617e+02 3.072e+02 3.337e+02 6.038e+02, threshold=6.144e+02, percent-clipped=0.0 2023-06-19 03:42:46,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=274470.0, ans=0.125 2023-06-19 03:43:09,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=274530.0, ans=0.0 2023-06-19 03:43:48,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=274590.0, ans=0.125 2023-06-19 03:43:49,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=274590.0, ans=0.125 2023-06-19 03:44:09,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=274650.0, ans=0.0 2023-06-19 03:44:15,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=274710.0, ans=0.0 2023-06-19 03:44:37,234 INFO [train.py:996] (3/4) Epoch 2, batch 15300, loss[loss=0.3645, simple_loss=0.397, pruned_loss=0.166, over 21837.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3467, pruned_loss=0.1161, over 4260078.92 frames. ], batch size: 441, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:46:10,958 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-06-19 03:46:38,760 INFO [train.py:996] (3/4) Epoch 2, batch 15350, loss[loss=0.2815, simple_loss=0.3677, pruned_loss=0.09769, over 21803.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3533, pruned_loss=0.1186, over 4266687.86 frames. ], batch size: 332, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:46:56,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=275070.0, ans=0.0 2023-06-19 03:47:11,413 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 3.136e+02 3.689e+02 4.822e+02 8.057e+02, threshold=7.379e+02, percent-clipped=7.0 2023-06-19 03:47:33,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=275130.0, ans=0.0 2023-06-19 03:48:14,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=275250.0, ans=0.125 2023-06-19 03:48:14,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=275250.0, ans=0.125 2023-06-19 03:48:51,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=275370.0, ans=0.0 2023-06-19 03:48:52,823 INFO [train.py:996] (3/4) Epoch 2, batch 15400, loss[loss=0.2904, simple_loss=0.344, pruned_loss=0.1184, over 21869.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.354, pruned_loss=0.1157, over 4255187.20 frames. 
], batch size: 118, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:49:56,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=275490.0, ans=0.125 2023-06-19 03:50:09,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=275550.0, ans=0.0 2023-06-19 03:50:42,046 INFO [train.py:996] (3/4) Epoch 2, batch 15450, loss[loss=0.255, simple_loss=0.3125, pruned_loss=0.09877, over 21260.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3495, pruned_loss=0.1132, over 4266368.63 frames. ], batch size: 159, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:50:42,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=275670.0, ans=0.0 2023-06-19 03:51:14,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.540e+02 3.062e+02 3.855e+02 5.645e+02, threshold=6.124e+02, percent-clipped=0.0 2023-06-19 03:51:25,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=275730.0, ans=0.0 2023-06-19 03:51:26,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=275730.0, ans=0.0 2023-06-19 03:52:30,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=275850.0, ans=0.2 2023-06-19 03:53:13,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=12.0 2023-06-19 03:53:13,393 INFO [train.py:996] (3/4) Epoch 2, batch 15500, loss[loss=0.3085, simple_loss=0.3682, pruned_loss=0.1244, over 21651.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3521, pruned_loss=0.1141, over 4259300.27 frames. ], batch size: 263, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:54:53,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=276150.0, ans=0.125 2023-06-19 03:54:54,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=276150.0, ans=0.025 2023-06-19 03:54:56,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=276210.0, ans=0.07 2023-06-19 03:55:28,550 INFO [train.py:996] (3/4) Epoch 2, batch 15550, loss[loss=0.3138, simple_loss=0.3615, pruned_loss=0.133, over 21545.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3491, pruned_loss=0.1116, over 4262746.72 frames. ], batch size: 441, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:55:38,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=276270.0, ans=0.2 2023-06-19 03:55:49,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.861e+02 3.609e+02 4.664e+02 1.166e+03, threshold=7.218e+02, percent-clipped=12.0 2023-06-19 03:56:06,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=276330.0, ans=0.2 2023-06-19 03:56:11,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. 
limit=6.0 2023-06-19 03:56:35,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=276390.0, ans=0.05 2023-06-19 03:57:09,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=276450.0, ans=0.125 2023-06-19 03:57:36,356 INFO [train.py:996] (3/4) Epoch 2, batch 15600, loss[loss=0.2636, simple_loss=0.3122, pruned_loss=0.1075, over 21392.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3417, pruned_loss=0.1092, over 4256225.45 frames. ], batch size: 194, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 03:58:06,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=276570.0, ans=0.125 2023-06-19 03:58:32,187 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-19 03:58:40,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=276630.0, ans=0.125 2023-06-19 03:59:38,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-19 03:59:39,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=276810.0, ans=0.0 2023-06-19 04:00:05,932 INFO [train.py:996] (3/4) Epoch 2, batch 15650, loss[loss=0.2686, simple_loss=0.3138, pruned_loss=0.1117, over 21782.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3407, pruned_loss=0.109, over 4257597.69 frames. ], batch size: 112, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:00:14,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.975e+02 3.392e+02 4.478e+02 7.977e+02, threshold=6.785e+02, percent-clipped=4.0 2023-06-19 04:01:27,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=277050.0, ans=0.125 2023-06-19 04:02:03,676 INFO [train.py:996] (3/4) Epoch 2, batch 15700, loss[loss=0.2353, simple_loss=0.291, pruned_loss=0.08982, over 21607.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3378, pruned_loss=0.108, over 4255960.03 frames. ], batch size: 247, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:02:13,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=277170.0, ans=0.125 2023-06-19 04:02:27,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=277170.0, ans=0.125 2023-06-19 04:02:36,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=277230.0, ans=0.1 2023-06-19 04:03:57,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=277410.0, ans=0.0 2023-06-19 04:04:04,414 INFO [train.py:996] (3/4) Epoch 2, batch 15750, loss[loss=0.2412, simple_loss=0.2999, pruned_loss=0.0913, over 21324.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3318, pruned_loss=0.1058, over 4252392.30 frames. 
], batch size: 211, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:04:22,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.529e+02 2.886e+02 3.398e+02 5.891e+02, threshold=5.773e+02, percent-clipped=0.0 2023-06-19 04:05:09,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=277590.0, ans=0.125 2023-06-19 04:05:09,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277590.0, ans=0.1 2023-06-19 04:05:24,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=277650.0, ans=0.0 2023-06-19 04:06:12,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=277710.0, ans=0.05 2023-06-19 04:06:12,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=277710.0, ans=0.0 2023-06-19 04:06:19,590 INFO [train.py:996] (3/4) Epoch 2, batch 15800, loss[loss=0.2734, simple_loss=0.3199, pruned_loss=0.1134, over 21874.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3275, pruned_loss=0.1053, over 4259498.00 frames. ], batch size: 373, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:08:40,431 INFO [train.py:996] (3/4) Epoch 2, batch 15850, loss[loss=0.2998, simple_loss=0.3539, pruned_loss=0.1228, over 21702.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3325, pruned_loss=0.1092, over 4260646.65 frames. ], batch size: 351, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:08:49,297 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.859e+02 3.359e+02 4.336e+02 6.556e+02, threshold=6.719e+02, percent-clipped=7.0 2023-06-19 04:09:30,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=278190.0, ans=0.125 2023-06-19 04:10:04,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-19 04:10:38,064 INFO [train.py:996] (3/4) Epoch 2, batch 15900, loss[loss=0.2474, simple_loss=0.3061, pruned_loss=0.09437, over 22021.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3334, pruned_loss=0.1096, over 4261035.95 frames. ], batch size: 103, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:11:05,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-19 04:12:30,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-19 04:12:36,282 INFO [train.py:996] (3/4) Epoch 2, batch 15950, loss[loss=0.2223, simple_loss=0.3202, pruned_loss=0.06215, over 21650.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3315, pruned_loss=0.1058, over 4266686.53 frames. 
], batch size: 389, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:13:02,147 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.565e+02 3.122e+02 3.960e+02 8.698e+02, threshold=6.245e+02, percent-clipped=1.0 2023-06-19 04:13:57,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=278790.0, ans=0.0 2023-06-19 04:15:07,374 INFO [train.py:996] (3/4) Epoch 2, batch 16000, loss[loss=0.2296, simple_loss=0.2972, pruned_loss=0.08101, over 21896.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3315, pruned_loss=0.1033, over 4234420.12 frames. ], batch size: 98, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:15:09,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=278970.0, ans=0.125 2023-06-19 04:15:43,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=279030.0, ans=0.0 2023-06-19 04:15:45,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-19 04:16:47,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=279150.0, ans=0.2 2023-06-19 04:17:13,455 INFO [train.py:996] (3/4) Epoch 2, batch 16050, loss[loss=0.2173, simple_loss=0.2999, pruned_loss=0.06735, over 21420.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3333, pruned_loss=0.1014, over 4238797.35 frames. ], batch size: 211, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:17:30,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 2.605e+02 3.348e+02 4.219e+02 7.021e+02, threshold=6.696e+02, percent-clipped=3.0 2023-06-19 04:19:19,762 INFO [train.py:996] (3/4) Epoch 2, batch 16100, loss[loss=0.2862, simple_loss=0.342, pruned_loss=0.1152, over 21328.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.338, pruned_loss=0.1035, over 4257252.44 frames. ], batch size: 176, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:19:21,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=279570.0, ans=0.07 2023-06-19 04:19:54,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=279630.0, ans=0.05 2023-06-19 04:20:21,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=279690.0, ans=0.125 2023-06-19 04:21:27,812 INFO [train.py:996] (3/4) Epoch 2, batch 16150, loss[loss=0.2773, simple_loss=0.3336, pruned_loss=0.1105, over 21954.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3386, pruned_loss=0.1066, over 4271849.35 frames. 
], batch size: 316, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:21:45,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=279870.0, ans=0.2 2023-06-19 04:21:57,623 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.251e+02 4.179e+02 4.916e+02 8.388e+02, threshold=8.358e+02, percent-clipped=5.0 2023-06-19 04:22:31,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=279990.0, ans=0.1 2023-06-19 04:22:54,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=280050.0, ans=0.125 2023-06-19 04:23:20,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=280110.0, ans=0.2 2023-06-19 04:23:30,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=280110.0, ans=0.0 2023-06-19 04:23:36,114 INFO [train.py:996] (3/4) Epoch 2, batch 16200, loss[loss=0.3333, simple_loss=0.3942, pruned_loss=0.1362, over 21647.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3434, pruned_loss=0.1088, over 4276917.14 frames. ], batch size: 389, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:23:54,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=280170.0, ans=0.125 2023-06-19 04:24:30,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280230.0, ans=0.1 2023-06-19 04:25:01,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=280350.0, ans=0.0 2023-06-19 04:25:53,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-19 04:25:56,551 INFO [train.py:996] (3/4) Epoch 2, batch 16250, loss[loss=0.247, simple_loss=0.3152, pruned_loss=0.08941, over 21669.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3444, pruned_loss=0.109, over 4279978.51 frames. 
], batch size: 391, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:25:58,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=280470.0, ans=0.015 2023-06-19 04:26:10,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=280470.0, ans=0.125 2023-06-19 04:26:14,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.660e+02 3.048e+02 3.476e+02 5.105e+02, threshold=6.097e+02, percent-clipped=0.0 2023-06-19 04:26:38,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280530.0, ans=0.1 2023-06-19 04:26:38,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=280530.0, ans=10.0 2023-06-19 04:27:00,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=280650.0, ans=0.125 2023-06-19 04:27:15,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=280710.0, ans=0.0 2023-06-19 04:27:46,232 INFO [train.py:996] (3/4) Epoch 2, batch 16300, loss[loss=0.2418, simple_loss=0.3262, pruned_loss=0.07871, over 21622.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3373, pruned_loss=0.1033, over 4277410.21 frames. ], batch size: 389, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:27:46,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280770.0, ans=0.1 2023-06-19 04:28:45,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.92 vs. limit=6.0 2023-06-19 04:29:47,571 INFO [train.py:996] (3/4) Epoch 2, batch 16350, loss[loss=0.3025, simple_loss=0.4099, pruned_loss=0.09753, over 20833.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3384, pruned_loss=0.1044, over 4263924.59 frames. ], batch size: 608, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:30:23,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.593e+02 3.131e+02 3.921e+02 6.968e+02, threshold=6.263e+02, percent-clipped=2.0 2023-06-19 04:30:24,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=281070.0, ans=0.2 2023-06-19 04:30:36,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281130.0, ans=0.1 2023-06-19 04:31:23,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=281250.0, ans=0.025 2023-06-19 04:32:19,385 INFO [train.py:996] (3/4) Epoch 2, batch 16400, loss[loss=0.2959, simple_loss=0.3477, pruned_loss=0.1221, over 19980.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3432, pruned_loss=0.1064, over 4260138.35 frames. ], batch size: 702, lr: 1.63e-02, grad_scale: 32.0 2023-06-19 04:34:06,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=281610.0, ans=0.05 2023-06-19 04:34:11,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. 
limit=15.0 2023-06-19 04:34:38,583 INFO [train.py:996] (3/4) Epoch 2, batch 16450, loss[loss=0.2546, simple_loss=0.3104, pruned_loss=0.09936, over 21459.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3419, pruned_loss=0.1075, over 4271487.66 frames. ], batch size: 194, lr: 1.63e-02, grad_scale: 32.0 2023-06-19 04:34:56,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.650e+02 3.270e+02 4.334e+02 6.545e+02, threshold=6.541e+02, percent-clipped=2.0 2023-06-19 04:35:10,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=281730.0, ans=0.04949747468305833 2023-06-19 04:35:34,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=281850.0, ans=0.125 2023-06-19 04:36:34,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281910.0, ans=0.1 2023-06-19 04:36:35,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=281910.0, ans=0.035 2023-06-19 04:36:49,342 INFO [train.py:996] (3/4) Epoch 2, batch 16500, loss[loss=0.2166, simple_loss=0.2774, pruned_loss=0.07789, over 21632.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3406, pruned_loss=0.108, over 4268302.00 frames. ], batch size: 230, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:38:03,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=282090.0, ans=0.0 2023-06-19 04:39:16,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.75 vs. limit=10.0 2023-06-19 04:39:25,949 INFO [train.py:996] (3/4) Epoch 2, batch 16550, loss[loss=0.3072, simple_loss=0.3712, pruned_loss=0.1216, over 21312.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3413, pruned_loss=0.1068, over 4263684.86 frames. ], batch size: 548, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:39:38,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.903e+02 3.447e+02 4.239e+02 9.534e+02, threshold=6.894e+02, percent-clipped=2.0 2023-06-19 04:39:51,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=282330.0, ans=0.125 2023-06-19 04:39:56,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=282330.0, ans=0.125 2023-06-19 04:41:43,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=282510.0, ans=0.125 2023-06-19 04:41:46,296 INFO [train.py:996] (3/4) Epoch 2, batch 16600, loss[loss=0.3205, simple_loss=0.4119, pruned_loss=0.1145, over 21782.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3519, pruned_loss=0.1116, over 4267526.14 frames. 
], batch size: 282, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:41:51,087 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:42:11,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=282630.0, ans=0.0 2023-06-19 04:42:21,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=282630.0, ans=0.125 2023-06-19 04:43:36,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=282810.0, ans=0.0 2023-06-19 04:43:39,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=282810.0, ans=0.0 2023-06-19 04:43:39,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=282810.0, ans=0.025 2023-06-19 04:44:14,846 INFO [train.py:996] (3/4) Epoch 2, batch 16650, loss[loss=0.3145, simple_loss=0.3812, pruned_loss=0.1239, over 21381.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3593, pruned_loss=0.1144, over 4267036.18 frames. ], batch size: 176, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:44:28,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.764e+02 3.302e+02 3.864e+02 7.360e+02, threshold=6.604e+02, percent-clipped=1.0 2023-06-19 04:45:45,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=283050.0, ans=0.125 2023-06-19 04:46:22,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=283110.0, ans=0.125 2023-06-19 04:46:32,962 INFO [train.py:996] (3/4) Epoch 2, batch 16700, loss[loss=0.2425, simple_loss=0.2937, pruned_loss=0.09566, over 21821.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.357, pruned_loss=0.1138, over 4268021.54 frames. ], batch size: 124, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:47:12,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=283230.0, ans=0.0 2023-06-19 04:47:17,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=283230.0, ans=0.125 2023-06-19 04:47:17,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=283230.0, ans=0.125 2023-06-19 04:47:23,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=283290.0, ans=22.5 2023-06-19 04:48:10,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=283350.0, ans=0.125 2023-06-19 04:48:23,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=283350.0, ans=0.125 2023-06-19 04:48:26,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=15.0 2023-06-19 04:49:01,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=283410.0, ans=0.0 2023-06-19 04:49:06,320 INFO [train.py:996] (3/4) Epoch 2, batch 16750, loss[loss=0.3087, simple_loss=0.3795, pruned_loss=0.119, over 21747.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3618, pruned_loss=0.1181, over 4265324.91 frames. ], batch size: 332, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:49:32,336 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.141e+02 3.831e+02 4.600e+02 7.842e+02, threshold=7.663e+02, percent-clipped=1.0 2023-06-19 04:50:08,483 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-19 04:51:17,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=283710.0, ans=0.125 2023-06-19 04:51:32,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-19 04:51:37,205 INFO [train.py:996] (3/4) Epoch 2, batch 16800, loss[loss=0.2652, simple_loss=0.3293, pruned_loss=0.1005, over 21815.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3652, pruned_loss=0.1182, over 4269491.38 frames. ], batch size: 298, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:52:50,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-19 04:53:54,893 INFO [train.py:996] (3/4) Epoch 2, batch 16850, loss[loss=0.2615, simple_loss=0.3134, pruned_loss=0.1048, over 21587.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3611, pruned_loss=0.1182, over 4273414.84 frames. ], batch size: 212, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:54:16,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 3.232e+02 3.798e+02 4.468e+02 9.826e+02, threshold=7.596e+02, percent-clipped=4.0 2023-06-19 04:54:34,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=284130.0, ans=0.125 2023-06-19 04:54:48,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=284190.0, ans=0.2 2023-06-19 04:56:00,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=284310.0, ans=0.125 2023-06-19 04:56:08,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.93 vs. limit=15.0 2023-06-19 04:56:12,104 INFO [train.py:996] (3/4) Epoch 2, batch 16900, loss[loss=0.3088, simple_loss=0.343, pruned_loss=0.1372, over 20125.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3566, pruned_loss=0.1168, over 4278010.54 frames. 
], batch size: 707, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:56:41,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=284430.0, ans=0.2 2023-06-19 04:57:03,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=284490.0, ans=0.125 2023-06-19 04:58:33,328 INFO [train.py:996] (3/4) Epoch 2, batch 16950, loss[loss=0.2794, simple_loss=0.3331, pruned_loss=0.1128, over 21894.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3492, pruned_loss=0.1145, over 4277502.93 frames. ], batch size: 351, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:58:46,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.811e+02 3.359e+02 4.146e+02 6.829e+02, threshold=6.718e+02, percent-clipped=0.0 2023-06-19 04:59:48,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=284790.0, ans=0.125 2023-06-19 05:00:36,263 INFO [train.py:996] (3/4) Epoch 2, batch 17000, loss[loss=0.2953, simple_loss=0.347, pruned_loss=0.1218, over 21517.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3464, pruned_loss=0.1147, over 4283474.77 frames. ], batch size: 131, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:02:50,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285210.0, ans=0.1 2023-06-19 05:02:51,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0 2023-06-19 05:03:09,669 INFO [train.py:996] (3/4) Epoch 2, batch 17050, loss[loss=0.2896, simple_loss=0.3395, pruned_loss=0.1198, over 21198.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3529, pruned_loss=0.1169, over 4287715.72 frames. ], batch size: 608, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:03:28,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.955e+02 3.306e+02 4.122e+02 6.323e+02, threshold=6.612e+02, percent-clipped=0.0 2023-06-19 05:03:57,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-19 05:05:00,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=285510.0, ans=0.125 2023-06-19 05:05:04,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=285510.0, ans=0.0 2023-06-19 05:05:09,701 INFO [train.py:996] (3/4) Epoch 2, batch 17100, loss[loss=0.2806, simple_loss=0.3334, pruned_loss=0.1139, over 21723.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3518, pruned_loss=0.1169, over 4287959.32 frames. ], batch size: 230, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:05:10,721 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.19 vs. limit=12.0 2023-06-19 05:06:20,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=285690.0, ans=0.0 2023-06-19 05:06:26,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. 
limit=15.0 2023-06-19 05:06:41,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=285750.0, ans=0.0 2023-06-19 05:06:48,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=285750.0, ans=0.125 2023-06-19 05:07:26,619 INFO [train.py:996] (3/4) Epoch 2, batch 17150, loss[loss=0.2533, simple_loss=0.3034, pruned_loss=0.1016, over 21247.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3474, pruned_loss=0.1158, over 4285069.64 frames. ], batch size: 608, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:08:02,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.782e+02 3.187e+02 4.144e+02 6.035e+02, threshold=6.374e+02, percent-clipped=0.0 2023-06-19 05:08:08,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=285930.0, ans=0.035 2023-06-19 05:09:09,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. limit=10.0 2023-06-19 05:09:34,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=286170.0, ans=0.125 2023-06-19 05:09:35,272 INFO [train.py:996] (3/4) Epoch 2, batch 17200, loss[loss=0.3302, simple_loss=0.374, pruned_loss=0.1432, over 21864.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3468, pruned_loss=0.1159, over 4285045.29 frames. ], batch size: 371, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:10:15,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286230.0, ans=0.1 2023-06-19 05:11:25,695 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:12:03,954 INFO [train.py:996] (3/4) Epoch 2, batch 17250, loss[loss=0.3922, simple_loss=0.4355, pruned_loss=0.1745, over 21723.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3499, pruned_loss=0.1172, over 4283014.36 frames. ], batch size: 441, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:12:40,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=286530.0, ans=0.0 2023-06-19 05:12:46,967 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.005e+02 3.499e+02 4.330e+02 7.906e+02, threshold=6.999e+02, percent-clipped=5.0 2023-06-19 05:12:52,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286530.0, ans=0.1 2023-06-19 05:13:12,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=286590.0, ans=0.0 2023-06-19 05:14:18,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=286710.0, ans=0.0 2023-06-19 05:14:28,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=286710.0, ans=0.5
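
Each optim.py:471 record prints five quantiles of recently observed gradient norms (by appearance: min, 25%, median, 75%, max). In every such record in this section the threshold equals Clipping_scale times the logged median (2.0 * 3.499e+02 is within rounding of 6.999e+02 just above), and percent-clipped is the share of recent steps whose norm exceeded that threshold. A sketch of that bookkeeping over a plain sliding window; the real optimizer's buffering and reporting cadence may differ.

from collections import deque
import numpy as np

class GradNormTracker:
    """Median-based gradient-clipping diagnostic (sliding-window sketch)."""

    def __init__(self, clipping_scale=2.0, window=128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)   # recent gradient norms
        self.clipped = 0                    # steps clipped since last report
        self.seen = 0

    def update(self, grad_norm):
        """Record one step's gradient norm; return the current clip threshold."""
        self.norms.append(grad_norm)
        threshold = self.clipping_scale * float(np.median(self.norms))
        self.seen += 1
        self.clipped += grad_norm > threshold
        return threshold

    def report(self):
        q = np.percentile(list(self.norms), [0, 25, 50, 75, 100])
        line = ("Clipping_scale=%.1f, grad-norm quartiles %s, threshold=%.3e, "
                "percent-clipped=%.1f"
                % (self.clipping_scale,
                   " ".join("%.3e" % v for v in q),
                   self.clipping_scale * q[2],      # 2.0 x median, as in the log
                   100.0 * self.clipped / max(self.seen, 1)))
        self.clipped = self.seen = 0
        return line

2023-06-19 05:14:30,721 INFO [train.py:996] (3/4) Epoch 2, batch 17300, loss[loss=0.3215, simple_loss=0.3853, pruned_loss=0.1289, over 21766.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3598, pruned_loss=0.1223, over 4274630.85 frames.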
], batch size: 332, lr: 1.62e-02, grad_scale: 16.0 2023-06-19 05:15:21,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.18 vs. limit=6.0 2023-06-19 05:15:52,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.20 vs. limit=15.0 2023-06-19 05:16:16,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=286950.0, ans=0.125 2023-06-19 05:16:18,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=287010.0, ans=0.0 2023-06-19 05:16:33,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=287010.0, ans=0.0 2023-06-19 05:16:42,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=287010.0, ans=0.2 2023-06-19 05:17:10,024 INFO [train.py:996] (3/4) Epoch 2, batch 17350, loss[loss=0.2413, simple_loss=0.3253, pruned_loss=0.07865, over 21710.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3587, pruned_loss=0.1217, over 4267342.34 frames. ], batch size: 298, lr: 1.62e-02, grad_scale: 16.0 2023-06-19 05:17:17,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.53 vs. limit=22.5 2023-06-19 05:17:20,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=287070.0, ans=0.035 2023-06-19 05:17:31,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 3.184e+02 3.862e+02 4.907e+02 9.344e+02, threshold=7.725e+02, percent-clipped=8.0 2023-06-19 05:17:39,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=8.0 2023-06-19 05:17:45,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-19 05:17:51,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=287190.0, ans=0.125 2023-06-19 05:19:11,902 INFO [train.py:996] (3/4) Epoch 2, batch 17400, loss[loss=0.224, simple_loss=0.2792, pruned_loss=0.08436, over 21198.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.353, pruned_loss=0.1167, over 4265433.76 frames. ], batch size: 159, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:21:14,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=287610.0, ans=0.95 2023-06-19 05:21:27,434 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:21:39,828 INFO [train.py:996] (3/4) Epoch 2, batch 17450, loss[loss=0.2124, simple_loss=0.2931, pruned_loss=0.06586, over 21250.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3475, pruned_loss=0.112, over 4267559.42 frames. 
], batch size: 176, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:22:08,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=287670.0, ans=0.125 2023-06-19 05:22:16,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.812e+02 3.482e+02 4.322e+02 6.614e+02, threshold=6.964e+02, percent-clipped=0.0 2023-06-19 05:22:41,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=287730.0, ans=0.0 2023-06-19 05:23:30,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=22.5 2023-06-19 05:23:54,481 INFO [train.py:996] (3/4) Epoch 2, batch 17500, loss[loss=0.2546, simple_loss=0.3144, pruned_loss=0.09741, over 21819.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3433, pruned_loss=0.1092, over 4277486.00 frames. ], batch size: 282, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:24:12,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-19 05:25:41,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-19 05:25:56,524 INFO [train.py:996] (3/4) Epoch 2, batch 17550, loss[loss=0.2669, simple_loss=0.3442, pruned_loss=0.0948, over 21197.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3435, pruned_loss=0.107, over 4273434.44 frames. ], batch size: 159, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:26:05,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=288270.0, ans=0.2 2023-06-19 05:26:10,713 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 2.669e+02 3.304e+02 4.133e+02 8.142e+02, threshold=6.607e+02, percent-clipped=6.0 2023-06-19 05:27:37,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5 2023-06-19 05:27:53,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-19 05:27:57,422 INFO [train.py:996] (3/4) Epoch 2, batch 17600, loss[loss=0.2908, simple_loss=0.3497, pruned_loss=0.1159, over 21770.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3439, pruned_loss=0.107, over 4267947.76 frames. ], batch size: 247, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:28:42,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=288690.0, ans=0.2 2023-06-19 05:28:44,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=288690.0, ans=0.125 2023-06-19 05:29:09,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=288690.0, ans=0.1 2023-06-19 05:29:26,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=288750.0, ans=0.125 2023-06-19 05:29:53,828 INFO [train.py:996] (3/4) Epoch 2, batch 17650, loss[loss=0.1847, simple_loss=0.2372, pruned_loss=0.06608, over 21225.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3416, pruned_loss=0.1067, over 4269212.51 frames. 
], batch size: 176, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:30:05,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=288870.0, ans=0.125 2023-06-19 05:30:31,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=288870.0, ans=0.0 2023-06-19 05:30:34,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.529e+02 3.160e+02 3.730e+02 7.099e+02, threshold=6.320e+02, percent-clipped=1.0 2023-06-19 05:30:36,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.32 vs. limit=15.0 2023-06-19 05:30:40,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=288930.0, ans=0.125 2023-06-19 05:31:40,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=289050.0, ans=0.125 2023-06-19 05:32:11,540 INFO [train.py:996] (3/4) Epoch 2, batch 17700, loss[loss=0.2616, simple_loss=0.3508, pruned_loss=0.08626, over 21812.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3336, pruned_loss=0.1023, over 4266005.78 frames. ], batch size: 282, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:34:21,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.70 vs. limit=10.0 2023-06-19 05:34:45,048 INFO [train.py:996] (3/4) Epoch 2, batch 17750, loss[loss=0.3169, simple_loss=0.3799, pruned_loss=0.127, over 21690.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3454, pruned_loss=0.1086, over 4268986.84 frames. ], batch size: 351, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:34:45,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=289470.0, ans=0.0 2023-06-19 05:35:15,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.691e+02 3.269e+02 3.962e+02 7.961e+02, threshold=6.538e+02, percent-clipped=2.0 2023-06-19 05:35:30,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=289530.0, ans=0.125 2023-06-19 05:35:51,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=289590.0, ans=0.0 2023-06-19 05:36:02,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=289650.0, ans=0.2 2023-06-19 05:36:38,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=289710.0, ans=0.125 2023-06-19 05:36:43,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-19 05:37:04,585 INFO [train.py:996] (3/4) Epoch 2, batch 17800, loss[loss=0.2712, simple_loss=0.3433, pruned_loss=0.09954, over 21740.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3445, pruned_loss=0.1075, over 4273175.65 frames. ], batch size: 332, lr: 1.61e-02, grad_scale: 32.0
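
The scaling.py:182 ScheduledFloat records show per-module hyperparameters (skip rates, balancer probabilities, bypass scale floors, dropout) being re-evaluated against batch_count on every report. By this point in training most of them sit at fixed values, which is what a piecewise-linear schedule over the batch index looks like once past its last breakpoint. A sketch of such a schedule; the breakpoints below are purely illustrative, not the recipe's actual settings.

from typing import List, Tuple

def scheduled_float(batch_count: float, points: List[Tuple[float, float]]) -> float:
    """Piecewise-linear schedule: `points` is a sorted list of
    (batch_count, value) breakpoints; the boundary value is held
    constant outside the covered range."""
    if batch_count <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if batch_count <= x1:
            return y0 + (batch_count - x0) * (y1 - y0) / (x1 - x0)
    return points[-1][1]

# Illustrative breakpoints only: a skip rate ramped down early in training
# would long since have reached its floor at batch_count=289590.0.
print(scheduled_float(289590.0, [(0.0, 0.5), (20000.0, 0.025), (50000.0, 0.0)]))  # -> 0.0

2023-06-19 05:39:34,451 INFO [train.py:996] (3/4) Epoch 2, batch 17850, loss[loss=0.274, simple_loss=0.3137, pruned_loss=0.1172, over 20174.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.345, pruned_loss=0.1079, over 4273701.74 frames.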
], batch size: 702, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:40:01,052 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.782e+02 3.357e+02 4.257e+02 9.205e+02, threshold=6.714e+02, percent-clipped=6.0 2023-06-19 05:40:07,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=290130.0, ans=0.125 2023-06-19 05:40:24,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=290190.0, ans=0.025 2023-06-19 05:41:49,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=290370.0, ans=0.04949747468305833 2023-06-19 05:41:50,091 INFO [train.py:996] (3/4) Epoch 2, batch 17900, loss[loss=0.3043, simple_loss=0.3703, pruned_loss=0.1191, over 21488.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3511, pruned_loss=0.1106, over 4273384.59 frames. ], batch size: 131, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:42:09,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=290430.0, ans=0.0 2023-06-19 05:42:42,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290490.0, ans=0.1 2023-06-19 05:43:16,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=290610.0, ans=0.125 2023-06-19 05:43:57,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290610.0, ans=0.1 2023-06-19 05:43:59,616 INFO [train.py:996] (3/4) Epoch 2, batch 17950, loss[loss=0.2818, simple_loss=0.3647, pruned_loss=0.09947, over 21498.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3511, pruned_loss=0.1065, over 4274304.63 frames. ], batch size: 471, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:44:20,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-19 05:44:34,921 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.675e+02 3.159e+02 3.778e+02 7.100e+02, threshold=6.318e+02, percent-clipped=1.0 2023-06-19 05:44:56,436 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2023-06-19 05:45:02,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=290790.0, ans=0.0 2023-06-19 05:46:07,266 INFO [train.py:996] (3/4) Epoch 2, batch 18000, loss[loss=0.2623, simple_loss=0.3148, pruned_loss=0.1049, over 21739.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3428, pruned_loss=0.1048, over 4274370.57 frames. ], batch size: 371, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:46:07,267 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 05:47:07,987 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2814, simple_loss=0.3799, pruned_loss=0.0915, over 1796401.00 frames. 2023-06-19 05:47:07,988 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
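
The train.py:1019/1028/1029 records just above show the periodic validation pass: training pauses at batch 18000, a frame-weighted loss is computed over the fixed dev set (1796401.00 frames), and peak GPU memory is reported. A sketch of that pattern; compute_loss here is a hypothetical stand-in returning a frame-summed loss and a frame count, not icefall's actual API.

import torch

def run_validation(model, dev_loader, compute_loss, device):
    """Frame-weighted validation loss plus a peak-memory report."""
    model.eval()
    tot_loss = 0.0
    tot_frames = 0.0
    with torch.no_grad():
        for batch in dev_loader:
            # compute_loss: hypothetical helper returning
            # (loss summed over frames, number of frames).
            loss_sum, num_frames = compute_loss(model, batch)
            tot_loss += float(loss_sum)
            tot_frames += num_frames
    model.train()
    print("validation: loss=%.4f, over %.2f frames." % (tot_loss / tot_frames, tot_frames))
    print("Maximum memory allocated so far is %dMB"
          % (torch.cuda.max_memory_allocated(device) // (1024 * 1024)))

2023-06-19 05:48:54,871 INFO [train.py:996] (3/4) Epoch 2, batch 18050, loss[loss=0.2457, simple_loss=0.3052, pruned_loss=0.09312, over 21428.00 frames.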
], tot_loss[loss=0.2733, simple_loss=0.3381, pruned_loss=0.1042, over 4271534.30 frames. ], batch size: 211, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:48:58,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=291270.0, ans=0.09899494936611666 2023-06-19 05:49:26,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.691e+02 3.245e+02 3.925e+02 6.947e+02, threshold=6.490e+02, percent-clipped=2.0 2023-06-19 05:49:33,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=291330.0, ans=0.0 2023-06-19 05:50:47,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=291510.0, ans=0.125 2023-06-19 05:50:47,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=291510.0, ans=0.0 2023-06-19 05:51:13,357 INFO [train.py:996] (3/4) Epoch 2, batch 18100, loss[loss=0.3115, simple_loss=0.3851, pruned_loss=0.119, over 21636.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3426, pruned_loss=0.1064, over 4277566.97 frames. ], batch size: 414, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:51:17,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=291570.0, ans=0.0 2023-06-19 05:52:09,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=291690.0, ans=0.1 2023-06-19 05:52:32,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.02 vs. limit=22.5 2023-06-19 05:52:36,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=291750.0, ans=0.1 2023-06-19 05:52:52,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=291810.0, ans=15.0 2023-06-19 05:52:53,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-19 05:53:14,867 INFO [train.py:996] (3/4) Epoch 2, batch 18150, loss[loss=0.3001, simple_loss=0.3531, pruned_loss=0.1236, over 21633.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3434, pruned_loss=0.106, over 4279373.05 frames. ], batch size: 415, lr: 1.60e-02, grad_scale: 32.0
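
The scaling.py:962 Whitening records are diagnostics on activation statistics: for the named sub-module, the channels (optionally split into num_groups) are checked for how far their covariance is from a multiple of the identity, and the measured metric is printed against that module's limit. Records land on both sides of the limit (metric=3.26 vs. limit=15.0 completes just below; metric=15.32 vs. limit=15.0 appears further up). A sketch of one plausible metric, the eigenvalue dispersion of the channel covariance; the recipe's exact formula may differ.

import torch

def whitening_metric(x, num_groups=1):
    """Eigenvalue dispersion of the channel covariance of x
    (shape: frames x channels). 1.0 means perfectly 'white' features;
    larger values mean a few directions dominate. One plausible
    metric only, not necessarily the recipe's exact formula."""
    worst = 0.0
    for g in x.chunk(num_groups, dim=1):
        g = g - g.mean(dim=0, keepdim=True)
        cov = g.T @ g / g.shape[0]
        eigs = torch.linalg.eigvalsh(cov)   # real eigenvalues, ascending
        worst = max(worst, float(eigs.pow(2).mean() / eigs.mean().pow(2)))
    return worst

# Near-white activations give a metric close to 1:
print("metric=%.2f vs. limit=15.0" % whitening_metric(torch.randn(2000, 256)))

2023-06-19 05:53:28,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.26 vs.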
limit=15.0 2023-06-19 05:53:40,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.848e+02 3.405e+02 4.028e+02 7.103e+02, threshold=6.809e+02, percent-clipped=3.0 2023-06-19 05:53:43,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=291930.0, ans=0.125 2023-06-19 05:53:45,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=291930.0, ans=0.0 2023-06-19 05:54:34,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=291990.0, ans=0.125 2023-06-19 05:54:37,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.13 vs. limit=12.0 2023-06-19 05:54:37,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=292050.0, ans=12.0 2023-06-19 05:54:40,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=292050.0, ans=0.125 2023-06-19 05:55:05,668 INFO [train.py:996] (3/4) Epoch 2, batch 18200, loss[loss=0.2141, simple_loss=0.2861, pruned_loss=0.071, over 21827.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3388, pruned_loss=0.1058, over 4270933.50 frames. ], batch size: 102, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:55:31,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=292170.0, ans=0.1 2023-06-19 05:57:11,337 INFO [train.py:996] (3/4) Epoch 2, batch 18250, loss[loss=0.252, simple_loss=0.2913, pruned_loss=0.1063, over 20821.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3319, pruned_loss=0.1029, over 4277329.28 frames. ], batch size: 609, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:57:30,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.559e+02 2.946e+02 3.738e+02 5.936e+02, threshold=5.891e+02, percent-clipped=0.0 2023-06-19 05:58:34,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=292650.0, ans=0.2 2023-06-19 05:58:39,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-19 05:59:09,761 INFO [train.py:996] (3/4) Epoch 2, batch 18300, loss[loss=0.2936, simple_loss=0.3981, pruned_loss=0.0945, over 21791.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3308, pruned_loss=0.1015, over 4274072.13 frames. ], batch size: 351, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:59:49,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=292830.0, ans=0.125 2023-06-19 06:00:12,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-19 06:00:13,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=292890.0, ans=0.2 2023-06-19 06:01:22,872 INFO [train.py:996] (3/4) Epoch 2, batch 18350, loss[loss=0.2533, simple_loss=0.3078, pruned_loss=0.09939, over 21335.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3365, pruned_loss=0.103, over 4271251.65 frames. 
], batch size: 144, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:01:32,721 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.72 vs. limit=22.5 2023-06-19 06:01:43,032 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 3.069e+02 3.880e+02 5.006e+02 9.959e+02, threshold=7.760e+02, percent-clipped=14.0 2023-06-19 06:02:32,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=293190.0, ans=0.1 2023-06-19 06:02:39,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=293250.0, ans=0.0 2023-06-19 06:03:19,017 INFO [train.py:996] (3/4) Epoch 2, batch 18400, loss[loss=0.2211, simple_loss=0.2815, pruned_loss=0.08032, over 21340.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3322, pruned_loss=0.1009, over 4266087.50 frames. ], batch size: 160, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:03:35,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=293370.0, ans=0.0 2023-06-19 06:05:28,466 INFO [train.py:996] (3/4) Epoch 2, batch 18450, loss[loss=0.2336, simple_loss=0.2875, pruned_loss=0.0898, over 21831.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3284, pruned_loss=0.09702, over 4265893.03 frames. ], batch size: 118, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:05:48,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.533e+02 3.129e+02 3.878e+02 6.267e+02, threshold=6.259e+02, percent-clipped=0.0 2023-06-19 06:06:17,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=293790.0, ans=0.125 2023-06-19 06:07:29,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=293910.0, ans=0.0 2023-06-19 06:07:36,735 INFO [train.py:996] (3/4) Epoch 2, batch 18500, loss[loss=0.2607, simple_loss=0.3353, pruned_loss=0.09305, over 21568.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3238, pruned_loss=0.09665, over 4268120.70 frames. ], batch size: 389, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:08:38,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=294090.0, ans=0.025 2023-06-19 06:09:36,611 INFO [train.py:996] (3/4) Epoch 2, batch 18550, loss[loss=0.2448, simple_loss=0.2852, pruned_loss=0.1022, over 21353.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3205, pruned_loss=0.09547, over 4258023.80 frames. ], batch size: 160, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:09:38,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=294270.0, ans=0.0 2023-06-19 06:09:50,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=294270.0, ans=0.2 2023-06-19 06:09:51,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. 
limit=10.0 2023-06-19 06:10:04,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.529e+02 3.013e+02 3.541e+02 7.378e+02, threshold=6.027e+02, percent-clipped=2.0 2023-06-19 06:10:12,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=294330.0, ans=0.0 2023-06-19 06:10:29,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294390.0, ans=0.1 2023-06-19 06:10:31,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=294390.0, ans=0.2 2023-06-19 06:10:34,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=294390.0, ans=0.2 2023-06-19 06:10:53,184 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=22.5 2023-06-19 06:11:01,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=294450.0, ans=0.125 2023-06-19 06:11:26,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=294450.0, ans=0.125 2023-06-19 06:11:40,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294510.0, ans=0.1 2023-06-19 06:11:43,211 INFO [train.py:996] (3/4) Epoch 2, batch 18600, loss[loss=0.2368, simple_loss=0.3123, pruned_loss=0.08067, over 21648.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3186, pruned_loss=0.09636, over 4258975.01 frames. ], batch size: 247, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:13:27,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=294750.0, ans=0.1 2023-06-19 06:13:38,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=294810.0, ans=0.0 2023-06-19 06:13:41,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=294810.0, ans=0.0 2023-06-19 06:13:45,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=294810.0, ans=0.2 2023-06-19 06:13:48,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=294870.0, ans=0.0 2023-06-19 06:13:49,184 INFO [train.py:996] (3/4) Epoch 2, batch 18650, loss[loss=0.2769, simple_loss=0.3206, pruned_loss=0.1166, over 21590.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3184, pruned_loss=0.09751, over 4257585.22 frames. ], batch size: 415, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:14:20,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.816e+02 3.268e+02 3.940e+02 5.301e+02, threshold=6.535e+02, percent-clipped=0.0 2023-06-19 06:14:29,476 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.39 vs. limit=10.0 2023-06-19 06:15:39,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=295110.0, ans=0.2 2023-06-19 06:15:54,973 INFO [train.py:996] (3/4) Epoch 2, batch 18700, loss[loss=0.3107, simple_loss=0.346, pruned_loss=0.1377, over 21564.00 frames. 
], tot_loss[loss=0.2593, simple_loss=0.3183, pruned_loss=0.1001, over 4259793.80 frames. ], batch size: 471, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:17:19,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=295350.0, ans=0.125 2023-06-19 06:18:11,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0 2023-06-19 06:18:15,070 INFO [train.py:996] (3/4) Epoch 2, batch 18750, loss[loss=0.3359, simple_loss=0.3934, pruned_loss=0.1392, over 21651.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3209, pruned_loss=0.1029, over 4267194.34 frames. ], batch size: 389, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:18:34,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.770e+02 3.195e+02 3.990e+02 6.392e+02, threshold=6.389e+02, percent-clipped=0.0 2023-06-19 06:18:38,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=295530.0, ans=0.0 2023-06-19 06:19:53,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=295710.0, ans=0.125 2023-06-19 06:20:08,755 INFO [train.py:996] (3/4) Epoch 2, batch 18800, loss[loss=0.2633, simple_loss=0.3453, pruned_loss=0.09069, over 21693.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3243, pruned_loss=0.1013, over 4267872.15 frames. ], batch size: 441, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:20:10,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=295770.0, ans=0.125 2023-06-19 06:20:10,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=295770.0, ans=0.0 2023-06-19 06:20:25,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=295770.0, ans=0.125 2023-06-19 06:20:54,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=295830.0, ans=0.035 2023-06-19 06:21:00,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=295830.0, ans=0.125 2023-06-19 06:22:12,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0
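
grad_scale in the train.py:996 records is the dynamic fp16 loss scale. It sits on powers of two and moves only when mixed precision needs it to: it is halved after an overflowing step (32.0 drops to 16.0 at batch 18850, whose record completes just below) and grows back after a stretch of overflow-free steps (16.0 had returned to 32.0 between batches 17550 and 17600 earlier in the log). A sketch of the standard torch.cuda.amp pattern that produces this behaviour, with compute_loss as a hypothetical loss helper.

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0)  # scales like the 16.0/32.0 seen here

def fp16_step(model, optimizer, compute_loss, batch):
    """One mixed-precision training step with dynamic loss scaling."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)   # hypothetical loss helper
    scaler.scale(loss).backward()           # backward through the scaled loss
    scaler.step(optimizer)                  # skipped entirely if grads overflowed
    scaler.update()                         # halve the scale on overflow, else slowly grow it
    return loss.detach()

2023-06-19 06:22:26,188 INFO [train.py:996] (3/4) Epoch 2, batch 18850, loss[loss=0.2114, simple_loss=0.2784, pruned_loss=0.07224, over 21619.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3193, pruned_loss=0.0953, over 4266475.44 frames.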
], batch size: 247, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:22:50,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=296130.0, ans=0.2 2023-06-19 06:22:52,847 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 2.651e+02 3.232e+02 4.439e+02 7.009e+02, threshold=6.464e+02, percent-clipped=3.0 2023-06-19 06:23:00,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=296130.0, ans=0.125 2023-06-19 06:23:03,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=296130.0, ans=0.05 2023-06-19 06:23:17,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-19 06:24:28,386 INFO [train.py:996] (3/4) Epoch 2, batch 18900, loss[loss=0.2292, simple_loss=0.2904, pruned_loss=0.08394, over 20801.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3194, pruned_loss=0.09781, over 4258297.27 frames. ], batch size: 609, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:26:37,115 INFO [train.py:996] (3/4) Epoch 2, batch 18950, loss[loss=0.2927, simple_loss=0.3764, pruned_loss=0.1046, over 21432.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3235, pruned_loss=0.1019, over 4261959.27 frames. ], batch size: 212, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:27:13,341 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.916e+02 3.423e+02 4.145e+02 6.065e+02, threshold=6.846e+02, percent-clipped=0.0 2023-06-19 06:28:57,538 INFO [train.py:996] (3/4) Epoch 2, batch 19000, loss[loss=0.2825, simple_loss=0.3628, pruned_loss=0.101, over 21410.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.332, pruned_loss=0.1037, over 4254384.63 frames. ], batch size: 131, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:28:59,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.43 vs. limit=15.0 2023-06-19 06:29:00,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=296970.0, ans=0.0 2023-06-19 06:29:58,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.73 vs. limit=15.0 2023-06-19 06:30:09,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297090.0, ans=0.1 2023-06-19 06:30:38,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=297150.0, ans=0.125 2023-06-19 06:31:16,422 INFO [train.py:996] (3/4) Epoch 2, batch 19050, loss[loss=0.2967, simple_loss=0.348, pruned_loss=0.1227, over 21485.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3389, pruned_loss=0.1093, over 4262195.78 frames. ], batch size: 211, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:31:58,933 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 3.316e+02 4.197e+02 5.124e+02 7.398e+02, threshold=8.394e+02, percent-clipped=4.0 2023-06-19 06:32:06,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. 
limit=15.0 2023-06-19 06:33:17,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=297510.0, ans=0.125 2023-06-19 06:33:25,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.74 vs. limit=15.0 2023-06-19 06:33:39,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=297570.0, ans=0.2 2023-06-19 06:33:40,893 INFO [train.py:996] (3/4) Epoch 2, batch 19100, loss[loss=0.2613, simple_loss=0.3155, pruned_loss=0.1035, over 21613.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3371, pruned_loss=0.1103, over 4265992.83 frames. ], batch size: 332, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:34:00,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=297630.0, ans=0.125 2023-06-19 06:34:01,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=297630.0, ans=0.07 2023-06-19 06:34:30,642 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-19 06:34:49,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297690.0, ans=0.1 2023-06-19 06:35:48,532 INFO [train.py:996] (3/4) Epoch 2, batch 19150, loss[loss=0.3463, simple_loss=0.4393, pruned_loss=0.1266, over 21233.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3415, pruned_loss=0.1115, over 4258603.74 frames. ], batch size: 549, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:36:23,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 3.259e+02 3.778e+02 5.445e+02 1.039e+03, threshold=7.556e+02, percent-clipped=5.0 2023-06-19 06:36:26,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297930.0, ans=0.1 2023-06-19 06:36:27,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=297930.0, ans=0.09899494936611666 2023-06-19 06:38:15,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=298110.0, ans=0.025 2023-06-19 06:38:21,146 INFO [train.py:996] (3/4) Epoch 2, batch 19200, loss[loss=0.2935, simple_loss=0.3792, pruned_loss=0.104, over 21614.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3543, pruned_loss=0.1136, over 4261791.43 frames. ], batch size: 230, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:38:23,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=298170.0, ans=0.95 2023-06-19 06:38:32,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=10.0 2023-06-19 06:39:13,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-19 06:40:23,951 INFO [train.py:996] (3/4) Epoch 2, batch 19250, loss[loss=0.285, simple_loss=0.3661, pruned_loss=0.102, over 19943.00 frames. 
], tot_loss[loss=0.2803, simple_loss=0.3495, pruned_loss=0.1055, over 4259180.48 frames. ], batch size: 702, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:40:29,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=298470.0, ans=0.0 2023-06-19 06:40:45,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=298530.0, ans=0.125 2023-06-19 06:40:47,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 2.384e+02 2.937e+02 3.389e+02 6.470e+02, threshold=5.874e+02, percent-clipped=0.0 2023-06-19 06:40:48,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-19 06:41:19,573 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:42:02,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=298710.0, ans=0.125 2023-06-19 06:42:04,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=298710.0, ans=0.125 2023-06-19 06:42:30,661 INFO [train.py:996] (3/4) Epoch 2, batch 19300, loss[loss=0.2097, simple_loss=0.265, pruned_loss=0.07717, over 16337.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3449, pruned_loss=0.1041, over 4262237.61 frames. ], batch size: 60, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:43:08,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=298830.0, ans=0.125 2023-06-19 06:43:43,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=298890.0, ans=0.0 2023-06-19 06:44:16,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=299010.0, ans=0.0 2023-06-19 06:44:44,485 INFO [train.py:996] (3/4) Epoch 2, batch 19350, loss[loss=0.2116, simple_loss=0.2871, pruned_loss=0.06807, over 21583.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3362, pruned_loss=0.0982, over 4271777.51 frames. ], batch size: 230, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:45:28,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.566e+02 3.106e+02 3.775e+02 8.572e+02, threshold=6.211e+02, percent-clipped=2.0 2023-06-19 06:45:38,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=299130.0, ans=0.125 2023-06-19 06:46:02,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=299190.0, ans=0.2 2023-06-19 06:47:03,757 INFO [train.py:996] (3/4) Epoch 2, batch 19400, loss[loss=0.2762, simple_loss=0.3308, pruned_loss=0.1108, over 21264.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3348, pruned_loss=0.09744, over 4261730.80 frames. ], batch size: 143, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:47:29,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.66 vs. 
limit=10.0 2023-06-19 06:47:46,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=299430.0, ans=0.125 2023-06-19 06:48:00,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=299490.0, ans=0.5 2023-06-19 06:48:00,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-19 06:48:06,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=299490.0, ans=0.0 2023-06-19 06:49:09,821 INFO [train.py:996] (3/4) Epoch 2, batch 19450, loss[loss=0.2831, simple_loss=0.3343, pruned_loss=0.1159, over 21974.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3328, pruned_loss=0.0999, over 4275898.54 frames. ], batch size: 113, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:49:30,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=299730.0, ans=0.0 2023-06-19 06:49:34,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.192e+02 3.820e+02 4.643e+02 7.190e+02, threshold=7.640e+02, percent-clipped=4.0 2023-06-19 06:50:08,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299790.0, ans=0.1 2023-06-19 06:51:01,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299910.0, ans=0.1 2023-06-19 06:51:06,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=12.0 2023-06-19 06:51:16,585 INFO [train.py:996] (3/4) Epoch 2, batch 19500, loss[loss=0.3668, simple_loss=0.4055, pruned_loss=0.1641, over 21507.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3299, pruned_loss=0.1026, over 4267404.82 frames. ], batch size: 509, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:51:46,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=299970.0, ans=0.0 2023-06-19 06:52:11,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-19 06:53:03,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=300150.0, ans=0.025 2023-06-19 06:53:03,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300150.0, ans=0.1 2023-06-19 06:53:22,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300150.0, ans=0.1 2023-06-19 06:53:38,793 INFO [train.py:996] (3/4) Epoch 2, batch 19550, loss[loss=0.2464, simple_loss=0.3349, pruned_loss=0.07894, over 21661.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3263, pruned_loss=0.1012, over 4264995.44 frames. 
], batch size: 263, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:54:27,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.773e+02 3.270e+02 3.884e+02 6.073e+02, threshold=6.540e+02, percent-clipped=0.0 2023-06-19 06:54:31,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300330.0, ans=0.1 2023-06-19 06:54:49,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300390.0, ans=0.1 2023-06-19 06:55:56,406 INFO [train.py:996] (3/4) Epoch 2, batch 19600, loss[loss=0.2771, simple_loss=0.327, pruned_loss=0.1136, over 21570.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.328, pruned_loss=0.1026, over 4269202.35 frames. ], batch size: 548, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:55:58,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=300570.0, ans=0.125 2023-06-19 06:56:11,553 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-19 06:57:02,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=300690.0, ans=0.0 2023-06-19 06:57:16,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=300690.0, ans=0.0 2023-06-19 06:57:47,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300810.0, ans=0.1 2023-06-19 06:58:20,680 INFO [train.py:996] (3/4) Epoch 2, batch 19650, loss[loss=0.2731, simple_loss=0.3286, pruned_loss=0.1088, over 21613.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3346, pruned_loss=0.1077, over 4275460.13 frames. ], batch size: 263, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:58:54,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.810e+02 3.179e+02 3.741e+02 7.713e+02, threshold=6.358e+02, percent-clipped=2.0 2023-06-19 06:59:12,556 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:00:22,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=301110.0, ans=0.125 2023-06-19 07:00:55,589 INFO [train.py:996] (3/4) Epoch 2, batch 19700, loss[loss=0.2431, simple_loss=0.3447, pruned_loss=0.07082, over 20761.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3376, pruned_loss=0.1082, over 4265431.18 frames. ], batch size: 608, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:02:54,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=301410.0, ans=0.0 2023-06-19 07:03:08,467 INFO [train.py:996] (3/4) Epoch 2, batch 19750, loss[loss=0.3159, simple_loss=0.3996, pruned_loss=0.1161, over 21773.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3501, pruned_loss=0.1108, over 4268044.84 frames. 
], batch size: 298, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:03:22,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=301470.0, ans=0.2 2023-06-19 07:03:41,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=301530.0, ans=0.125 2023-06-19 07:03:44,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.569e+02 3.192e+02 3.926e+02 7.719e+02, threshold=6.384e+02, percent-clipped=3.0 2023-06-19 07:04:00,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301530.0, ans=0.1 2023-06-19 07:04:50,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=301650.0, ans=0.125 2023-06-19 07:04:50,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=301650.0, ans=0.0 2023-06-19 07:05:24,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-19 07:05:27,404 INFO [train.py:996] (3/4) Epoch 2, batch 19800, loss[loss=0.2067, simple_loss=0.2522, pruned_loss=0.08057, over 21793.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3499, pruned_loss=0.1116, over 4270470.17 frames. ], batch size: 102, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:05:29,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=301770.0, ans=0.125 2023-06-19 07:06:00,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=301830.0, ans=0.04949747468305833 2023-06-19 07:06:52,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-19 07:07:31,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-19 07:07:36,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0 2023-06-19 07:07:55,596 INFO [train.py:996] (3/4) Epoch 2, batch 19850, loss[loss=0.2649, simple_loss=0.3583, pruned_loss=0.08578, over 20859.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3384, pruned_loss=0.1041, over 4265947.93 frames. ], batch size: 607, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:08:29,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.580e+02 3.192e+02 4.086e+02 8.227e+02, threshold=6.384e+02, percent-clipped=3.0 2023-06-19 07:08:50,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=302190.0, ans=0.125 2023-06-19 07:08:51,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.93 vs. 
limit=10.0 2023-06-19 07:10:08,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=302370.0, ans=0.0 2023-06-19 07:10:15,730 INFO [train.py:996] (3/4) Epoch 2, batch 19900, loss[loss=0.2382, simple_loss=0.3076, pruned_loss=0.0844, over 20774.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.339, pruned_loss=0.1016, over 4258675.10 frames. ], batch size: 607, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:10:55,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=302430.0, ans=0.125 2023-06-19 07:10:58,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-19 07:11:49,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=302550.0, ans=0.125 2023-06-19 07:11:56,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.22 vs. limit=15.0 2023-06-19 07:12:13,339 INFO [train.py:996] (3/4) Epoch 2, batch 19950, loss[loss=0.2431, simple_loss=0.3199, pruned_loss=0.08313, over 21759.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3321, pruned_loss=0.1009, over 4257112.39 frames. ], batch size: 316, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:12:21,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=302670.0, ans=0.0 2023-06-19 07:12:35,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=302670.0, ans=0.1 2023-06-19 07:12:46,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.592e+02 3.325e+02 4.122e+02 6.437e+02, threshold=6.651e+02, percent-clipped=1.0 2023-06-19 07:12:46,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=302730.0, ans=0.125 2023-06-19 07:12:46,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=302730.0, ans=0.125 2023-06-19 07:13:05,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.44 vs. limit=6.0 2023-06-19 07:13:18,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=302790.0, ans=0.125 2023-06-19 07:13:33,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=302790.0, ans=0.125 2023-06-19 07:14:28,332 INFO [train.py:996] (3/4) Epoch 2, batch 20000, loss[loss=0.3522, simple_loss=0.3933, pruned_loss=0.1555, over 21615.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3331, pruned_loss=0.1015, over 4248432.47 frames. ], batch size: 471, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:14:55,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. 
limit=10.0 2023-06-19 07:16:08,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=303150.0, ans=0.125 2023-06-19 07:16:11,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-06-19 07:16:47,201 INFO [train.py:996] (3/4) Epoch 2, batch 20050, loss[loss=0.3531, simple_loss=0.381, pruned_loss=0.1626, over 21758.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3364, pruned_loss=0.1059, over 4248596.77 frames. ], batch size: 508, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:17:00,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=303270.0, ans=0.125 2023-06-19 07:17:17,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 3.033e+02 3.423e+02 3.985e+02 8.117e+02, threshold=6.846e+02, percent-clipped=3.0 2023-06-19 07:17:36,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=303330.0, ans=0.125 2023-06-19 07:18:06,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303390.0, ans=0.1 2023-06-19 07:18:12,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=303450.0, ans=0.1 2023-06-19 07:18:15,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=36.07 vs. limit=15.0 2023-06-19 07:18:18,389 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:18:18,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=303450.0, ans=0.125 2023-06-19 07:18:35,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=303510.0, ans=0.0 2023-06-19 07:19:11,596 INFO [train.py:996] (3/4) Epoch 2, batch 20100, loss[loss=0.2875, simple_loss=0.3699, pruned_loss=0.1025, over 20990.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.34, pruned_loss=0.1087, over 4248921.40 frames. 
2023-06-19 07:19:13,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=303570.0, ans=0.2
2023-06-19 07:19:16,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=303570.0, ans=0.0
2023-06-19 07:19:49,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=303630.0, ans=0.0
2023-06-19 07:20:06,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=303630.0, ans=0.0
2023-06-19 07:20:19,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=303690.0, ans=0.125
2023-06-19 07:20:21,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=303690.0, ans=0.0
2023-06-19 07:21:50,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=303870.0, ans=0.125
2023-06-19 07:21:56,484 INFO [train.py:996] (3/4) Epoch 2, batch 20150, loss[loss=0.2687, simple_loss=0.2993, pruned_loss=0.119, over 20228.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3494, pruned_loss=0.1123, over 4261605.88 frames. ], batch size: 703, lr: 1.57e-02, grad_scale: 32.0
2023-06-19 07:22:26,654 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.980e+02 4.036e+02 4.841e+02 1.073e+03, threshold=8.072e+02, percent-clipped=4.0
2023-06-19 07:22:46,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=303990.0, ans=0.0
2023-06-19 07:22:59,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.71 vs. limit=10.0
2023-06-19 07:24:04,561 INFO [train.py:996] (3/4) Epoch 2, batch 20200, loss[loss=0.2571, simple_loss=0.317, pruned_loss=0.09857, over 21682.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3552, pruned_loss=0.115, over 4268183.86 frames. ], batch size: 247, lr: 1.57e-02, grad_scale: 16.0
2023-06-19 07:24:53,458 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0
2023-06-19 07:25:09,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5
2023-06-19 07:25:13,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=304290.0, ans=0.0
2023-06-19 07:26:31,276 INFO [train.py:996] (3/4) Epoch 2, batch 20250, loss[loss=0.3307, simple_loss=0.3802, pruned_loss=0.1406, over 21593.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.354, pruned_loss=0.1117, over 4267138.95 frames. ], batch size: 471, lr: 1.57e-02, grad_scale: 16.0
2023-06-19 07:26:38,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=304470.0, ans=0.0
2023-06-19 07:26:42,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=304470.0, ans=0.125
2023-06-19 07:26:44,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.41 vs. limit=10.0
2023-06-19 07:26:55,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.816e+02 3.333e+02 3.978e+02 6.194e+02, threshold=6.665e+02, percent-clipped=0.0
2023-06-19 07:26:59,350 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.22 vs. limit=10.0
2023-06-19 07:28:03,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=304650.0, ans=0.0
2023-06-19 07:28:36,320 INFO [train.py:996] (3/4) Epoch 2, batch 20300, loss[loss=0.2808, simple_loss=0.3514, pruned_loss=0.1051, over 21929.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3492, pruned_loss=0.1074, over 4258890.56 frames. ], batch size: 107, lr: 1.57e-02, grad_scale: 16.0
2023-06-19 07:28:38,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=304770.0, ans=0.0
2023-06-19 07:28:52,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=304830.0, ans=0.0
2023-06-19 07:29:07,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=304830.0, ans=0.125
2023-06-19 07:29:11,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=304890.0, ans=0.05
2023-06-19 07:29:48,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=304950.0, ans=0.2
2023-06-19 07:30:34,158 INFO [train.py:996] (3/4) Epoch 2, batch 20350, loss[loss=0.314, simple_loss=0.386, pruned_loss=0.121, over 20743.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3484, pruned_loss=0.108, over 4256465.97 frames. ], batch size: 607, lr: 1.57e-02, grad_scale: 16.0
2023-06-19 07:31:00,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.745e+02 3.216e+02 3.932e+02 7.808e+02, threshold=6.432e+02, percent-clipped=2.0
2023-06-19 07:31:51,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=12.0
2023-06-19 07:32:34,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=305310.0, ans=0.125
2023-06-19 07:32:35,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.02 vs. limit=15.0
2023-06-19 07:32:42,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=305310.0, ans=0.0
2023-06-19 07:32:54,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=12.0
2023-06-19 07:32:55,101 INFO [train.py:996] (3/4) Epoch 2, batch 20400, loss[loss=0.2314, simple_loss=0.2974, pruned_loss=0.08272, over 16796.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3517, pruned_loss=0.1112, over 4257117.01 frames. ], batch size: 63, lr: 1.57e-02, grad_scale: 32.0
2023-06-19 07:32:55,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=305370.0, ans=0.5
2023-06-19 07:33:03,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=305370.0, ans=0.0
2023-06-19 07:34:21,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=305550.0, ans=0.125
2023-06-19 07:34:37,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=305550.0, ans=0.0
2023-06-19 07:34:48,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=305610.0, ans=0.125
2023-06-19 07:35:00,796 INFO [train.py:996] (3/4) Epoch 2, batch 20450, loss[loss=0.286, simple_loss=0.3441, pruned_loss=0.114, over 21887.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3526, pruned_loss=0.1144, over 4247953.64 frames. ], batch size: 118, lr: 1.57e-02, grad_scale: 32.0
2023-06-19 07:35:29,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0
2023-06-19 07:35:34,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.893e+02 2.841e+02 3.405e+02 4.322e+02 6.691e+02, threshold=6.810e+02, percent-clipped=2.0
2023-06-19 07:35:53,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=305790.0, ans=0.035
2023-06-19 07:36:09,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=305790.0, ans=0.125
2023-06-19 07:36:25,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305850.0, ans=0.1
2023-06-19 07:37:12,741 INFO [train.py:996] (3/4) Epoch 2, batch 20500, loss[loss=0.2644, simple_loss=0.3088, pruned_loss=0.11, over 21171.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.35, pruned_loss=0.1155, over 4255045.83 frames. ], batch size: 159, lr: 1.57e-02, grad_scale: 32.0
2023-06-19 07:37:13,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=305970.0, ans=0.125
2023-06-19 07:37:24,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=305970.0, ans=0.125
2023-06-19 07:38:35,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=306150.0, ans=0.0
2023-06-19 07:38:45,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=306150.0, ans=0.125
2023-06-19 07:39:27,696 INFO [train.py:996] (3/4) Epoch 2, batch 20550, loss[loss=0.2392, simple_loss=0.2995, pruned_loss=0.0895, over 21150.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3431, pruned_loss=0.1126, over 4251551.57 frames. ], batch size: 143, lr: 1.57e-02, grad_scale: 32.0
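The [optim.py:471] lines summarize gradient norms over a recent window of batches: the five numbers are the min, 25%, median, 75% and max quantiles, and in every record here the threshold equals Clipping_scale times the median (for the 07:35:34 record, 2.0 * 3.405e+02 = 6.810e+02), with percent-clipped reporting how often the norm exceeded that threshold. A minimal sketch of that bookkeeping (assumed from the logged numbers, not copied from optim.py):

    import torch

    def grad_norm_summary(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
        # grad_norms: 1-D tensor of gradient norms from recent batches.
        qs = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * qs[2]  # scale times the median
        percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
        return qs, threshold, percent_clipped

Clipping to a multiple of the running median, rather than to a fixed constant, adapts the threshold as the loss surface changes while still suppressing the rare outlier batches that show up in the max column.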
2023-06-19 07:40:03,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.682e+02 3.144e+02 3.592e+02 6.172e+02, threshold=6.288e+02, percent-clipped=0.0
2023-06-19 07:41:33,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=306510.0, ans=0.125
2023-06-19 07:41:38,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=306570.0, ans=0.0
2023-06-19 07:41:39,992 INFO [train.py:996] (3/4) Epoch 2, batch 20600, loss[loss=0.2937, simple_loss=0.3456, pruned_loss=0.1209, over 21872.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3438, pruned_loss=0.1111, over 4247814.02 frames. ], batch size: 371, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 07:41:53,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=306630.0, ans=0.0
2023-06-19 07:41:53,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=306630.0, ans=0.125
2023-06-19 07:42:28,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=306690.0, ans=0.0
2023-06-19 07:43:48,061 INFO [train.py:996] (3/4) Epoch 2, batch 20650, loss[loss=0.2367, simple_loss=0.3022, pruned_loss=0.08563, over 17348.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3385, pruned_loss=0.1102, over 4247897.42 frames. ], batch size: 64, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 07:43:48,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=306870.0, ans=0.0
2023-06-19 07:43:50,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=306870.0, ans=0.0
2023-06-19 07:43:53,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=306870.0, ans=0.02
2023-06-19 07:44:18,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.691e+02 3.207e+02 4.278e+02 6.062e+02, threshold=6.414e+02, percent-clipped=0.0
2023-06-19 07:44:57,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=306990.0, ans=0.0
2023-06-19 07:45:13,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307050.0, ans=0.1
2023-06-19 07:45:53,884 INFO [train.py:996] (3/4) Epoch 2, batch 20700, loss[loss=0.1979, simple_loss=0.2621, pruned_loss=0.0668, over 21374.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3304, pruned_loss=0.1056, over 4250411.09 frames. ], batch size: 131, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 07:47:16,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307350.0, ans=0.1
2023-06-19 07:47:22,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=307350.0, ans=0.125
2023-06-19 07:47:58,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=307410.0, ans=0.2
2023-06-19 07:48:09,366 INFO [train.py:996] (3/4) Epoch 2, batch 20750, loss[loss=0.3526, simple_loss=0.4353, pruned_loss=0.1349, over 21697.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3331, pruned_loss=0.1045, over 4258014.47 frames. ], batch size: 414, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 07:48:44,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.969e+02 3.611e+02 4.710e+02 7.755e+02, threshold=7.221e+02, percent-clipped=2.0
2023-06-19 07:48:48,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=307530.0, ans=0.0
2023-06-19 07:48:49,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=307530.0, ans=0.05
2023-06-19 07:49:02,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.98 vs. limit=6.0
2023-06-19 07:49:04,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.96 vs. limit=22.5
2023-06-19 07:49:44,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.89 vs. limit=22.5
2023-06-19 07:49:45,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=307650.0, ans=0.0
2023-06-19 07:49:53,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=307650.0, ans=0.125
2023-06-19 07:50:11,241 INFO [train.py:996] (3/4) Epoch 2, batch 20800, loss[loss=0.3174, simple_loss=0.4326, pruned_loss=0.1011, over 20765.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3371, pruned_loss=0.1048, over 4253115.76 frames. ], batch size: 607, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 07:51:38,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=307890.0, ans=0.125
2023-06-19 07:51:58,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=307950.0, ans=0.125
2023-06-19 07:52:34,129 INFO [train.py:996] (3/4) Epoch 2, batch 20850, loss[loss=0.2893, simple_loss=0.342, pruned_loss=0.1183, over 22003.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3318, pruned_loss=0.1036, over 4249296.06 frames. ], batch size: 113, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 07:53:03,495 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.789e+02 3.231e+02 4.140e+02 1.099e+03, threshold=6.461e+02, percent-clipped=5.0
2023-06-19 07:54:37,012 INFO [train.py:996] (3/4) Epoch 2, batch 20900, loss[loss=0.2793, simple_loss=0.3379, pruned_loss=0.1104, over 21855.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3349, pruned_loss=0.1056, over 4250391.15 frames. ], batch size: 351, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 07:54:47,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=308370.0, ans=0.125
2023-06-19 07:54:51,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=308370.0, ans=0.125
2023-06-19 07:55:55,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0
2023-06-19 07:56:33,637 INFO [train.py:996] (3/4) Epoch 2, batch 20950, loss[loss=0.2578, simple_loss=0.317, pruned_loss=0.09926, over 21677.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3291, pruned_loss=0.1008, over 4245670.70 frames. ], batch size: 414, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 07:56:42,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0
2023-06-19 07:56:56,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 2.542e+02 3.174e+02 4.032e+02 6.054e+02, threshold=6.348e+02, percent-clipped=0.0
2023-06-19 07:58:28,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=308910.0, ans=0.125
2023-06-19 07:58:38,195 INFO [train.py:996] (3/4) Epoch 2, batch 21000, loss[loss=0.2681, simple_loss=0.3293, pruned_loss=0.1034, over 21874.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3282, pruned_loss=0.1013, over 4261401.90 frames. ], batch size: 124, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 07:58:38,195 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-19 07:59:32,903 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2892, simple_loss=0.3858, pruned_loss=0.09632, over 1796401.00 frames.
2023-06-19 07:59:32,904 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-19 07:59:42,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308970.0, ans=0.1
2023-06-19 07:59:49,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=309030.0, ans=0.125
2023-06-19 07:59:54,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=309030.0, ans=0.2
2023-06-19 07:59:56,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=309030.0, ans=0.0
2023-06-19 07:59:56,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=309030.0, ans=0.025
2023-06-19 08:00:49,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=309210.0, ans=0.125
2023-06-19 08:01:11,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0
2023-06-19 08:01:14,076 INFO [train.py:996] (3/4) Epoch 2, batch 21050, loss[loss=0.2478, simple_loss=0.3014, pruned_loss=0.09714, over 21724.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.327, pruned_loss=0.102, over 4263158.76 frames. ], batch size: 316, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 08:01:45,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.810e+02 3.247e+02 3.992e+02 5.990e+02, threshold=6.494e+02, percent-clipped=0.0
2023-06-19 08:01:47,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=309330.0, ans=0.125
2023-06-19 08:03:08,456 INFO [train.py:996] (3/4) Epoch 2, batch 21100, loss[loss=0.2456, simple_loss=0.3083, pruned_loss=0.09149, over 21594.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3238, pruned_loss=0.1015, over 4260753.50 frames. ], batch size: 263, lr: 1.56e-02, grad_scale: 32.0
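The validation block at batch 21000 above ([train.py:1019] through [train.py:1029]) is emitted on a fixed schedule: the loop pauses training, runs the whole dev set once in eval mode, and reports a frame-weighted average loss, always over the same 1796401.00 frames, which is what makes successive validation numbers directly comparable. A rough sketch of such a loop, with compute_loss standing in as a hypothetical helper for the recipe's actual loss computation:

    import torch

    @torch.no_grad()
    def validate(model, valid_loader, device) -> float:
        # Frame-weighted average over the full dev set; the total frame
        # count is fixed, so values are comparable across checkpoints.
        model.eval()
        weighted_sum, num_frames = 0.0, 0.0
        for batch in valid_loader:
            loss, frames = compute_loss(model, batch, device)  # hypothetical helper
            weighted_sum += loss.item() * frames
            num_frames += frames
        model.train()
        return weighted_sum / max(num_frames, 1.0)

Here the validation loss (0.2892) sits above the running train tot_loss (0.2654), and the [train.py:1029] line records peak GPU memory so that out-of-memory risk can be tracked alongside accuracy.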
2023-06-19 08:03:56,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=309630.0, ans=0.0
2023-06-19 08:04:19,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=309630.0, ans=0.0
2023-06-19 08:04:30,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=309690.0, ans=0.2
2023-06-19 08:04:44,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=309750.0, ans=0.125
2023-06-19 08:04:57,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309750.0, ans=0.1
2023-06-19 08:05:17,514 INFO [train.py:996] (3/4) Epoch 2, batch 21150, loss[loss=0.2639, simple_loss=0.3144, pruned_loss=0.1067, over 21851.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.32, pruned_loss=0.1023, over 4264204.15 frames. ], batch size: 107, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 08:05:40,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.684e+02 3.280e+02 4.252e+02 7.142e+02, threshold=6.560e+02, percent-clipped=1.0
2023-06-19 08:05:45,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=309930.0, ans=0.2
2023-06-19 08:05:50,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0
2023-06-19 08:06:14,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=309990.0, ans=0.0
2023-06-19 08:06:14,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=309990.0, ans=0.1
2023-06-19 08:06:14,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309990.0, ans=0.1
2023-06-19 08:06:31,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0
2023-06-19 08:07:12,908 INFO [train.py:996] (3/4) Epoch 2, batch 21200, loss[loss=0.2349, simple_loss=0.2846, pruned_loss=0.09262, over 21574.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3146, pruned_loss=0.1013, over 4264274.26 frames. ], batch size: 247, lr: 1.56e-02, grad_scale: 32.0
2023-06-19 08:07:21,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=310170.0, ans=0.1
2023-06-19 08:07:22,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=310170.0, ans=0.025
2023-06-19 08:07:51,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=310230.0, ans=0.125
2023-06-19 08:08:18,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=310290.0, ans=0.0
2023-06-19 08:08:34,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=310350.0, ans=0.125
2023-06-19 08:08:38,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=310350.0, ans=0.125
2023-06-19 08:09:05,332 INFO [train.py:996] (3/4) Epoch 2, batch 21250, loss[loss=0.2614, simple_loss=0.3158, pruned_loss=0.1035, over 21167.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3128, pruned_loss=0.1012, over 4272990.20 frames. ], batch size: 176, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:09:11,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=310470.0, ans=0.0
2023-06-19 08:09:28,386 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.442e+02 2.877e+02 3.403e+02 5.738e+02, threshold=5.754e+02, percent-clipped=0.0
2023-06-19 08:09:50,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=310530.0, ans=0.2
2023-06-19 08:10:43,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=22.5
2023-06-19 08:10:53,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0
2023-06-19 08:11:05,115 INFO [train.py:996] (3/4) Epoch 2, batch 21300, loss[loss=0.2973, simple_loss=0.3541, pruned_loss=0.1202, over 21889.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3213, pruned_loss=0.1048, over 4274948.15 frames. ], batch size: 107, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:11:20,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=310770.0, ans=0.2
2023-06-19 08:12:56,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=310950.0, ans=0.125
2023-06-19 08:12:56,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=310950.0, ans=0.0
2023-06-19 08:12:57,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=310950.0, ans=0.125
2023-06-19 08:12:57,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=310950.0, ans=0.0
2023-06-19 08:13:31,863 INFO [train.py:996] (3/4) Epoch 2, batch 21350, loss[loss=0.269, simple_loss=0.3326, pruned_loss=0.1027, over 21654.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.327, pruned_loss=0.1067, over 4289189.77 frames. ], batch size: 263, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:14:12,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.747e+02 3.368e+02 4.316e+02 7.083e+02, threshold=6.735e+02, percent-clipped=3.0
2023-06-19 08:14:41,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=311190.0, ans=0.0
2023-06-19 08:15:39,932 INFO [train.py:996] (3/4) Epoch 2, batch 21400, loss[loss=0.3096, simple_loss=0.3702, pruned_loss=0.1245, over 21746.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3276, pruned_loss=0.1044, over 4287058.01 frames. ], batch size: 332, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:17:12,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=311490.0, ans=0.0
2023-06-19 08:17:22,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=311550.0, ans=0.2
2023-06-19 08:17:41,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.87 vs. limit=10.0
2023-06-19 08:17:59,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5
2023-06-19 08:18:01,521 INFO [train.py:996] (3/4) Epoch 2, batch 21450, loss[loss=0.3007, simple_loss=0.3464, pruned_loss=0.1274, over 21871.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.331, pruned_loss=0.1059, over 4280239.70 frames. ], batch size: 118, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:18:15,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=311670.0, ans=0.125
2023-06-19 08:18:35,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.772e+02 3.350e+02 3.887e+02 8.399e+02, threshold=6.699e+02, percent-clipped=2.0
2023-06-19 08:19:13,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=311790.0, ans=0.0
2023-06-19 08:20:14,679 INFO [train.py:996] (3/4) Epoch 2, batch 21500, loss[loss=0.3345, simple_loss=0.3474, pruned_loss=0.1608, over 21516.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3311, pruned_loss=0.1088, over 4284945.39 frames. ], batch size: 511, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:20:43,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=311970.0, ans=0.125
2023-06-19 08:20:47,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0
2023-06-19 08:21:29,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.48 vs. limit=22.5
2023-06-19 08:21:33,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=312150.0, ans=0.125
2023-06-19 08:22:17,587 INFO [train.py:996] (3/4) Epoch 2, batch 21550, loss[loss=0.2314, simple_loss=0.2853, pruned_loss=0.08877, over 21647.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3238, pruned_loss=0.1042, over 4277027.72 frames. ], batch size: 264, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:22:45,299 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.454e+02 2.906e+02 3.459e+02 5.516e+02, threshold=5.812e+02, percent-clipped=0.0
2023-06-19 08:22:45,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=312330.0, ans=0.0
2023-06-19 08:22:48,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=312330.0, ans=0.125
2023-06-19 08:22:57,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=312330.0, ans=0.125
2023-06-19 08:23:14,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.93 vs. limit=10.0
2023-06-19 08:24:17,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=312510.0, ans=0.2
2023-06-19 08:24:33,359 INFO [train.py:996] (3/4) Epoch 2, batch 21600, loss[loss=0.2385, simple_loss=0.3081, pruned_loss=0.08442, over 21822.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3166, pruned_loss=0.1013, over 4270091.20 frames. ], batch size: 317, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:24:46,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0
2023-06-19 08:24:53,870 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 08:25:32,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=312690.0, ans=0.5
2023-06-19 08:26:33,276 INFO [train.py:996] (3/4) Epoch 2, batch 21650, loss[loss=0.2941, simple_loss=0.3915, pruned_loss=0.09835, over 21207.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3216, pruned_loss=0.0989, over 4266416.73 frames. ], batch size: 548, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:26:50,248 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.980e+02 3.690e+02 4.543e+02 8.571e+02, threshold=7.379e+02, percent-clipped=9.0
2023-06-19 08:27:17,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=312990.0, ans=0.2
2023-06-19 08:27:18,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5
2023-06-19 08:27:36,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=313050.0, ans=0.04949747468305833
2023-06-19 08:28:15,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=313110.0, ans=0.2
2023-06-19 08:28:19,140 INFO [train.py:996] (3/4) Epoch 2, batch 21700, loss[loss=0.1918, simple_loss=0.2561, pruned_loss=0.06377, over 17223.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3211, pruned_loss=0.09625, over 4259691.89 frames. ], batch size: 68, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:29:03,555 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0
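The [scaling.py:962] Whitening lines compare a whiteness statistic for a layer's activations against a limit; a corrective gradient is applied only when the metric drifts above the limit, which is why values such as metric=5.93 vs. limit=15.0 are logged and left alone. One way to define such a statistic, as a sketch only (the exact formula in scaling.py may differ), is a covariance-flatness ratio that equals 1.0 when the channel covariance is a multiple of the identity and grows as channels become correlated or unevenly scaled:

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels) activations for one group.
        x = x - x.mean(dim=0)
        cov = (x.t() @ x) / x.shape[0]  # channel covariance
        d = cov.shape[0]
        # 1.0 iff cov is a multiple of the identity; larger is "less white".
        return d * (cov ** 2).mean() / (cov.diagonal().mean() ** 2)

Entries with num_groups=4 (the whiten_keys records) would compute the statistic per group of channels rather than over all channels at once.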
2023-06-19 08:29:17,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313290.0, ans=0.1
2023-06-19 08:29:21,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=313290.0, ans=0.125
2023-06-19 08:29:28,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=313350.0, ans=0.0
2023-06-19 08:30:23,471 INFO [train.py:996] (3/4) Epoch 2, batch 21750, loss[loss=0.2856, simple_loss=0.3249, pruned_loss=0.1232, over 21503.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3168, pruned_loss=0.09754, over 4253431.31 frames. ], batch size: 442, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:30:24,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=313470.0, ans=0.0
2023-06-19 08:30:47,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.599e+02 3.130e+02 4.451e+02 8.277e+02, threshold=6.259e+02, percent-clipped=1.0
2023-06-19 08:31:00,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=313530.0, ans=0.07
2023-06-19 08:31:10,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=313590.0, ans=0.2
2023-06-19 08:31:11,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=313590.0, ans=0.0
2023-06-19 08:32:05,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0
2023-06-19 08:32:34,372 INFO [train.py:996] (3/4) Epoch 2, batch 21800, loss[loss=0.2314, simple_loss=0.2876, pruned_loss=0.08762, over 21818.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3161, pruned_loss=0.09887, over 4263161.90 frames. ], batch size: 107, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:33:44,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=313950.0, ans=0.0
2023-06-19 08:33:44,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313950.0, ans=0.1
2023-06-19 08:34:22,212 INFO [train.py:996] (3/4) Epoch 2, batch 21850, loss[loss=0.2729, simple_loss=0.3314, pruned_loss=0.1071, over 21797.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3233, pruned_loss=0.09991, over 4256205.57 frames. ], batch size: 247, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:34:41,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=314070.0, ans=0.1
2023-06-19 08:34:56,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.670e+02 3.094e+02 3.698e+02 5.413e+02, threshold=6.187e+02, percent-clipped=0.0
2023-06-19 08:36:17,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=314310.0, ans=0.2
2023-06-19 08:36:37,151 INFO [train.py:996] (3/4) Epoch 2, batch 21900, loss[loss=0.3084, simple_loss=0.3514, pruned_loss=0.1327, over 21619.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3242, pruned_loss=0.101, over 4267708.82 frames. ], batch size: 471, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:36:40,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=314370.0, ans=0.125
2023-06-19 08:36:40,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=314370.0, ans=0.1
2023-06-19 08:36:40,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=314370.0, ans=0.125
2023-06-19 08:37:11,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=314430.0, ans=0.2
2023-06-19 08:37:30,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=15.0
2023-06-19 08:38:33,243 INFO [train.py:996] (3/4) Epoch 2, batch 21950, loss[loss=0.1858, simple_loss=0.2719, pruned_loss=0.04986, over 21761.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3195, pruned_loss=0.09978, over 4266467.20 frames. ], batch size: 352, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:39:07,654 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.758e+02 3.297e+02 3.878e+02 5.596e+02, threshold=6.593e+02, percent-clipped=0.0
2023-06-19 08:39:42,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=314850.0, ans=0.125
2023-06-19 08:39:57,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=314910.0, ans=0.0
2023-06-19 08:40:20,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=314910.0, ans=0.125
2023-06-19 08:40:23,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=314910.0, ans=0.1
2023-06-19 08:40:39,710 INFO [train.py:996] (3/4) Epoch 2, batch 22000, loss[loss=0.2249, simple_loss=0.2801, pruned_loss=0.08489, over 21253.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.312, pruned_loss=0.09613, over 4263960.00 frames. ], batch size: 144, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:41:26,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=315090.0, ans=0.1
2023-06-19 08:41:29,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=315090.0, ans=0.0
2023-06-19 08:41:54,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=315150.0, ans=0.125
2023-06-19 08:42:08,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=315150.0, ans=0.125
2023-06-19 08:42:51,923 INFO [train.py:996] (3/4) Epoch 2, batch 22050, loss[loss=0.3282, simple_loss=0.3952, pruned_loss=0.1306, over 21890.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3189, pruned_loss=0.09917, over 4257872.74 frames. ], batch size: 372, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:43:06,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0
2023-06-19 08:43:21,678 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.619e+02 3.176e+02 4.335e+02 6.749e+02, threshold=6.352e+02, percent-clipped=1.0
2023-06-19 08:43:32,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=315390.0, ans=0.125
2023-06-19 08:43:42,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=315390.0, ans=0.05
2023-06-19 08:44:28,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=315450.0, ans=10.0
2023-06-19 08:44:42,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=315510.0, ans=0.2
2023-06-19 08:44:49,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=315510.0, ans=0.0
2023-06-19 08:45:04,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0
2023-06-19 08:45:06,829 INFO [train.py:996] (3/4) Epoch 2, batch 22100, loss[loss=0.2912, simple_loss=0.3482, pruned_loss=0.1171, over 21833.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3312, pruned_loss=0.1054, over 4251911.81 frames. ], batch size: 332, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:45:26,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=315630.0, ans=0.09899494936611666
2023-06-19 08:45:33,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=315630.0, ans=0.125
2023-06-19 08:45:39,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=315630.0, ans=0.1
2023-06-19 08:45:46,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=315690.0, ans=0.125
2023-06-19 08:46:02,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=315750.0, ans=0.125
2023-06-19 08:46:02,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=315750.0, ans=0.0
2023-06-19 08:46:20,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=315750.0, ans=0.125
2023-06-19 08:46:50,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=315810.0, ans=0.0
2023-06-19 08:47:01,478 INFO [train.py:996] (3/4) Epoch 2, batch 22150, loss[loss=0.3115, simple_loss=0.375, pruned_loss=0.124, over 19930.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3356, pruned_loss=0.1079, over 4260850.61 frames. ], batch size: 702, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:47:23,872 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 3.220e+02 3.673e+02 4.139e+02 7.886e+02, threshold=7.346e+02, percent-clipped=1.0
2023-06-19 08:47:44,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=315930.0, ans=0.0
2023-06-19 08:47:45,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=315990.0, ans=0.2
2023-06-19 08:48:08,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=315990.0, ans=0.125
2023-06-19 08:48:19,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=316050.0, ans=0.0
2023-06-19 08:49:07,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=316110.0, ans=0.125
2023-06-19 08:49:16,314 INFO [train.py:996] (3/4) Epoch 2, batch 22200, loss[loss=0.2855, simple_loss=0.3734, pruned_loss=0.09875, over 21896.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.336, pruned_loss=0.1077, over 4272617.75 frames. ], batch size: 316, lr: 1.54e-02, grad_scale: 64.0
2023-06-19 08:49:17,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=316170.0, ans=0.125
2023-06-19 08:50:16,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=316290.0, ans=0.125
2023-06-19 08:51:04,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0
2023-06-19 08:51:18,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=316410.0, ans=0.2
2023-06-19 08:51:24,986 INFO [train.py:996] (3/4) Epoch 2, batch 22250, loss[loss=0.307, simple_loss=0.3687, pruned_loss=0.1227, over 21505.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3445, pruned_loss=0.11, over 4278347.09 frames. ], batch size: 194, lr: 1.54e-02, grad_scale: 64.0
2023-06-19 08:51:45,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=316470.0, ans=0.2
2023-06-19 08:51:58,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.811e+02 3.425e+02 4.060e+02 7.172e+02, threshold=6.851e+02, percent-clipped=0.0
2023-06-19 08:52:41,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=316590.0, ans=0.0
2023-06-19 08:52:58,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=316650.0, ans=0.125
2023-06-19 08:53:19,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=316710.0, ans=0.0
2023-06-19 08:53:31,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=316710.0, ans=0.125
2023-06-19 08:53:34,326 INFO [train.py:996] (3/4) Epoch 2, batch 22300, loss[loss=0.2628, simple_loss=0.3139, pruned_loss=0.1059, over 21617.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3466, pruned_loss=0.1122, over 4279617.23 frames. ], batch size: 263, lr: 1.54e-02, grad_scale: 64.0
2023-06-19 08:54:15,949 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 08:55:46,283 INFO [train.py:996] (3/4) Epoch 2, batch 22350, loss[loss=0.2536, simple_loss=0.3135, pruned_loss=0.09683, over 21630.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3448, pruned_loss=0.1132, over 4282868.60 frames. ], batch size: 230, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:56:25,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=317070.0, ans=0.1
2023-06-19 08:56:32,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.798e+02 3.468e+02 4.054e+02 6.110e+02, threshold=6.936e+02, percent-clipped=0.0
2023-06-19 08:56:54,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=317190.0, ans=0.125
2023-06-19 08:58:00,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=317310.0, ans=0.125
2023-06-19 08:58:18,577 INFO [train.py:996] (3/4) Epoch 2, batch 22400, loss[loss=0.2543, simple_loss=0.3099, pruned_loss=0.09934, over 21190.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3402, pruned_loss=0.1092, over 4277554.82 frames. ], batch size: 548, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:58:29,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=317370.0, ans=0.0
2023-06-19 08:58:44,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=317430.0, ans=0.0
2023-06-19 08:59:07,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=317490.0, ans=0.2
2023-06-19 08:59:38,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=317550.0, ans=0.0
2023-06-19 08:59:46,187 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.42 vs. limit=6.0
2023-06-19 08:59:48,576 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 09:00:12,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0
2023-06-19 09:00:19,685 INFO [train.py:996] (3/4) Epoch 2, batch 22450, loss[loss=0.2508, simple_loss=0.303, pruned_loss=0.09933, over 21602.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3333, pruned_loss=0.1075, over 4282293.89 frames. ], batch size: 332, lr: 1.54e-02, grad_scale: 32.0
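Most of the remaining [scaling.py:182] traffic records ScheduledFloat values: module constants such as skip rates, dropout probabilities and balancer bounds that are not fixed hyperparameters but functions of batch_count, which is why each entry carries the current batch_count alongside the resolved ans. A sketch of a piecewise-linear evaluation in that spirit, with made-up breakpoints rather than the recipe's real schedules:

    # Sketch: piecewise-linear schedule over batch_count, in the spirit of
    # ScheduledFloat. The breakpoints below are illustrative only.
    def scheduled_float(batch_count: float, points: list[tuple[float, float]]) -> float:
        points = sorted(points)
        if batch_count <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if batch_count <= x1:
                # Linear interpolation between neighbouring breakpoints.
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
        return points[-1][1]

    # A skip rate decaying 0.5 -> 0.0 over the first 16k batches has long
    # since reached 0.0 by batch_count=317070.0, matching the ans=0.0 entries.
    print(scheduled_float(317070.0, [(0.0, 0.5), (4000.0, 0.05), (16000.0, 0.0)]))

This also explains why the values barely move at this point in training: at batch counts above 300k, most schedules have settled on their final constants.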
2023-06-19 09:00:27,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=317670.0, ans=0.0
2023-06-19 09:00:50,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.620e+02 3.213e+02 3.522e+02 5.684e+02, threshold=6.426e+02, percent-clipped=0.0
2023-06-19 09:01:24,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=317790.0, ans=0.0
2023-06-19 09:02:25,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=317910.0, ans=0.125
2023-06-19 09:02:29,737 INFO [train.py:996] (3/4) Epoch 2, batch 22500, loss[loss=0.2278, simple_loss=0.2794, pruned_loss=0.08807, over 21465.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3286, pruned_loss=0.1065, over 4283240.28 frames. ], batch size: 195, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 09:02:45,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=317970.0, ans=0.0
2023-06-19 09:04:50,917 INFO [train.py:996] (3/4) Epoch 2, batch 22550, loss[loss=0.2888, simple_loss=0.3457, pruned_loss=0.1159, over 21904.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3327, pruned_loss=0.1066, over 4290842.08 frames. ], batch size: 414, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 09:04:51,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=318270.0, ans=0.0
2023-06-19 09:05:20,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=318330.0, ans=0.125
2023-06-19 09:05:26,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.743e+02 3.376e+02 4.420e+02 1.013e+03, threshold=6.752e+02, percent-clipped=6.0
2023-06-19 09:06:06,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0
2023-06-19 09:06:29,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=318450.0, ans=0.05
2023-06-19 09:06:55,223 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0
2023-06-19 09:07:05,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5
2023-06-19 09:07:08,860 INFO [train.py:996] (3/4) Epoch 2, batch 22600, loss[loss=0.2333, simple_loss=0.2926, pruned_loss=0.08695, over 21599.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3348, pruned_loss=0.1072, over 4291522.93 frames. ], batch size: 230, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 09:08:08,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=318630.0, ans=0.125
2023-06-19 09:08:24,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=318690.0, ans=0.125
2023-06-19 09:09:11,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0
2023-06-19 09:09:22,027 INFO [train.py:996] (3/4) Epoch 2, batch 22650, loss[loss=0.2549, simple_loss=0.3133, pruned_loss=0.09825, over 21971.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3305, pruned_loss=0.1063, over 4282768.37 frames. ], batch size: 103, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:09:57,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.722e+02 3.113e+02 3.879e+02 6.383e+02, threshold=6.225e+02, percent-clipped=0.0
2023-06-19 09:10:15,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=318990.0, ans=0.0
2023-06-19 09:10:17,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=318990.0, ans=0.0
2023-06-19 09:11:07,740 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=15.0
2023-06-19 09:11:12,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=319110.0, ans=0.0
2023-06-19 09:11:29,474 INFO [train.py:996] (3/4) Epoch 2, batch 22700, loss[loss=0.2666, simple_loss=0.3101, pruned_loss=0.1116, over 20029.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3239, pruned_loss=0.106, over 4281693.07 frames. ], batch size: 703, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:11:41,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=319170.0, ans=0.125
2023-06-19 09:12:37,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=319350.0, ans=0.125
2023-06-19 09:13:17,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0
2023-06-19 09:13:36,603 INFO [train.py:996] (3/4) Epoch 2, batch 22750, loss[loss=0.3165, simple_loss=0.3667, pruned_loss=0.1331, over 21728.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.327, pruned_loss=0.1083, over 4279356.88 frames. ], batch size: 298, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:14:26,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.878e+02 3.333e+02 3.990e+02 8.279e+02, threshold=6.666e+02, percent-clipped=1.0
2023-06-19 09:14:41,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=319590.0, ans=0.125
2023-06-19 09:15:59,103 INFO [train.py:996] (3/4) Epoch 2, batch 22800, loss[loss=0.2659, simple_loss=0.3146, pruned_loss=0.1086, over 21580.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3327, pruned_loss=0.1118, over 4286260.38 frames. ], batch size: 548, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:15:59,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=319770.0, ans=0.125
2023-06-19 09:16:02,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=319770.0, ans=0.0
2023-06-19 09:18:06,540 INFO [train.py:996] (3/4) Epoch 2, batch 22850, loss[loss=0.3427, simple_loss=0.3735, pruned_loss=0.1559, over 21384.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3298, pruned_loss=0.1109, over 4289570.74 frames. ], batch size: 508, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:18:43,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 3.116e+02 3.903e+02 5.067e+02 7.447e+02, threshold=7.805e+02, percent-clipped=3.0
2023-06-19 09:19:40,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=320250.0, ans=0.125
2023-06-19 09:20:04,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.55 vs. limit=22.5
2023-06-19 09:20:32,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=320370.0, ans=0.0
2023-06-19 09:20:33,843 INFO [train.py:996] (3/4) Epoch 2, batch 22900, loss[loss=0.3157, simple_loss=0.4056, pruned_loss=0.1129, over 21613.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3317, pruned_loss=0.11, over 4272556.05 frames. ], batch size: 414, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:20:37,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=320370.0, ans=0.125
2023-06-19 09:20:56,915 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 09:21:21,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=320430.0, ans=10.0
2023-06-19 09:21:39,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0
2023-06-19 09:23:04,008 INFO [train.py:996] (3/4) Epoch 2, batch 22950, loss[loss=0.3171, simple_loss=0.4317, pruned_loss=0.1013, over 21592.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3453, pruned_loss=0.1076, over 4276705.84 frames. ], batch size: 414, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:23:18,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=320670.0, ans=0.125
2023-06-19 09:23:29,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.028e+02 3.874e+02 4.799e+02 7.839e+02, threshold=7.748e+02, percent-clipped=1.0
2023-06-19 09:23:52,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=320790.0, ans=0.1
2023-06-19 09:24:36,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=320850.0, ans=0.125
2023-06-19 09:24:42,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=320910.0, ans=0.05
2023-06-19 09:24:51,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=320910.0, ans=0.125
2023-06-19 09:24:59,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=320910.0, ans=0.0
2023-06-19 09:25:08,585 INFO [train.py:996] (3/4) Epoch 2, batch 23000, loss[loss=0.2864, simple_loss=0.335, pruned_loss=0.1189, over 21288.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3448, pruned_loss=0.105, over 4276870.20 frames. ], batch size: 143, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:25:24,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=320970.0, ans=0.2
2023-06-19 09:26:16,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=321090.0, ans=10.0
2023-06-19 09:26:21,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=321090.0, ans=0.09899494936611666
2023-06-19 09:27:03,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=321150.0, ans=0.0
2023-06-19 09:27:36,523 INFO [train.py:996] (3/4) Epoch 2, batch 23050, loss[loss=0.2917, simple_loss=0.3488, pruned_loss=0.1173, over 21504.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3476, pruned_loss=0.1094, over 4283378.32 frames. ], batch size: 194, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:27:55,682 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.939e+02 3.561e+02 4.565e+02 8.100e+02, threshold=7.122e+02, percent-clipped=1.0
2023-06-19 09:28:40,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=321390.0, ans=0.125
2023-06-19 09:29:35,909 INFO [train.py:996] (3/4) Epoch 2, batch 23100, loss[loss=0.2551, simple_loss=0.3119, pruned_loss=0.09918, over 21595.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3418, pruned_loss=0.1091, over 4286405.64 frames. ], batch size: 332, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:32:02,682 INFO [train.py:996] (3/4) Epoch 2, batch 23150, loss[loss=0.2928, simple_loss=0.348, pruned_loss=0.1188, over 21413.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.335, pruned_loss=0.108, over 4287923.31 frames. ], batch size: 143, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:32:21,382 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.749e+02 3.350e+02 4.019e+02 8.048e+02, threshold=6.700e+02, percent-clipped=1.0
2023-06-19 09:32:48,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0
2023-06-19 09:33:01,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=321990.0, ans=0.2
2023-06-19 09:33:31,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=322050.0, ans=0.0
2023-06-19 09:33:34,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=322050.0, ans=0.125
2023-06-19 09:33:36,125 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=15.0
2023-06-19 09:33:37,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=322050.0, ans=0.0
2023-06-19 09:34:08,027 INFO [train.py:996] (3/4) Epoch 2, batch 23200, loss[loss=0.3193, simple_loss=0.3781, pruned_loss=0.1302, over 20089.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3353, pruned_loss=0.1087, over 4288353.56 frames. ], batch size: 703, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:36:11,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=322410.0, ans=0.04949747468305833
2023-06-19 09:36:20,040 INFO [train.py:996] (3/4) Epoch 2, batch 23250, loss[loss=0.3172, simple_loss=0.3545, pruned_loss=0.14, over 21640.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3352, pruned_loss=0.1098, over 4295724.45 frames. ], batch size: 471, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:37:00,561 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.833e+02 3.294e+02 4.025e+02 6.311e+02, threshold=6.588e+02, percent-clipped=0.0
2023-06-19 09:37:18,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=322590.0, ans=0.125
2023-06-19 09:37:24,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=322590.0, ans=0.125
2023-06-19 09:38:14,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0
2023-06-19 09:38:32,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=322710.0, ans=0.125
2023-06-19 09:38:55,479 INFO [train.py:996] (3/4) Epoch 2, batch 23300, loss[loss=0.295, simple_loss=0.3869, pruned_loss=0.1016, over 21579.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3427, pruned_loss=0.1121, over 4296270.50 frames. ], batch size: 230, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:40:08,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322890.0, ans=0.1
2023-06-19 09:40:41,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.93 vs. limit=15.0
2023-06-19 09:40:54,650 INFO [train.py:996] (3/4) Epoch 2, batch 23350, loss[loss=0.2198, simple_loss=0.2897, pruned_loss=0.07491, over 21796.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3462, pruned_loss=0.1104, over 4290528.61 frames. ], batch size: 282, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 09:41:25,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=323070.0, ans=0.015
2023-06-19 09:41:38,032 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.727e+02 3.213e+02 3.768e+02 7.532e+02, threshold=6.426e+02, percent-clipped=1.0
2023-06-19 09:42:03,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=323190.0, ans=0.1
2023-06-19 09:42:54,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=323310.0, ans=0.125
2023-06-19 09:43:21,908 INFO [train.py:996] (3/4) Epoch 2, batch 23400, loss[loss=0.2149, simple_loss=0.2956, pruned_loss=0.06707, over 21616.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3381, pruned_loss=0.1054, over 4281622.77 frames.
], batch size: 389, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:43:57,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=323430.0, ans=0.0 2023-06-19 09:45:02,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=323550.0, ans=0.125 2023-06-19 09:45:08,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=323550.0, ans=0.0 2023-06-19 09:45:11,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=323550.0, ans=0.2 2023-06-19 09:45:17,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=323610.0, ans=0.07 2023-06-19 09:45:19,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-19 09:45:40,848 INFO [train.py:996] (3/4) Epoch 2, batch 23450, loss[loss=0.3045, simple_loss=0.3587, pruned_loss=0.1251, over 21939.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3371, pruned_loss=0.1065, over 4280255.10 frames. ], batch size: 372, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:46:22,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.895e+02 3.507e+02 4.209e+02 6.661e+02, threshold=7.013e+02, percent-clipped=1.0 2023-06-19 09:48:05,215 INFO [train.py:996] (3/4) Epoch 2, batch 23500, loss[loss=0.2744, simple_loss=0.3314, pruned_loss=0.1087, over 21835.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3395, pruned_loss=0.1098, over 4288835.85 frames. ], batch size: 298, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:49:32,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=324150.0, ans=0.125 2023-06-19 09:49:40,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=324210.0, ans=0.0 2023-06-19 09:50:09,936 INFO [train.py:996] (3/4) Epoch 2, batch 23550, loss[loss=0.2599, simple_loss=0.3044, pruned_loss=0.1077, over 21218.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3358, pruned_loss=0.1095, over 4280130.32 frames. ], batch size: 159, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:50:32,896 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.771e+02 3.237e+02 3.882e+02 7.021e+02, threshold=6.473e+02, percent-clipped=1.0 2023-06-19 09:51:15,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=324390.0, ans=0.125 2023-06-19 09:52:12,595 INFO [train.py:996] (3/4) Epoch 2, batch 23600, loss[loss=0.3079, simple_loss=0.3632, pruned_loss=0.1263, over 21686.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3349, pruned_loss=0.1094, over 4276336.79 frames. 
], batch size: 351, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:52:51,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=324630.0, ans=0.0 2023-06-19 09:53:51,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=324750.0, ans=0.025 2023-06-19 09:54:31,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=324810.0, ans=0.0 2023-06-19 09:54:36,390 INFO [train.py:996] (3/4) Epoch 2, batch 23650, loss[loss=0.2319, simple_loss=0.3091, pruned_loss=0.0773, over 21592.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3351, pruned_loss=0.1073, over 4282047.00 frames. ], batch size: 263, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:55:07,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=324870.0, ans=0.125 2023-06-19 09:55:19,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=324930.0, ans=0.0 2023-06-19 09:55:30,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.894e+02 3.663e+02 5.031e+02 1.050e+03, threshold=7.326e+02, percent-clipped=9.0 2023-06-19 09:55:36,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=324930.0, ans=0.125 2023-06-19 09:57:18,212 INFO [train.py:996] (3/4) Epoch 2, batch 23700, loss[loss=0.1761, simple_loss=0.239, pruned_loss=0.05659, over 17141.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.339, pruned_loss=0.1073, over 4268925.99 frames. ], batch size: 63, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:57:53,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=325230.0, ans=0.05 2023-06-19 09:57:55,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-19 09:58:38,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=325290.0, ans=0.125 2023-06-19 09:58:39,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325350.0, ans=0.1 2023-06-19 09:58:45,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=325350.0, ans=0.95 2023-06-19 09:58:47,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=325350.0, ans=0.0 2023-06-19 09:58:49,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=325350.0, ans=0.0 2023-06-19 09:59:10,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.71 vs. limit=15.0 2023-06-19 09:59:18,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=325410.0, ans=0.125 2023-06-19 09:59:30,276 INFO [train.py:996] (3/4) Epoch 2, batch 23750, loss[loss=0.2161, simple_loss=0.3079, pruned_loss=0.06213, over 21417.00 frames. 
], tot_loss[loss=0.2785, simple_loss=0.3406, pruned_loss=0.1082, over 4273265.84 frames. ], batch size: 194, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:59:49,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=325470.0, ans=0.125 2023-06-19 10:00:13,667 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.910e+02 3.446e+02 4.097e+02 6.175e+02, threshold=6.892e+02, percent-clipped=0.0 2023-06-19 10:00:18,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=325530.0, ans=0.125 2023-06-19 10:00:31,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=325590.0, ans=0.125 2023-06-19 10:00:39,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=325590.0, ans=0.125 2023-06-19 10:00:51,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=325650.0, ans=0.0 2023-06-19 10:00:58,364 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-19 10:01:31,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=325710.0, ans=0.125 2023-06-19 10:01:37,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.60 vs. limit=6.0 2023-06-19 10:01:38,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=325710.0, ans=0.125 2023-06-19 10:01:52,692 INFO [train.py:996] (3/4) Epoch 2, batch 23800, loss[loss=0.274, simple_loss=0.3625, pruned_loss=0.09275, over 21674.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3388, pruned_loss=0.1052, over 4272309.39 frames. ], batch size: 247, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:02:02,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=325770.0, ans=0.125 2023-06-19 10:02:08,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=325830.0, ans=0.2 2023-06-19 10:02:21,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=325830.0, ans=0.125 2023-06-19 10:02:29,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=325830.0, ans=0.0 2023-06-19 10:03:31,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325950.0, ans=0.1 2023-06-19 10:03:58,554 INFO [train.py:996] (3/4) Epoch 2, batch 23850, loss[loss=0.3371, simple_loss=0.4064, pruned_loss=0.1339, over 21598.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3506, pruned_loss=0.1097, over 4277760.01 frames. 
], batch size: 414, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:04:25,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=326070.0, ans=0.125 2023-06-19 10:04:41,599 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 2.956e+02 3.514e+02 4.313e+02 7.177e+02, threshold=7.028e+02, percent-clipped=1.0 2023-06-19 10:04:53,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=326130.0, ans=0.0 2023-06-19 10:04:59,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-19 10:05:28,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=326190.0, ans=0.125 2023-06-19 10:06:20,019 INFO [train.py:996] (3/4) Epoch 2, batch 23900, loss[loss=0.305, simple_loss=0.3618, pruned_loss=0.1241, over 21995.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3566, pruned_loss=0.1122, over 4270549.53 frames. ], batch size: 103, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:06:20,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=326370.0, ans=0.0 2023-06-19 10:06:21,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=326370.0, ans=12.0 2023-06-19 10:07:17,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=326490.0, ans=0.2 2023-06-19 10:07:44,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=326550.0, ans=0.0 2023-06-19 10:08:19,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=326610.0, ans=0.04949747468305833 2023-06-19 10:08:27,712 INFO [train.py:996] (3/4) Epoch 2, batch 23950, loss[loss=0.2714, simple_loss=0.3335, pruned_loss=0.1046, over 21756.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3498, pruned_loss=0.1114, over 4266831.44 frames. ], batch size: 282, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:08:46,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=326670.0, ans=0.125 2023-06-19 10:08:46,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=326670.0, ans=0.0 2023-06-19 10:09:19,304 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.788e+02 3.140e+02 3.660e+02 5.549e+02, threshold=6.280e+02, percent-clipped=0.0 2023-06-19 10:10:30,350 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=15.0 2023-06-19 10:10:45,250 INFO [train.py:996] (3/4) Epoch 2, batch 24000, loss[loss=0.3385, simple_loss=0.3911, pruned_loss=0.1429, over 21688.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3505, pruned_loss=0.1139, over 4271879.45 frames. ], batch size: 351, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:10:45,251 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 10:11:36,136 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2838, simple_loss=0.3817, pruned_loss=0.09297, over 1796401.00 frames. 
2023-06-19 10:11:36,137 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-19 10:12:11,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=327030.0, ans=0.09899494936611666 2023-06-19 10:12:26,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=327090.0, ans=0.125 2023-06-19 10:13:39,996 INFO [train.py:996] (3/4) Epoch 2, batch 24050, loss[loss=0.2648, simple_loss=0.3421, pruned_loss=0.09376, over 21627.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3517, pruned_loss=0.1141, over 4274235.26 frames. ], batch size: 414, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:14:14,498 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 2.893e+02 3.313e+02 4.221e+02 6.916e+02, threshold=6.626e+02, percent-clipped=2.0 2023-06-19 10:14:45,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=327390.0, ans=0.0 2023-06-19 10:15:41,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=327510.0, ans=0.125 2023-06-19 10:15:58,854 INFO [train.py:996] (3/4) Epoch 2, batch 24100, loss[loss=0.3343, simple_loss=0.3882, pruned_loss=0.1402, over 21178.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.351, pruned_loss=0.1114, over 4258082.54 frames. ], batch size: 143, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:16:19,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=327570.0, ans=0.125 2023-06-19 10:16:45,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=327690.0, ans=0.2 2023-06-19 10:18:09,448 INFO [train.py:996] (3/4) Epoch 2, batch 24150, loss[loss=0.2898, simple_loss=0.3419, pruned_loss=0.1188, over 21589.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3505, pruned_loss=0.1132, over 4273694.26 frames. ], batch size: 212, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:18:48,472 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.793e+02 3.020e+02 3.796e+02 7.586e+02, threshold=6.040e+02, percent-clipped=1.0 2023-06-19 10:19:10,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=327990.0, ans=0.125 2023-06-19 10:19:16,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-06-19 10:19:41,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=328050.0, ans=0.2 2023-06-19 10:20:38,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=328170.0, ans=0.125 2023-06-19 10:20:39,214 INFO [train.py:996] (3/4) Epoch 2, batch 24200, loss[loss=0.2545, simple_loss=0.3209, pruned_loss=0.09404, over 21184.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3532, pruned_loss=0.1151, over 4279345.60 frames. 
], batch size: 176, lr: 1.51e-02, grad_scale: 16.0 2023-06-19 10:20:54,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=328170.0, ans=0.125 2023-06-19 10:21:06,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-19 10:21:07,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=328230.0, ans=0.0 2023-06-19 10:22:37,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=328410.0, ans=0.125 2023-06-19 10:22:59,857 INFO [train.py:996] (3/4) Epoch 2, batch 24250, loss[loss=0.2276, simple_loss=0.3203, pruned_loss=0.06748, over 21747.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3475, pruned_loss=0.1059, over 4277889.28 frames. ], batch size: 298, lr: 1.51e-02, grad_scale: 16.0 2023-06-19 10:23:03,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=328470.0, ans=0.1 2023-06-19 10:23:03,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=328470.0, ans=0.125 2023-06-19 10:23:43,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=15.0 2023-06-19 10:23:44,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.606e+02 2.962e+02 3.428e+02 4.906e+02, threshold=5.923e+02, percent-clipped=0.0 2023-06-19 10:23:51,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=328590.0, ans=0.125 2023-06-19 10:24:05,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=328590.0, ans=0.125 2023-06-19 10:24:05,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=328590.0, ans=0.0 2023-06-19 10:25:23,882 INFO [train.py:996] (3/4) Epoch 2, batch 24300, loss[loss=0.2233, simple_loss=0.2872, pruned_loss=0.0797, over 21266.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3383, pruned_loss=0.09811, over 4279934.74 frames. ], batch size: 159, lr: 1.51e-02, grad_scale: 16.0 2023-06-19 10:25:34,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=328770.0, ans=0.125 2023-06-19 10:26:31,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2023-06-19 10:26:46,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=328950.0, ans=0.1 2023-06-19 10:27:36,327 INFO [train.py:996] (3/4) Epoch 2, batch 24350, loss[loss=0.3144, simple_loss=0.3639, pruned_loss=0.1325, over 21409.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3347, pruned_loss=0.09955, over 4278254.59 frames. ], batch size: 548, lr: 1.51e-02, grad_scale: 16.0 2023-06-19 10:27:46,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. 
limit=15.0 2023-06-19 10:28:15,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.020e+02 2.750e+02 3.153e+02 3.822e+02 5.866e+02, threshold=6.306e+02, percent-clipped=0.0 2023-06-19 10:28:38,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=329190.0, ans=0.09899494936611666 2023-06-19 10:29:17,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=329250.0, ans=0.0 2023-06-19 10:29:17,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=329250.0, ans=0.1 2023-06-19 10:29:30,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-19 10:29:42,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=329310.0, ans=0.125 2023-06-19 10:29:45,925 INFO [train.py:996] (3/4) Epoch 2, batch 24400, loss[loss=0.274, simple_loss=0.3485, pruned_loss=0.0998, over 21183.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3406, pruned_loss=0.104, over 4278480.88 frames. ], batch size: 143, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:30:47,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=329490.0, ans=0.5 2023-06-19 10:31:13,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=329550.0, ans=0.125 2023-06-19 10:31:55,270 INFO [train.py:996] (3/4) Epoch 2, batch 24450, loss[loss=0.2598, simple_loss=0.3477, pruned_loss=0.08593, over 21751.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3428, pruned_loss=0.1061, over 4283909.77 frames. ], batch size: 332, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:32:18,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0 2023-06-19 10:32:19,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=329730.0, ans=0.125 2023-06-19 10:32:36,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.674e+02 3.123e+02 3.615e+02 7.708e+02, threshold=6.247e+02, percent-clipped=3.0 2023-06-19 10:33:53,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=329910.0, ans=0.125 2023-06-19 10:34:09,713 INFO [train.py:996] (3/4) Epoch 2, batch 24500, loss[loss=0.268, simple_loss=0.3314, pruned_loss=0.1023, over 21866.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3437, pruned_loss=0.106, over 4285847.37 frames. ], batch size: 351, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:34:58,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330030.0, ans=0.1 2023-06-19 10:35:03,704 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-19 10:35:47,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. 
limit=15.0 2023-06-19 10:36:19,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=330210.0, ans=0.125 2023-06-19 10:36:35,175 INFO [train.py:996] (3/4) Epoch 2, batch 24550, loss[loss=0.3081, simple_loss=0.371, pruned_loss=0.1226, over 21835.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3469, pruned_loss=0.109, over 4288986.32 frames. ], batch size: 282, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:36:47,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=330270.0, ans=0.1 2023-06-19 10:36:51,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=330270.0, ans=0.02 2023-06-19 10:37:13,335 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.779e+02 3.400e+02 4.076e+02 5.829e+02, threshold=6.799e+02, percent-clipped=0.0 2023-06-19 10:37:59,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=15.0 2023-06-19 10:38:04,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=330450.0, ans=0.0 2023-06-19 10:38:43,063 INFO [train.py:996] (3/4) Epoch 2, batch 24600, loss[loss=0.2552, simple_loss=0.2883, pruned_loss=0.111, over 20698.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3415, pruned_loss=0.1094, over 4275242.78 frames. ], batch size: 607, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:39:55,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=17.00 vs. limit=15.0 2023-06-19 10:39:57,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5 2023-06-19 10:40:46,683 INFO [train.py:996] (3/4) Epoch 2, batch 24650, loss[loss=0.2129, simple_loss=0.2642, pruned_loss=0.08083, over 21464.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3316, pruned_loss=0.1069, over 4267612.01 frames. 
], batch size: 195, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:40:57,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=330870.0, ans=0.5 2023-06-19 10:41:08,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=330870.0, ans=0.125 2023-06-19 10:41:19,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=330930.0, ans=0.125 2023-06-19 10:41:27,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.833e+02 3.466e+02 4.292e+02 5.874e+02, threshold=6.932e+02, percent-clipped=0.0 2023-06-19 10:41:35,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=330990.0, ans=0.2 2023-06-19 10:41:58,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=330990.0, ans=0.0 2023-06-19 10:42:00,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=330990.0, ans=0.0 2023-06-19 10:43:00,292 INFO [train.py:996] (3/4) Epoch 2, batch 24700, loss[loss=0.2581, simple_loss=0.3238, pruned_loss=0.09622, over 21823.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3299, pruned_loss=0.1042, over 4275884.04 frames. ], batch size: 372, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:43:56,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=331290.0, ans=0.0 2023-06-19 10:45:02,940 INFO [train.py:996] (3/4) Epoch 2, batch 24750, loss[loss=0.2039, simple_loss=0.2644, pruned_loss=0.07172, over 21508.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3238, pruned_loss=0.1016, over 4271760.27 frames. ], batch size: 230, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:45:07,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=331470.0, ans=0.125 2023-06-19 10:45:07,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=331470.0, ans=0.125 2023-06-19 10:45:27,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=331530.0, ans=0.125 2023-06-19 10:45:40,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.432e+02 2.874e+02 3.533e+02 8.125e+02, threshold=5.749e+02, percent-clipped=3.0 2023-06-19 10:46:06,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=331590.0, ans=0.125 2023-06-19 10:46:32,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=331650.0, ans=0.125 2023-06-19 10:46:50,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=331710.0, ans=0.1 2023-06-19 10:47:10,097 INFO [train.py:996] (3/4) Epoch 2, batch 24800, loss[loss=0.255, simple_loss=0.3055, pruned_loss=0.1022, over 21436.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3194, pruned_loss=0.1004, over 4260802.90 frames. 
], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:48:41,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-19 10:49:33,404 INFO [train.py:996] (3/4) Epoch 2, batch 24850, loss[loss=0.2821, simple_loss=0.3481, pruned_loss=0.1081, over 21702.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3206, pruned_loss=0.1021, over 4269536.22 frames. ], batch size: 389, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:49:58,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2023-06-19 10:50:04,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-19 10:50:04,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=332130.0, ans=15.0 2023-06-19 10:50:12,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=332130.0, ans=0.05 2023-06-19 10:50:13,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.964e+02 3.527e+02 4.123e+02 6.284e+02, threshold=7.055e+02, percent-clipped=5.0 2023-06-19 10:50:43,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332190.0, ans=0.1 2023-06-19 10:51:55,285 INFO [train.py:996] (3/4) Epoch 2, batch 24900, loss[loss=0.3064, simple_loss=0.3603, pruned_loss=0.1263, over 21348.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3218, pruned_loss=0.1022, over 4274166.33 frames. ], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:53:06,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=332490.0, ans=0.125 2023-06-19 10:53:33,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=332550.0, ans=0.125 2023-06-19 10:53:35,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-19 10:53:45,441 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:54:13,020 INFO [train.py:996] (3/4) Epoch 2, batch 24950, loss[loss=0.3098, simple_loss=0.3582, pruned_loss=0.1307, over 21796.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3332, pruned_loss=0.1092, over 4279150.72 frames. 
], batch size: 282, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:54:25,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=332670.0, ans=0.125 2023-06-19 10:54:54,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.083e+02 3.929e+02 5.227e+02 8.953e+02, threshold=7.858e+02, percent-clipped=7.0 2023-06-19 10:55:09,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=332790.0, ans=0.125 2023-06-19 10:55:51,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=332850.0, ans=0.2 2023-06-19 10:56:36,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=332970.0, ans=0.0 2023-06-19 10:56:37,136 INFO [train.py:996] (3/4) Epoch 2, batch 25000, loss[loss=0.3133, simple_loss=0.3441, pruned_loss=0.1413, over 21282.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3414, pruned_loss=0.112, over 4274183.15 frames. ], batch size: 507, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:56:37,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=332970.0, ans=0.125 2023-06-19 10:56:55,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=332970.0, ans=0.125 2023-06-19 10:56:58,109 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-19 10:57:26,165 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=12.0 2023-06-19 10:57:37,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=333150.0, ans=0.04949747468305833 2023-06-19 10:58:02,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333210.0, ans=0.1 2023-06-19 10:58:04,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-19 10:58:20,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-19 10:58:31,334 INFO [train.py:996] (3/4) Epoch 2, batch 25050, loss[loss=0.2598, simple_loss=0.312, pruned_loss=0.1038, over 21711.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3339, pruned_loss=0.1104, over 4263576.99 frames. ], batch size: 124, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:59:09,115 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.713e+02 2.983e+02 3.396e+02 5.843e+02, threshold=5.966e+02, percent-clipped=0.0 2023-06-19 10:59:23,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=333390.0, ans=15.0 2023-06-19 11:00:27,916 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-19 11:00:30,859 INFO [train.py:996] (3/4) Epoch 2, batch 25100, loss[loss=0.262, simple_loss=0.3132, pruned_loss=0.1055, over 21802.00 frames. 
], tot_loss[loss=0.2728, simple_loss=0.3281, pruned_loss=0.1088, over 4272816.52 frames. ], batch size: 98, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:00:34,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=333570.0, ans=0.0 2023-06-19 11:01:17,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=333630.0, ans=0.125 2023-06-19 11:01:36,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=333690.0, ans=0.0 2023-06-19 11:02:36,048 INFO [train.py:996] (3/4) Epoch 2, batch 25150, loss[loss=0.264, simple_loss=0.3358, pruned_loss=0.0961, over 21336.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3306, pruned_loss=0.1062, over 4272642.61 frames. ], batch size: 176, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:03:12,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 2.528e+02 3.068e+02 3.908e+02 5.226e+02, threshold=6.135e+02, percent-clipped=0.0 2023-06-19 11:03:16,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=333930.0, ans=0.1 2023-06-19 11:03:49,585 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:04:07,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=334050.0, ans=0.125 2023-06-19 11:04:11,559 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:04:17,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=334110.0, ans=0.0 2023-06-19 11:04:47,133 INFO [train.py:996] (3/4) Epoch 2, batch 25200, loss[loss=0.2499, simple_loss=0.307, pruned_loss=0.09639, over 20010.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3296, pruned_loss=0.1039, over 4282066.94 frames. ], batch size: 702, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:05:17,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=334230.0, ans=0.125 2023-06-19 11:06:38,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=334410.0, ans=0.125 2023-06-19 11:06:47,737 INFO [train.py:996] (3/4) Epoch 2, batch 25250, loss[loss=0.209, simple_loss=0.27, pruned_loss=0.07398, over 16838.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3266, pruned_loss=0.1017, over 4276421.69 frames. ], batch size: 62, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:07:20,669 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.794e+02 3.264e+02 4.171e+02 6.058e+02, threshold=6.527e+02, percent-clipped=0.0 2023-06-19 11:07:28,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-19 11:07:28,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=15.0 2023-06-19 11:08:08,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=334650.0, ans=0.2 2023-06-19 11:08:24,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=334710.0, ans=0.0 2023-06-19 11:09:01,263 INFO [train.py:996] (3/4) Epoch 2, batch 25300, loss[loss=0.3254, simple_loss=0.3819, pruned_loss=0.1345, over 21726.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3248, pruned_loss=0.1012, over 4261745.05 frames. ], batch size: 441, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:09:39,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=334830.0, ans=0.1 2023-06-19 11:09:50,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=334830.0, ans=0.1 2023-06-19 11:10:20,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=334950.0, ans=0.1 2023-06-19 11:11:20,604 INFO [train.py:996] (3/4) Epoch 2, batch 25350, loss[loss=0.2123, simple_loss=0.2921, pruned_loss=0.06629, over 21676.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3286, pruned_loss=0.1017, over 4258056.90 frames. ], batch size: 298, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:11:52,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.646e+02 3.240e+02 3.926e+02 6.087e+02, threshold=6.480e+02, percent-clipped=0.0 2023-06-19 11:12:04,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=335130.0, ans=0.125 2023-06-19 11:12:13,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=335190.0, ans=0.125 2023-06-19 11:12:17,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=335190.0, ans=0.2 2023-06-19 11:13:03,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335310.0, ans=0.1 2023-06-19 11:13:17,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-19 11:13:19,513 INFO [train.py:996] (3/4) Epoch 2, batch 25400, loss[loss=0.2851, simple_loss=0.3358, pruned_loss=0.1172, over 21323.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3235, pruned_loss=0.1005, over 4251709.03 frames. ], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:14:19,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=335490.0, ans=0.125 2023-06-19 11:14:30,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. 
limit=15.0 2023-06-19 11:14:55,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335550.0, ans=0.1 2023-06-19 11:15:10,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=335610.0, ans=10.0 2023-06-19 11:15:12,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=335610.0, ans=0.125 2023-06-19 11:15:31,560 INFO [train.py:996] (3/4) Epoch 2, batch 25450, loss[loss=0.2444, simple_loss=0.3314, pruned_loss=0.07863, over 21539.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3255, pruned_loss=0.1033, over 4252699.70 frames. ], batch size: 195, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:15:49,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-19 11:16:04,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.784e+02 3.356e+02 3.897e+02 6.428e+02, threshold=6.713e+02, percent-clipped=0.0 2023-06-19 11:16:11,474 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:16:19,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-19 11:17:52,955 INFO [train.py:996] (3/4) Epoch 2, batch 25500, loss[loss=0.2484, simple_loss=0.3359, pruned_loss=0.08043, over 21667.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3254, pruned_loss=0.09919, over 4259909.25 frames. ], batch size: 389, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:17:54,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=335970.0, ans=0.125 2023-06-19 11:18:44,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.03 vs. limit=15.0 2023-06-19 11:18:59,489 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:20:10,338 INFO [train.py:996] (3/4) Epoch 2, batch 25550, loss[loss=0.2977, simple_loss=0.3459, pruned_loss=0.1248, over 19989.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.331, pruned_loss=0.09893, over 4248803.32 frames. ], batch size: 704, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:20:46,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.372e+02 2.803e+02 3.273e+02 5.075e+02, threshold=5.607e+02, percent-clipped=0.0 2023-06-19 11:22:31,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=336570.0, ans=0.1 2023-06-19 11:22:43,092 INFO [train.py:996] (3/4) Epoch 2, batch 25600, loss[loss=0.3509, simple_loss=0.418, pruned_loss=0.1419, over 21788.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3349, pruned_loss=0.1002, over 4251170.28 frames. 
], batch size: 118, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:23:01,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=336570.0, ans=0.125 2023-06-19 11:23:16,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=336630.0, ans=0.05 2023-06-19 11:23:50,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=336750.0, ans=0.0 2023-06-19 11:24:46,825 INFO [train.py:996] (3/4) Epoch 2, batch 25650, loss[loss=0.2602, simple_loss=0.3157, pruned_loss=0.1023, over 21446.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3361, pruned_loss=0.1033, over 4260177.31 frames. ], batch size: 389, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:25:00,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=336870.0, ans=0.2 2023-06-19 11:25:04,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=336870.0, ans=0.125 2023-06-19 11:25:04,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=336870.0, ans=0.125 2023-06-19 11:25:05,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-19 11:25:07,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=336930.0, ans=0.125 2023-06-19 11:25:13,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.891e+02 3.339e+02 4.051e+02 5.669e+02, threshold=6.678e+02, percent-clipped=1.0 2023-06-19 11:26:48,354 INFO [train.py:996] (3/4) Epoch 2, batch 25700, loss[loss=0.2629, simple_loss=0.3331, pruned_loss=0.09638, over 21166.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3334, pruned_loss=0.1044, over 4257142.05 frames. ], batch size: 143, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:28:33,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=337350.0, ans=0.2 2023-06-19 11:29:14,402 INFO [train.py:996] (3/4) Epoch 2, batch 25750, loss[loss=0.3866, simple_loss=0.4173, pruned_loss=0.178, over 21355.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3404, pruned_loss=0.1086, over 4265361.23 frames. ], batch size: 507, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:29:33,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.02 vs. limit=22.5 2023-06-19 11:30:07,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.884e+02 3.379e+02 4.155e+02 6.943e+02, threshold=6.757e+02, percent-clipped=1.0 2023-06-19 11:30:34,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=337590.0, ans=0.2 2023-06-19 11:30:57,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.74 vs. 
limit=10.0 2023-06-19 11:31:29,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=337710.0, ans=0.0 2023-06-19 11:31:41,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=337710.0, ans=15.0 2023-06-19 11:31:43,382 INFO [train.py:996] (3/4) Epoch 2, batch 25800, loss[loss=0.2937, simple_loss=0.3538, pruned_loss=0.1169, over 21610.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3558, pruned_loss=0.115, over 4268162.21 frames. ], batch size: 263, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:32:28,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=337830.0, ans=0.0 2023-06-19 11:33:01,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=337890.0, ans=0.125 2023-06-19 11:33:22,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337890.0, ans=0.1 2023-06-19 11:33:47,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=337950.0, ans=0.125 2023-06-19 11:34:12,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=338010.0, ans=0.07 2023-06-19 11:34:19,980 INFO [train.py:996] (3/4) Epoch 2, batch 25850, loss[loss=0.2923, simple_loss=0.3509, pruned_loss=0.1168, over 21872.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.356, pruned_loss=0.1131, over 4276191.59 frames. ], batch size: 414, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:35:07,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.249e+02 2.822e+02 3.088e+02 3.805e+02 7.433e+02, threshold=6.176e+02, percent-clipped=1.0 2023-06-19 11:35:40,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=338190.0, ans=0.015 2023-06-19 11:35:40,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=338190.0, ans=0.2 2023-06-19 11:35:45,203 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.27 vs. limit=10.0 2023-06-19 11:36:46,146 INFO [train.py:996] (3/4) Epoch 2, batch 25900, loss[loss=0.2699, simple_loss=0.3416, pruned_loss=0.09911, over 21249.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3574, pruned_loss=0.1143, over 4282797.62 frames. 
], batch size: 143, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:36:57,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=338370.0, ans=0.015 2023-06-19 11:37:12,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=338430.0, ans=0.125 2023-06-19 11:37:24,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=338430.0, ans=0.125 2023-06-19 11:37:26,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=338430.0, ans=0.125 2023-06-19 11:37:52,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=338490.0, ans=0.125 2023-06-19 11:39:03,311 INFO [train.py:996] (3/4) Epoch 2, batch 25950, loss[loss=0.3094, simple_loss=0.3564, pruned_loss=0.1312, over 21318.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3623, pruned_loss=0.1167, over 4287259.75 frames. ], batch size: 159, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:39:03,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=338670.0, ans=0.015 2023-06-19 11:39:03,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=338670.0, ans=0.125 2023-06-19 11:39:16,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=338670.0, ans=0.125 2023-06-19 11:39:30,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=338730.0, ans=0.125 2023-06-19 11:39:36,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.799e+02 3.323e+02 3.883e+02 7.298e+02, threshold=6.645e+02, percent-clipped=1.0 2023-06-19 11:39:37,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.73 vs. limit=15.0 2023-06-19 11:39:53,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=338790.0, ans=0.125 2023-06-19 11:40:29,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-19 11:41:01,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=338850.0, ans=0.07 2023-06-19 11:41:30,341 INFO [train.py:996] (3/4) Epoch 2, batch 26000, loss[loss=0.3489, simple_loss=0.4039, pruned_loss=0.1469, over 21740.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3613, pruned_loss=0.1146, over 4288031.30 frames. ], batch size: 441, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:41:33,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338970.0, ans=0.1 2023-06-19 11:41:50,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.37 vs. 
limit=15.0 2023-06-19 11:41:58,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=339030.0, ans=0.125 2023-06-19 11:41:58,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=339030.0, ans=0.125 2023-06-19 11:42:35,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=339090.0, ans=0.1 2023-06-19 11:42:38,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=339090.0, ans=0.5 2023-06-19 11:42:45,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=339150.0, ans=0.015 2023-06-19 11:43:20,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=339150.0, ans=0.2 2023-06-19 11:43:43,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=12.0 2023-06-19 11:43:47,049 INFO [train.py:996] (3/4) Epoch 2, batch 26050, loss[loss=0.2579, simple_loss=0.3036, pruned_loss=0.1061, over 21058.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3614, pruned_loss=0.1161, over 4281175.83 frames. ], batch size: 608, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:44:17,721 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.721e+02 3.311e+02 3.972e+02 6.601e+02, threshold=6.622e+02, percent-clipped=0.0 2023-06-19 11:45:15,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=339450.0, ans=0.07 2023-06-19 11:45:55,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=339510.0, ans=0.0 2023-06-19 11:46:04,718 INFO [train.py:996] (3/4) Epoch 2, batch 26100, loss[loss=0.2699, simple_loss=0.3202, pruned_loss=0.1098, over 21366.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.3559, pruned_loss=0.1154, over 4283117.80 frames. ], batch size: 144, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:46:16,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=22.5 2023-06-19 11:46:46,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=339630.0, ans=0.0 2023-06-19 11:47:38,939 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:47:40,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-06-19 11:47:50,970 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-19 11:47:58,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.02 vs. limit=22.5 2023-06-19 11:48:33,525 INFO [train.py:996] (3/4) Epoch 2, batch 26150, loss[loss=0.3088, simple_loss=0.3637, pruned_loss=0.127, over 21750.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3555, pruned_loss=0.1163, over 4285023.06 frames. 
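The recurring Whitening: name=..., metric=M vs. limit=L entries come from regularizers that push a module's activations toward a white (identity-like) channel covariance; a line is printed when the measured metric exceeds its limit. A sketch of one plausible whiteness statistic, assuming it is d * sum(eig^2) / sum(eig)^2 over the per-group channel covariance, which equals 1.0 exactly when the covariance is a multiple of the identity and grows as energy concentrates in a few directions (the actual scaling.py may compute it differently):

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    # x: (num_frames, num_channels) activations, with channels split into
    # groups as in the log lines (num_groups=1, num_channels=256, etc.).
    n, c = x.shape
    d = c // num_groups
    xg = x.reshape(n, num_groups, d).transpose(0, 1)    # (G, N, d)
    cov = torch.matmul(xg.transpose(1, 2), xg) / n      # (G, d, d)
    sum_eig_sq = (cov * cov).sum(dim=(1, 2))            # trace(cov @ cov)
    sum_eig = cov.diagonal(dim1=1, dim2=2).sum(dim=-1)  # trace(cov)
    # d * sum(eig^2) / sum(eig)^2 == 1.0 iff cov is a multiple of I.
    return (d * sum_eig_sq / sum_eig ** 2).mean()
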
], batch size: 441, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:49:14,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.871e+02 3.280e+02 4.154e+02 8.858e+02, threshold=6.561e+02, percent-clipped=1.0 2023-06-19 11:49:23,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=339990.0, ans=0.125 2023-06-19 11:50:37,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=22.5 2023-06-19 11:50:47,963 INFO [train.py:996] (3/4) Epoch 2, batch 26200, loss[loss=0.281, simple_loss=0.3634, pruned_loss=0.09924, over 21439.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3571, pruned_loss=0.1146, over 4285884.52 frames. ], batch size: 211, lr: 1.49e-02, grad_scale: 64.0 2023-06-19 11:51:40,959 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5 2023-06-19 11:53:01,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=340410.0, ans=0.0 2023-06-19 11:53:25,758 INFO [train.py:996] (3/4) Epoch 2, batch 26250, loss[loss=0.2849, simple_loss=0.3426, pruned_loss=0.1136, over 21320.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3572, pruned_loss=0.1122, over 4278205.08 frames. ], batch size: 176, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:53:32,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=340470.0, ans=0.05 2023-06-19 11:53:40,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-19 11:54:03,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.625e+02 2.924e+02 3.538e+02 7.396e+02, threshold=5.848e+02, percent-clipped=2.0 2023-06-19 11:54:10,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340530.0, ans=0.1 2023-06-19 11:54:11,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=340590.0, ans=0.0 2023-06-19 11:54:46,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=340590.0, ans=0.125 2023-06-19 11:55:30,359 INFO [train.py:996] (3/4) Epoch 2, batch 26300, loss[loss=0.3003, simple_loss=0.3567, pruned_loss=0.1219, over 21762.00 frames. ], tot_loss[loss=0.29, simple_loss=0.354, pruned_loss=0.113, over 4283511.04 frames. 
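The ScheduledFloat: name=..., batch_count=..., ans=... entries show hyperparameters (skip rates, balancer probabilities, scale floors) that are functions of the global batch count rather than constants, which is why the same name reappears with different ans values as training advances. A minimal piecewise-linear version, offered as an assumption about the behavior rather than the scaling.py implementation:

class ScheduledFloat:
    # A float hyperparameter interpolated linearly between
    # (batch_count, value) breakpoints and clamped at both ends.
    def __init__(self, *points):
        self.points = sorted(points)   # e.g. (0.0, 0.5), (20000.0, 0.2)
        self.batch_count = 0.0

    def __float__(self):
        pts = self.points
        if self.batch_count <= pts[0][0]:
            return float(pts[0][1])
        if self.batch_count >= pts[-1][0]:
            return float(pts[-1][1])
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= self.batch_count <= x1:
                t = (self.batch_count - x0) / (x1 - x0)
                return float(y0 + t * (y1 - y0))

With illustrative breakpoints (0.0, 0.5) and (20000.0, 0.2), setting batch_count = 336870.0 yields float(sf) == 0.2, matching the flat ans=0.2 readings above once the batch count is far past the last breakpoint.
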
], batch size: 112, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:55:53,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=340770.0, ans=0.125 2023-06-19 11:56:39,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=340830.0, ans=0.125 2023-06-19 11:56:39,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=340830.0, ans=0.125 2023-06-19 11:56:43,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=340830.0, ans=0.0 2023-06-19 11:58:06,959 INFO [train.py:996] (3/4) Epoch 2, batch 26350, loss[loss=0.2845, simple_loss=0.3444, pruned_loss=0.1123, over 21889.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3522, pruned_loss=0.1136, over 4283654.01 frames. ], batch size: 316, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 11:58:24,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-19 11:58:30,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=341070.0, ans=0.125 2023-06-19 11:58:57,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.087e+02 3.544e+02 4.218e+02 8.701e+02, threshold=7.087e+02, percent-clipped=9.0 2023-06-19 11:59:19,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-06-19 11:59:43,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.98 vs. limit=22.5 2023-06-19 12:00:00,726 INFO [train.py:996] (3/4) Epoch 2, batch 26400, loss[loss=0.2611, simple_loss=0.3036, pruned_loss=0.1093, over 21783.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3449, pruned_loss=0.1134, over 4284711.56 frames. ], batch size: 317, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:00:04,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=341370.0, ans=0.125 2023-06-19 12:00:05,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=341370.0, ans=0.2 2023-06-19 12:00:07,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=341370.0, ans=0.125 2023-06-19 12:02:27,436 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-19 12:02:30,918 INFO [train.py:996] (3/4) Epoch 2, batch 26450, loss[loss=0.312, simple_loss=0.3945, pruned_loss=0.1148, over 21841.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3451, pruned_loss=0.113, over 4272550.96 frames. 
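In the Clipping_scale=2.0, grad-norm quartiles ... records, the five numbers are the min/25%/median/75%/max of recently observed gradient norms, and in every record above the reported threshold is, within rounding, 2.0 times the logged median (e.g. 2 x 3.544e+02 = 7.088e+02 against threshold=7.087e+02); percent-clipped is the fraction of recent steps whose norm exceeded that adaptive threshold. A sketch of such a median-relative clipper; the window size and update policy are assumptions, not optim.py's actual logic:

from collections import deque
import torch

class MedianGradClipper:
    def __init__(self, clipping_scale=2.0, window=128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)   # recent total grad norms

    def __call__(self, params):
        params = [p for p in params if p.grad is not None]
        total = torch.norm(torch.stack([p.grad.norm() for p in params]))
        self.norms.append(float(total))
        hist = torch.tensor(list(self.norms))
        quartiles = torch.quantile(
            hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * float(quartiles[2])  # 2 x median
        if float(total) > threshold:
            for p in params:
                p.grad.mul_(threshold / float(total))  # rescale in place
            return quartiles, threshold, True          # step was clipped
        return quartiles, threshold, False
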
], batch size: 317, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:03:21,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.970e+02 3.486e+02 4.809e+02 7.983e+02, threshold=6.973e+02, percent-clipped=4.0 2023-06-19 12:04:32,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=341910.0, ans=0.0 2023-06-19 12:05:04,490 INFO [train.py:996] (3/4) Epoch 2, batch 26500, loss[loss=0.2499, simple_loss=0.3133, pruned_loss=0.09324, over 21622.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3466, pruned_loss=0.1109, over 4267488.43 frames. ], batch size: 263, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:06:10,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=342090.0, ans=0.2 2023-06-19 12:07:39,687 INFO [train.py:996] (3/4) Epoch 2, batch 26550, loss[loss=0.1973, simple_loss=0.2761, pruned_loss=0.05926, over 21542.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3436, pruned_loss=0.1065, over 4255251.71 frames. ], batch size: 212, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:08:27,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.765e+02 3.396e+02 4.650e+02 6.569e+02, threshold=6.792e+02, percent-clipped=0.0 2023-06-19 12:08:49,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=342390.0, ans=0.125 2023-06-19 12:08:56,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=342390.0, ans=0.1 2023-06-19 12:09:49,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=342570.0, ans=0.0 2023-06-19 12:09:50,670 INFO [train.py:996] (3/4) Epoch 2, batch 26600, loss[loss=0.2416, simple_loss=0.3001, pruned_loss=0.09153, over 20720.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3415, pruned_loss=0.1029, over 4251420.50 frames. ], batch size: 608, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:11:43,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=342810.0, ans=0.0 2023-06-19 12:11:53,035 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:12:01,204 INFO [train.py:996] (3/4) Epoch 2, batch 26650, loss[loss=0.1865, simple_loss=0.272, pruned_loss=0.05052, over 21656.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3331, pruned_loss=0.1007, over 4253755.57 frames. ], batch size: 298, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:12:49,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.521e+02 2.956e+02 3.686e+02 6.672e+02, threshold=5.912e+02, percent-clipped=0.0 2023-06-19 12:12:59,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=342990.0, ans=0.125 2023-06-19 12:14:24,018 INFO [train.py:996] (3/4) Epoch 2, batch 26700, loss[loss=0.2499, simple_loss=0.3028, pruned_loss=0.09847, over 21213.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3263, pruned_loss=0.09841, over 4253486.11 frames. 
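The grad_scale field at the end of each loss record is the fp16 loss-scaling factor, and it moves in powers of two through this stretch: 64.0 at batch 26200, down to 16.0 by batch 26500, back up to 32.0 at batch 26800. That is the standard dynamic-loss-scaling pattern: shrink after a step whose gradients overflow to inf/nan, grow again after a run of stable steps. With PyTorch's stock scaler the loop looks roughly like this (compute_loss and train_dl are stand-ins; this trainer may manage the scale itself):

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

for batch in train_dl:                       # train_dl: assumed dataloader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)    # hypothetical helper
    scaler.scale(loss).backward()
    scaler.step(optimizer)                   # skipped if grads overflowed
    scaler.update()                          # halves or grows the scale
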
], batch size: 608, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:15:08,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=343230.0, ans=0.125 2023-06-19 12:15:16,495 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=15.0 2023-06-19 12:15:39,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-19 12:15:53,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=22.5 2023-06-19 12:15:55,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=343350.0, ans=0.125 2023-06-19 12:16:25,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=343410.0, ans=0.2 2023-06-19 12:16:33,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=343410.0, ans=0.0 2023-06-19 12:16:35,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=343410.0, ans=0.125 2023-06-19 12:16:37,959 INFO [train.py:996] (3/4) Epoch 2, batch 26750, loss[loss=0.3213, simple_loss=0.3822, pruned_loss=0.1302, over 21338.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3263, pruned_loss=0.09706, over 4262546.12 frames. ], batch size: 143, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:16:38,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=343470.0, ans=0.125 2023-06-19 12:17:12,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=343530.0, ans=0.04949747468305833 2023-06-19 12:17:33,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 2.465e+02 2.855e+02 3.693e+02 7.404e+02, threshold=5.711e+02, percent-clipped=5.0 2023-06-19 12:17:58,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-19 12:17:59,088 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:18:47,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=343710.0, ans=0.125 2023-06-19 12:19:17,194 INFO [train.py:996] (3/4) Epoch 2, batch 26800, loss[loss=0.3087, simple_loss=0.3617, pruned_loss=0.1278, over 21490.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3355, pruned_loss=0.1023, over 4266701.52 frames. ], batch size: 194, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:19:43,841 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.59 vs. 
limit=15.0 2023-06-19 12:19:46,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=343830.0, ans=0.125 2023-06-19 12:19:48,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=343830.0, ans=0.0 2023-06-19 12:20:17,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=343890.0, ans=0.0 2023-06-19 12:20:18,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=343890.0, ans=0.07 2023-06-19 12:20:42,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=343950.0, ans=0.125 2023-06-19 12:21:25,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=344070.0, ans=0.035 2023-06-19 12:21:26,712 INFO [train.py:996] (3/4) Epoch 2, batch 26850, loss[loss=0.2544, simple_loss=0.3084, pruned_loss=0.1002, over 15035.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.338, pruned_loss=0.1062, over 4261360.54 frames. ], batch size: 60, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:21:38,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=344070.0, ans=0.125 2023-06-19 12:22:01,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.924e+02 3.551e+02 4.179e+02 8.286e+02, threshold=7.103e+02, percent-clipped=6.0 2023-06-19 12:22:32,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=344190.0, ans=0.0 2023-06-19 12:23:00,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=344310.0, ans=0.0 2023-06-19 12:23:26,405 INFO [train.py:996] (3/4) Epoch 2, batch 26900, loss[loss=0.2543, simple_loss=0.298, pruned_loss=0.1053, over 21657.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.329, pruned_loss=0.1048, over 4265615.45 frames. ], batch size: 282, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:23:27,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-19 12:23:39,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=344370.0, ans=0.125 2023-06-19 12:23:49,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=344370.0, ans=0.2 2023-06-19 12:25:37,867 INFO [train.py:996] (3/4) Epoch 2, batch 26950, loss[loss=0.2703, simple_loss=0.3423, pruned_loss=0.09918, over 21507.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3277, pruned_loss=0.1041, over 4262914.12 frames. ], batch size: 212, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:25:40,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=344670.0, ans=0.125 2023-06-19 12:26:25,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.27 vs. 
limit=15.0 2023-06-19 12:26:25,930 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.623e+02 3.082e+02 3.619e+02 5.950e+02, threshold=6.165e+02, percent-clipped=0.0 2023-06-19 12:26:57,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=22.5 2023-06-19 12:27:35,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=344910.0, ans=0.0 2023-06-19 12:27:59,445 INFO [train.py:996] (3/4) Epoch 2, batch 27000, loss[loss=0.2408, simple_loss=0.3236, pruned_loss=0.079, over 21705.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.329, pruned_loss=0.1017, over 4272109.39 frames. ], batch size: 298, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:27:59,446 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 12:28:58,705 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2596, simple_loss=0.3558, pruned_loss=0.08164, over 1796401.00 frames. 2023-06-19 12:28:58,717 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-19 12:29:05,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=344970.0, ans=0.015 2023-06-19 12:29:34,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=345030.0, ans=0.125 2023-06-19 12:30:48,514 INFO [train.py:996] (3/4) Epoch 2, batch 27050, loss[loss=0.2134, simple_loss=0.3236, pruned_loss=0.05164, over 19791.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.331, pruned_loss=0.09795, over 4273383.54 frames. ], batch size: 702, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:31:00,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-19 12:31:23,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.340e+02 2.656e+02 3.215e+02 6.128e+02, threshold=5.313e+02, percent-clipped=0.0 2023-06-19 12:31:57,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.96 vs. limit=10.0 2023-06-19 12:32:05,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-19 12:32:08,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.35 vs. limit=8.0 2023-06-19 12:33:09,922 INFO [train.py:996] (3/4) Epoch 2, batch 27100, loss[loss=0.2814, simple_loss=0.3516, pruned_loss=0.1056, over 21462.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3348, pruned_loss=0.1008, over 4281030.77 frames. ], batch size: 548, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:33:18,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=345570.0, ans=0.125 2023-06-19 12:33:48,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=345690.0, ans=0.125 2023-06-19 12:35:13,377 INFO [train.py:996] (3/4) Epoch 2, batch 27150, loss[loss=0.3016, simple_loss=0.3842, pruned_loss=0.1095, over 21625.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3459, pruned_loss=0.1044, over 4281483.78 frames. 
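At batch 27000 the trainer pauses for its periodic validation pass: Computing validation loss, then validation: loss=0.2596 ... over 1796401.00 frames, plus the high-water memory mark (23918MB). The dev set is evaluated in full with gradients disabled and the frame-weighted average loss is reported. A sketch of that pass, where compute_loss is a hypothetical helper returning a summed loss and the number of frames it covers:

import torch

def validate(model, valid_dl, device):
    was_training = model.training
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch, device)  # hypothetical
            tot_loss += float(loss)
            tot_frames += num_frames
    if was_training:
        model.train()
    return tot_loss / tot_frames   # frame-weighted average, as logged
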
], batch size: 263, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:35:32,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=345930.0, ans=0.0 2023-06-19 12:35:35,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=345930.0, ans=0.125 2023-06-19 12:35:36,686 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.859e+02 3.282e+02 3.762e+02 6.108e+02, threshold=6.564e+02, percent-clipped=5.0 2023-06-19 12:35:40,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=345930.0, ans=0.125 2023-06-19 12:35:45,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-19 12:36:14,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=345990.0, ans=0.0 2023-06-19 12:36:23,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=346050.0, ans=0.2 2023-06-19 12:36:37,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=346050.0, ans=10.0 2023-06-19 12:37:04,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=346110.0, ans=0.0 2023-06-19 12:37:10,439 INFO [train.py:996] (3/4) Epoch 2, batch 27200, loss[loss=0.2911, simple_loss=0.3664, pruned_loss=0.1078, over 21762.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3518, pruned_loss=0.1059, over 4282425.29 frames. ], batch size: 332, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:37:16,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=346170.0, ans=0.125 2023-06-19 12:38:11,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0 2023-06-19 12:38:24,516 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-19 12:39:12,436 INFO [train.py:996] (3/4) Epoch 2, batch 27250, loss[loss=0.3371, simple_loss=0.3901, pruned_loss=0.1421, over 21576.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3568, pruned_loss=0.1124, over 4284992.00 frames. ], batch size: 389, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:39:31,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=346470.0, ans=0.2 2023-06-19 12:39:32,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.83 vs. 
limit=22.5 2023-06-19 12:39:37,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=346470.0, ans=0.2 2023-06-19 12:39:51,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=346530.0, ans=0.0 2023-06-19 12:40:02,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.829e+02 3.133e+02 3.689e+02 6.969e+02, threshold=6.265e+02, percent-clipped=1.0 2023-06-19 12:41:38,424 INFO [train.py:996] (3/4) Epoch 2, batch 27300, loss[loss=0.3032, simple_loss=0.3837, pruned_loss=0.1114, over 21588.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3594, pruned_loss=0.1139, over 4285552.49 frames. ], batch size: 414, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:41:38,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=346770.0, ans=0.0 2023-06-19 12:41:39,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-19 12:42:22,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=346830.0, ans=0.0 2023-06-19 12:42:53,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=346890.0, ans=0.125 2023-06-19 12:43:00,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=346950.0, ans=0.125 2023-06-19 12:43:59,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=347010.0, ans=0.125 2023-06-19 12:44:02,148 INFO [train.py:996] (3/4) Epoch 2, batch 27350, loss[loss=0.28, simple_loss=0.3447, pruned_loss=0.1077, over 21254.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3615, pruned_loss=0.1153, over 4281192.71 frames. ], batch size: 143, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:44:53,032 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.548e+02 2.973e+02 3.425e+02 6.706e+02, threshold=5.945e+02, percent-clipped=1.0 2023-06-19 12:45:10,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=347190.0, ans=0.125 2023-06-19 12:45:11,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=347190.0, ans=0.125 2023-06-19 12:45:36,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=347310.0, ans=0.0 2023-06-19 12:46:09,206 INFO [train.py:996] (3/4) Epoch 2, batch 27400, loss[loss=0.2314, simple_loss=0.283, pruned_loss=0.08989, over 21598.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.357, pruned_loss=0.1144, over 4288874.17 frames. ], batch size: 231, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:46:41,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=347430.0, ans=0.0 2023-06-19 12:48:07,711 INFO [train.py:996] (3/4) Epoch 2, batch 27450, loss[loss=0.3149, simple_loss=0.3684, pruned_loss=0.1307, over 21293.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3496, pruned_loss=0.1115, over 4276868.38 frames. 
], batch size: 143, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:49:01,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.006e+02 3.444e+02 4.102e+02 6.886e+02, threshold=6.888e+02, percent-clipped=2.0 2023-06-19 12:49:02,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=347730.0, ans=0.125 2023-06-19 12:49:05,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=347790.0, ans=0.2 2023-06-19 12:49:10,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=347790.0, ans=0.1 2023-06-19 12:50:24,234 INFO [train.py:996] (3/4) Epoch 2, batch 27500, loss[loss=0.2826, simple_loss=0.3406, pruned_loss=0.1123, over 21337.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3479, pruned_loss=0.1116, over 4281314.74 frames. ], batch size: 143, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:50:26,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=347970.0, ans=0.125 2023-06-19 12:50:44,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=348030.0, ans=0.95 2023-06-19 12:51:33,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=15.0 2023-06-19 12:52:05,140 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.20 vs. limit=22.5 2023-06-19 12:52:29,467 INFO [train.py:996] (3/4) Epoch 2, batch 27550, loss[loss=0.2311, simple_loss=0.2964, pruned_loss=0.08296, over 21662.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3409, pruned_loss=0.1073, over 4284002.82 frames. ], batch size: 247, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:53:08,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.657e+02 3.197e+02 3.865e+02 5.896e+02, threshold=6.395e+02, percent-clipped=0.0 2023-06-19 12:54:01,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=348510.0, ans=0.125 2023-06-19 12:54:13,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.19 vs. limit=15.0 2023-06-19 12:54:36,377 INFO [train.py:996] (3/4) Epoch 2, batch 27600, loss[loss=0.2781, simple_loss=0.3212, pruned_loss=0.1175, over 21592.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3346, pruned_loss=0.1058, over 4279555.47 frames. ], batch size: 415, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:55:37,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=348750.0, ans=0.2 2023-06-19 12:55:40,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=348750.0, ans=0.0 2023-06-19 12:56:19,788 INFO [train.py:996] (3/4) Epoch 2, batch 27650, loss[loss=0.2657, simple_loss=0.3335, pruned_loss=0.09895, over 21436.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3291, pruned_loss=0.1052, over 4271538.51 frames. 
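The learning rate in these records decays very gradually, from 1.49e-02 near batch 25650 to 1.45e-02 by batch 28750, because it is a smooth function of both the batch index and the epoch rather than a stepped schedule. Assuming an Eden-style rule of roughly this form (the exponents and the two time constants are assumptions, not read from the trainer):

def eden_lr(base_lr, batch, epoch, lr_batches, lr_epochs):
    # Decays as a -0.25 power in both batch count and (fractional) epoch;
    # at batch 0, epoch 0 both factors are 1.0 and lr equals base_lr.
    return (base_lr
            * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
            * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)
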
], batch size: 131, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:56:32,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=348870.0, ans=10.0 2023-06-19 12:56:46,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=348930.0, ans=0.125 2023-06-19 12:56:49,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-19 12:56:50,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=348930.0, ans=0.125 2023-06-19 12:57:01,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.710e+02 3.133e+02 3.895e+02 7.823e+02, threshold=6.265e+02, percent-clipped=1.0 2023-06-19 12:57:18,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=15.0 2023-06-19 12:57:20,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=348990.0, ans=0.125 2023-06-19 12:58:03,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=349110.0, ans=0.2 2023-06-19 12:58:15,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=349110.0, ans=0.0 2023-06-19 12:58:17,476 INFO [train.py:996] (3/4) Epoch 2, batch 27700, loss[loss=0.2725, simple_loss=0.347, pruned_loss=0.09901, over 21778.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3297, pruned_loss=0.1038, over 4271313.14 frames. ], batch size: 332, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:59:05,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=349230.0, ans=0.5 2023-06-19 12:59:33,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=349350.0, ans=0.0 2023-06-19 12:59:33,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349350.0, ans=0.1 2023-06-19 13:00:16,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=349410.0, ans=0.125 2023-06-19 13:00:30,255 INFO [train.py:996] (3/4) Epoch 2, batch 27750, loss[loss=0.2353, simple_loss=0.3029, pruned_loss=0.08383, over 21229.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3326, pruned_loss=0.1035, over 4279718.00 frames. ], batch size: 176, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 13:01:06,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.48 vs. 
limit=15.0 2023-06-19 13:01:12,829 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.697e+02 3.256e+02 3.829e+02 6.521e+02, threshold=6.511e+02, percent-clipped=1.0 2023-06-19 13:01:27,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=349590.0, ans=0.125 2023-06-19 13:01:29,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=349590.0, ans=0.125 2023-06-19 13:01:33,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=349590.0, ans=0.125 2023-06-19 13:02:40,128 INFO [train.py:996] (3/4) Epoch 2, batch 27800, loss[loss=0.3031, simple_loss=0.3503, pruned_loss=0.128, over 21770.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3312, pruned_loss=0.1033, over 4277521.24 frames. ], batch size: 441, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 13:03:08,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=349770.0, ans=0.125 2023-06-19 13:03:09,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=349770.0, ans=0.125 2023-06-19 13:03:24,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=349830.0, ans=0.125 2023-06-19 13:04:44,754 INFO [train.py:996] (3/4) Epoch 2, batch 27850, loss[loss=0.2648, simple_loss=0.3318, pruned_loss=0.09893, over 21860.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.332, pruned_loss=0.1057, over 4284375.91 frames. ], batch size: 371, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 13:04:56,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=350070.0, ans=0.0 2023-06-19 13:05:09,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=350070.0, ans=0.0 2023-06-19 13:05:41,174 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.837e+02 3.296e+02 3.956e+02 7.636e+02, threshold=6.591e+02, percent-clipped=1.0 2023-06-19 13:05:47,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=350190.0, ans=0.125 2023-06-19 13:06:22,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=350250.0, ans=0.1 2023-06-19 13:07:15,638 INFO [train.py:996] (3/4) Epoch 2, batch 27900, loss[loss=0.2608, simple_loss=0.3462, pruned_loss=0.08775, over 21569.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3406, pruned_loss=0.1064, over 4288436.59 frames. 
], batch size: 230, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 13:07:59,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=350490.0, ans=0.0 2023-06-19 13:08:28,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=350550.0, ans=0.125 2023-06-19 13:08:29,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=350550.0, ans=0.125 2023-06-19 13:09:19,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=350670.0, ans=0.0 2023-06-19 13:09:20,637 INFO [train.py:996] (3/4) Epoch 2, batch 27950, loss[loss=0.2317, simple_loss=0.3228, pruned_loss=0.07033, over 21732.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3406, pruned_loss=0.1015, over 4284234.25 frames. ], batch size: 351, lr: 1.46e-02, grad_scale: 16.0 2023-06-19 13:09:57,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.612e+02 3.167e+02 4.012e+02 7.863e+02, threshold=6.333e+02, percent-clipped=3.0 2023-06-19 13:10:59,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=350850.0, ans=0.0 2023-06-19 13:11:20,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=350910.0, ans=0.2 2023-06-19 13:11:28,390 INFO [train.py:996] (3/4) Epoch 2, batch 28000, loss[loss=0.2391, simple_loss=0.295, pruned_loss=0.09159, over 21692.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3372, pruned_loss=0.09929, over 4287258.76 frames. ], batch size: 263, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:11:54,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=351030.0, ans=0.125 2023-06-19 13:11:54,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=351030.0, ans=0.125 2023-06-19 13:12:28,411 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:12:31,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=351090.0, ans=0.125 2023-06-19 13:12:33,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351090.0, ans=0.1 2023-06-19 13:13:14,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=351210.0, ans=0.035 2023-06-19 13:13:26,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=351210.0, ans=0.125 2023-06-19 13:13:30,501 INFO [train.py:996] (3/4) Epoch 2, batch 28050, loss[loss=0.2727, simple_loss=0.3454, pruned_loss=0.1, over 21703.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3357, pruned_loss=0.1012, over 4289185.24 frames. 
], batch size: 389, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:13:36,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=351270.0, ans=0.0 2023-06-19 13:14:33,395 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.018e+02 3.592e+02 4.421e+02 7.489e+02, threshold=7.184e+02, percent-clipped=8.0 2023-06-19 13:14:35,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=351330.0, ans=0.0 2023-06-19 13:15:39,819 INFO [train.py:996] (3/4) Epoch 2, batch 28100, loss[loss=0.2477, simple_loss=0.313, pruned_loss=0.09118, over 21830.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3321, pruned_loss=0.1007, over 4282797.12 frames. ], batch size: 107, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:15:52,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-19 13:16:16,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=351630.0, ans=0.125 2023-06-19 13:16:45,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-19 13:17:19,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=351810.0, ans=0.5 2023-06-19 13:17:35,510 INFO [train.py:996] (3/4) Epoch 2, batch 28150, loss[loss=0.2387, simple_loss=0.2906, pruned_loss=0.09334, over 21618.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.325, pruned_loss=0.101, over 4287709.37 frames. ], batch size: 282, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:17:41,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=351870.0, ans=0.125 2023-06-19 13:18:30,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.813e+02 3.244e+02 3.974e+02 6.604e+02, threshold=6.487e+02, percent-clipped=0.0 2023-06-19 13:18:56,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=351990.0, ans=0.2 2023-06-19 13:19:16,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=352050.0, ans=0.0 2023-06-19 13:19:17,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352050.0, ans=0.1 2023-06-19 13:19:37,300 INFO [train.py:996] (3/4) Epoch 2, batch 28200, loss[loss=0.293, simple_loss=0.3408, pruned_loss=0.1226, over 21254.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3227, pruned_loss=0.1034, over 4288892.52 frames. ], batch size: 143, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:19:40,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=352170.0, ans=0.125 2023-06-19 13:20:43,563 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-19 13:21:43,612 INFO [train.py:996] (3/4) Epoch 2, batch 28250, loss[loss=0.2685, simple_loss=0.3218, pruned_loss=0.1076, over 21791.00 frames. 
], tot_loss[loss=0.2695, simple_loss=0.3263, pruned_loss=0.1064, over 4288423.57 frames. ], batch size: 352, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:22:30,580 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.279e+02 3.953e+02 4.755e+02 7.153e+02, threshold=7.906e+02, percent-clipped=2.0 2023-06-19 13:24:01,142 INFO [train.py:996] (3/4) Epoch 2, batch 28300, loss[loss=0.206, simple_loss=0.2831, pruned_loss=0.06446, over 21376.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.325, pruned_loss=0.1035, over 4277675.42 frames. ], batch size: 211, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:24:26,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=352830.0, ans=0.2 2023-06-19 13:24:28,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=352830.0, ans=0.0 2023-06-19 13:25:50,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353010.0, ans=0.1 2023-06-19 13:26:11,228 INFO [train.py:996] (3/4) Epoch 2, batch 28350, loss[loss=0.2373, simple_loss=0.2957, pruned_loss=0.08945, over 21321.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.323, pruned_loss=0.09715, over 4270207.71 frames. ], batch size: 211, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:27:06,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=353130.0, ans=0.125 2023-06-19 13:27:10,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 2.403e+02 3.006e+02 3.946e+02 7.827e+02, threshold=6.012e+02, percent-clipped=0.0 2023-06-19 13:27:32,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=353250.0, ans=0.125 2023-06-19 13:27:36,732 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-19 13:27:40,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. limit=6.0 2023-06-19 13:28:34,679 INFO [train.py:996] (3/4) Epoch 2, batch 28400, loss[loss=0.2536, simple_loss=0.3092, pruned_loss=0.09895, over 21363.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3201, pruned_loss=0.09734, over 4264392.04 frames. ], batch size: 211, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:29:11,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=353430.0, ans=0.0 2023-06-19 13:29:11,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=353430.0, ans=0.1 2023-06-19 13:29:55,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-19 13:30:38,157 INFO [train.py:996] (3/4) Epoch 2, batch 28450, loss[loss=0.2905, simple_loss=0.3575, pruned_loss=0.1117, over 21776.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3272, pruned_loss=0.1021, over 4267851.99 frames. 
], batch size: 112, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:30:40,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-19 13:31:14,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=353730.0, ans=0.125 2023-06-19 13:31:20,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 3.030e+02 3.579e+02 4.363e+02 8.439e+02, threshold=7.159e+02, percent-clipped=7.0 2023-06-19 13:31:37,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=353790.0, ans=0.2 2023-06-19 13:31:53,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-19 13:32:47,135 INFO [train.py:996] (3/4) Epoch 2, batch 28500, loss[loss=0.3087, simple_loss=0.3656, pruned_loss=0.1259, over 21768.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3295, pruned_loss=0.1044, over 4271397.57 frames. ], batch size: 124, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:33:18,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354030.0, ans=0.1 2023-06-19 13:33:59,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.76 vs. limit=10.0 2023-06-19 13:34:28,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-19 13:34:41,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=354210.0, ans=0.125 2023-06-19 13:34:50,567 INFO [train.py:996] (3/4) Epoch 2, batch 28550, loss[loss=0.3024, simple_loss=0.3848, pruned_loss=0.11, over 21472.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3398, pruned_loss=0.1083, over 4275930.16 frames. ], batch size: 211, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:35:35,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.880e+02 3.273e+02 3.942e+02 6.177e+02, threshold=6.546e+02, percent-clipped=0.0 2023-06-19 13:36:58,164 INFO [train.py:996] (3/4) Epoch 2, batch 28600, loss[loss=0.2749, simple_loss=0.344, pruned_loss=0.1029, over 21573.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3461, pruned_loss=0.1101, over 4275724.74 frames. ], batch size: 230, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:37:14,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=22.5 2023-06-19 13:37:49,819 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.93 vs. 
limit=15.0 2023-06-19 13:37:54,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=354630.0, ans=0.2 2023-06-19 13:37:58,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=354690.0, ans=0.125 2023-06-19 13:38:30,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=354750.0, ans=0.125 2023-06-19 13:38:52,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=354810.0, ans=0.125 2023-06-19 13:39:02,544 INFO [train.py:996] (3/4) Epoch 2, batch 28650, loss[loss=0.2246, simple_loss=0.2697, pruned_loss=0.0897, over 21499.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3386, pruned_loss=0.1084, over 4262241.58 frames. ], batch size: 196, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:39:16,621 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-06-19 13:39:44,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.911e+02 3.310e+02 3.742e+02 7.048e+02, threshold=6.621e+02, percent-clipped=2.0 2023-06-19 13:40:56,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355110.0, ans=0.1 2023-06-19 13:41:08,386 INFO [train.py:996] (3/4) Epoch 2, batch 28700, loss[loss=0.2822, simple_loss=0.3365, pruned_loss=0.114, over 21222.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3384, pruned_loss=0.1096, over 4259500.91 frames. ], batch size: 143, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:41:33,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=355230.0, ans=0.2 2023-06-19 13:42:10,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=355290.0, ans=0.125 2023-06-19 13:42:25,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=355290.0, ans=0.125 2023-06-19 13:43:04,244 INFO [train.py:996] (3/4) Epoch 2, batch 28750, loss[loss=0.2792, simple_loss=0.3428, pruned_loss=0.1078, over 21250.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3379, pruned_loss=0.1103, over 4265112.36 frames. ], batch size: 143, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:43:58,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 2.722e+02 3.095e+02 3.469e+02 6.636e+02, threshold=6.190e+02, percent-clipped=1.0 2023-06-19 13:44:19,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-19 13:45:21,932 INFO [train.py:996] (3/4) Epoch 2, batch 28800, loss[loss=0.3609, simple_loss=0.4081, pruned_loss=0.1569, over 21474.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3424, pruned_loss=0.1113, over 4269401.76 frames. ], batch size: 471, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:46:59,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=355950.0, ans=0.125 2023-06-19 13:47:33,454 INFO [train.py:996] (3/4) Epoch 2, batch 28850, loss[loss=0.2515, simple_loss=0.306, pruned_loss=0.0985, over 21628.00 frames. 
], tot_loss[loss=0.2842, simple_loss=0.344, pruned_loss=0.1122, over 4266168.92 frames. ], batch size: 212, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:47:39,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=356070.0, ans=10.0 2023-06-19 13:48:10,928 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:48:13,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 2.928e+02 3.488e+02 4.240e+02 7.326e+02, threshold=6.975e+02, percent-clipped=5.0 2023-06-19 13:49:55,045 INFO [train.py:996] (3/4) Epoch 2, batch 28900, loss[loss=0.3061, simple_loss=0.3624, pruned_loss=0.1249, over 21422.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3457, pruned_loss=0.1133, over 4271542.98 frames. ], batch size: 548, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:50:17,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.12 vs. limit=10.0 2023-06-19 13:52:05,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=356610.0, ans=0.0 2023-06-19 13:52:10,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=356670.0, ans=0.125 2023-06-19 13:52:11,440 INFO [train.py:996] (3/4) Epoch 2, batch 28950, loss[loss=0.3607, simple_loss=0.4152, pruned_loss=0.1531, over 21526.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3494, pruned_loss=0.1136, over 4265734.87 frames. ], batch size: 507, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:52:17,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=356670.0, ans=0.0 2023-06-19 13:53:03,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=356730.0, ans=0.0 2023-06-19 13:53:15,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=356730.0, ans=0.0 2023-06-19 13:53:15,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.845e+02 3.297e+02 3.891e+02 9.356e+02, threshold=6.594e+02, percent-clipped=2.0 2023-06-19 13:53:20,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2023-06-19 13:53:41,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=356790.0, ans=0.2 2023-06-19 13:54:32,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-19 13:54:36,138 INFO [train.py:996] (3/4) Epoch 2, batch 29000, loss[loss=0.309, simple_loss=0.3698, pruned_loss=0.1242, over 21306.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3533, pruned_loss=0.1125, over 4263445.47 frames. 
], batch size: 548, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:54:56,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=356970.0, ans=0.1 2023-06-19 13:56:00,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=357150.0, ans=0.0 2023-06-19 13:56:45,621 INFO [train.py:996] (3/4) Epoch 2, batch 29050, loss[loss=0.2517, simple_loss=0.3159, pruned_loss=0.09378, over 21413.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3506, pruned_loss=0.113, over 4268201.15 frames. ], batch size: 131, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:57:18,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=357330.0, ans=0.125 2023-06-19 13:57:22,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.800e+02 3.368e+02 3.834e+02 6.942e+02, threshold=6.736e+02, percent-clipped=1.0 2023-06-19 13:58:56,887 INFO [train.py:996] (3/4) Epoch 2, batch 29100, loss[loss=0.2649, simple_loss=0.3229, pruned_loss=0.1035, over 21819.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3397, pruned_loss=0.1095, over 4276863.08 frames. ], batch size: 98, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:59:37,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=357690.0, ans=0.2 2023-06-19 13:59:38,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=357690.0, ans=0.125 2023-06-19 14:00:17,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=357750.0, ans=0.0 2023-06-19 14:00:17,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=357750.0, ans=0.1 2023-06-19 14:00:35,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.43 vs. limit=22.5 2023-06-19 14:00:44,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=357810.0, ans=0.125 2023-06-19 14:00:46,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=357870.0, ans=0.05 2023-06-19 14:00:47,411 INFO [train.py:996] (3/4) Epoch 2, batch 29150, loss[loss=0.292, simple_loss=0.356, pruned_loss=0.114, over 21666.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3393, pruned_loss=0.1087, over 4271885.69 frames. 
], batch size: 332, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:01:15,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=357870.0, ans=0.125 2023-06-19 14:01:20,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=357930.0, ans=0.125 2023-06-19 14:01:30,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.793e+02 3.362e+02 4.160e+02 6.908e+02, threshold=6.724e+02, percent-clipped=1.0 2023-06-19 14:02:46,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358110.0, ans=0.1 2023-06-19 14:02:53,705 INFO [train.py:996] (3/4) Epoch 2, batch 29200, loss[loss=0.2537, simple_loss=0.317, pruned_loss=0.09524, over 21802.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3334, pruned_loss=0.1076, over 4275601.66 frames. ], batch size: 102, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:02:55,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=358170.0, ans=0.125 2023-06-19 14:03:04,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=358170.0, ans=0.125 2023-06-19 14:03:18,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=358230.0, ans=0.125 2023-06-19 14:04:26,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=358410.0, ans=0.05 2023-06-19 14:04:58,584 INFO [train.py:996] (3/4) Epoch 2, batch 29250, loss[loss=0.2566, simple_loss=0.3412, pruned_loss=0.08601, over 21704.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3313, pruned_loss=0.1037, over 4275115.19 frames. ], batch size: 298, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:05:23,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.478e+02 3.001e+02 3.725e+02 6.344e+02, threshold=6.002e+02, percent-clipped=0.0 2023-06-19 14:06:03,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358650.0, ans=0.1 2023-06-19 14:06:23,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=15.0 2023-06-19 14:06:57,919 INFO [train.py:996] (3/4) Epoch 2, batch 29300, loss[loss=0.2399, simple_loss=0.298, pruned_loss=0.09091, over 21558.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3323, pruned_loss=0.1021, over 4272112.49 frames. ], batch size: 132, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:07:29,649 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-19 14:07:59,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358890.0, ans=0.1 2023-06-19 14:08:23,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=358950.0, ans=0.125 2023-06-19 14:08:48,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.16 vs. 
limit=6.0 2023-06-19 14:09:07,176 INFO [train.py:996] (3/4) Epoch 2, batch 29350, loss[loss=0.2585, simple_loss=0.3101, pruned_loss=0.1034, over 21843.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3266, pruned_loss=0.101, over 4256903.02 frames. ], batch size: 107, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:09:12,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=359070.0, ans=0.125 2023-06-19 14:09:27,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=359130.0, ans=0.0 2023-06-19 14:09:30,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=359130.0, ans=0.1 2023-06-19 14:10:02,473 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.578e+02 2.975e+02 3.361e+02 4.111e+02, threshold=5.949e+02, percent-clipped=0.0 2023-06-19 14:11:03,629 INFO [train.py:996] (3/4) Epoch 2, batch 29400, loss[loss=0.2463, simple_loss=0.3286, pruned_loss=0.08198, over 21710.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3261, pruned_loss=0.09781, over 4260707.88 frames. ], batch size: 298, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:11:54,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=359430.0, ans=0.1 2023-06-19 14:12:07,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=359490.0, ans=0.0 2023-06-19 14:13:01,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=359610.0, ans=0.2 2023-06-19 14:13:07,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=359610.0, ans=0.2 2023-06-19 14:13:09,872 INFO [train.py:996] (3/4) Epoch 2, batch 29450, loss[loss=0.2713, simple_loss=0.3402, pruned_loss=0.1012, over 20687.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3234, pruned_loss=0.09658, over 4259378.57 frames. ], batch size: 607, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:13:38,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=359730.0, ans=0.125 2023-06-19 14:13:48,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=359730.0, ans=0.125 2023-06-19 14:14:04,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=359730.0, ans=0.0 2023-06-19 14:14:13,948 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.887e+02 3.363e+02 4.208e+02 7.799e+02, threshold=6.726e+02, percent-clipped=7.0 2023-06-19 14:14:22,248 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:14:34,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. limit=10.0 2023-06-19 14:14:47,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-19 14:14:54,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.53 vs. 
limit=12.0 2023-06-19 14:15:23,685 INFO [train.py:996] (3/4) Epoch 2, batch 29500, loss[loss=0.1766, simple_loss=0.2242, pruned_loss=0.06444, over 21820.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3293, pruned_loss=0.101, over 4268479.55 frames. ], batch size: 102, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:15:28,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=359970.0, ans=0.125 2023-06-19 14:17:33,212 INFO [train.py:996] (3/4) Epoch 2, batch 29550, loss[loss=0.2776, simple_loss=0.3391, pruned_loss=0.108, over 21936.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3297, pruned_loss=0.1039, over 4281303.07 frames. ], batch size: 113, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:18:20,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.713e+02 3.208e+02 3.753e+02 5.993e+02, threshold=6.415e+02, percent-clipped=0.0 2023-06-19 14:18:29,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-19 14:19:05,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=360450.0, ans=0.125 2023-06-19 14:19:32,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=360510.0, ans=0.2 2023-06-19 14:19:51,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=360570.0, ans=0.125 2023-06-19 14:19:52,642 INFO [train.py:996] (3/4) Epoch 2, batch 29600, loss[loss=0.3131, simple_loss=0.3851, pruned_loss=0.1205, over 21753.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3374, pruned_loss=0.1068, over 4287064.09 frames. ], batch size: 351, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:20:03,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=360570.0, ans=0.0 2023-06-19 14:20:04,153 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-19 14:20:06,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=360570.0, ans=0.125 2023-06-19 14:20:27,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=360630.0, ans=0.125 2023-06-19 14:20:45,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=360690.0, ans=0.125 2023-06-19 14:20:48,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=360690.0, ans=0.125 2023-06-19 14:21:04,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360690.0, ans=0.1 2023-06-19 14:21:54,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=360810.0, ans=0.125 2023-06-19 14:22:07,138 INFO [train.py:996] (3/4) Epoch 2, batch 29650, loss[loss=0.2488, simple_loss=0.3149, pruned_loss=0.09138, over 21849.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.333, pruned_loss=0.1024, over 4286445.17 frames. 
], batch size: 332, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:22:16,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360870.0, ans=0.1 2023-06-19 14:22:16,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=360870.0, ans=0.125 2023-06-19 14:22:40,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=360930.0, ans=0.125 2023-06-19 14:22:43,174 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.621e+02 3.260e+02 3.981e+02 6.748e+02, threshold=6.520e+02, percent-clipped=1.0 2023-06-19 14:23:02,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=360990.0, ans=0.5 2023-06-19 14:24:20,067 INFO [train.py:996] (3/4) Epoch 2, batch 29700, loss[loss=0.3757, simple_loss=0.4527, pruned_loss=0.1494, over 21529.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3369, pruned_loss=0.1033, over 4277194.76 frames. ], batch size: 471, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:25:03,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=361290.0, ans=0.1 2023-06-19 14:25:13,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=361290.0, ans=0.125 2023-06-19 14:25:14,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=361290.0, ans=0.125 2023-06-19 14:25:39,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=361410.0, ans=0.125 2023-06-19 14:26:05,126 INFO [train.py:996] (3/4) Epoch 2, batch 29750, loss[loss=0.2345, simple_loss=0.3021, pruned_loss=0.08347, over 21881.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3403, pruned_loss=0.1018, over 4264411.27 frames. ], batch size: 98, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:26:06,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-19 14:26:41,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.594e+02 3.244e+02 4.467e+02 8.629e+02, threshold=6.487e+02, percent-clipped=7.0 2023-06-19 14:27:13,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. 
limit=6.0 2023-06-19 14:27:17,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=361650.0, ans=0.125 2023-06-19 14:27:24,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=361650.0, ans=0.125 2023-06-19 14:27:25,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=361650.0, ans=0.05 2023-06-19 14:27:31,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=361650.0, ans=0.025 2023-06-19 14:28:12,797 INFO [train.py:996] (3/4) Epoch 2, batch 29800, loss[loss=0.2579, simple_loss=0.309, pruned_loss=0.1034, over 21251.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3418, pruned_loss=0.1033, over 4267350.33 frames. ], batch size: 608, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 14:28:30,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.00 vs. limit=22.5 2023-06-19 14:29:01,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=361890.0, ans=0.0 2023-06-19 14:29:18,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=361950.0, ans=0.0 2023-06-19 14:29:42,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=362010.0, ans=0.125 2023-06-19 14:30:10,957 INFO [train.py:996] (3/4) Epoch 2, batch 29850, loss[loss=0.2286, simple_loss=0.3066, pruned_loss=0.07531, over 21748.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3373, pruned_loss=0.1009, over 4274671.24 frames. ], batch size: 332, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 14:30:47,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.554e+02 3.049e+02 3.847e+02 7.515e+02, threshold=6.099e+02, percent-clipped=1.0 2023-06-19 14:32:22,163 INFO [train.py:996] (3/4) Epoch 2, batch 29900, loss[loss=0.3083, simple_loss=0.3591, pruned_loss=0.1288, over 21368.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3368, pruned_loss=0.1021, over 4270503.25 frames. 
], batch size: 176, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 14:32:41,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=362370.0, ans=22.5 2023-06-19 14:32:42,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=362430.0, ans=0.125 2023-06-19 14:32:45,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=362430.0, ans=0.125 2023-06-19 14:33:18,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=362490.0, ans=0.125 2023-06-19 14:33:24,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=362490.0, ans=0.0 2023-06-19 14:33:46,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=362550.0, ans=0.125 2023-06-19 14:34:10,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=362610.0, ans=0.125 2023-06-19 14:34:31,644 INFO [train.py:996] (3/4) Epoch 2, batch 29950, loss[loss=0.3038, simple_loss=0.3602, pruned_loss=0.1237, over 21648.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3427, pruned_loss=0.1081, over 4272391.36 frames. ], batch size: 351, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:34:38,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=362670.0, ans=0.125 2023-06-19 14:34:43,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-19 14:35:14,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.991e+02 3.617e+02 4.391e+02 6.676e+02, threshold=7.234e+02, percent-clipped=4.0 2023-06-19 14:36:19,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=362910.0, ans=0.125 2023-06-19 14:36:30,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-19 14:36:34,456 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-19 14:36:39,173 INFO [train.py:996] (3/4) Epoch 2, batch 30000, loss[loss=0.2321, simple_loss=0.3231, pruned_loss=0.07054, over 21878.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.346, pruned_loss=0.1088, over 4271778.04 frames. ], batch size: 316, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:36:39,174 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 14:37:25,479 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2591, simple_loss=0.3611, pruned_loss=0.07848, over 1796401.00 frames. 
2023-06-19 14:37:25,480 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-19 14:37:50,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=363030.0, ans=0.125 2023-06-19 14:38:20,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.53 vs. limit=22.5 2023-06-19 14:38:29,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=363150.0, ans=0.125 2023-06-19 14:38:30,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=363150.0, ans=0.125 2023-06-19 14:38:40,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=363150.0, ans=0.0 2023-06-19 14:38:44,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-19 14:38:48,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=363150.0, ans=0.125 2023-06-19 14:39:18,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=363210.0, ans=0.125 2023-06-19 14:39:47,272 INFO [train.py:996] (3/4) Epoch 2, batch 30050, loss[loss=0.3201, simple_loss=0.4155, pruned_loss=0.1123, over 21696.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3472, pruned_loss=0.105, over 4261085.32 frames. ], batch size: 389, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:40:11,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=363270.0, ans=0.125 2023-06-19 14:40:26,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.570e+02 3.053e+02 3.907e+02 7.518e+02, threshold=6.106e+02, percent-clipped=1.0 2023-06-19 14:40:28,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=363390.0, ans=0.125 2023-06-19 14:40:59,228 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:41:13,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.21 vs. limit=22.5 2023-06-19 14:41:25,604 INFO [train.py:996] (3/4) Epoch 2, batch 30100, loss[loss=0.2517, simple_loss=0.2964, pruned_loss=0.1035, over 21279.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3446, pruned_loss=0.1044, over 4252804.20 frames. 
], batch size: 176, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:41:54,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=363570.0, ans=0.125 2023-06-19 14:41:55,733 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:42:07,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=363630.0, ans=0.5 2023-06-19 14:42:10,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-19 14:42:11,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363690.0, ans=0.1 2023-06-19 14:42:29,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.75 vs. limit=15.0 2023-06-19 14:43:02,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=363810.0, ans=0.2 2023-06-19 14:43:37,593 INFO [train.py:996] (3/4) Epoch 2, batch 30150, loss[loss=0.2909, simple_loss=0.3486, pruned_loss=0.1166, over 21598.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3426, pruned_loss=0.1072, over 4257439.61 frames. ], batch size: 230, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:43:46,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=363870.0, ans=0.09899494936611666 2023-06-19 14:44:32,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.807e+02 3.248e+02 3.802e+02 5.683e+02, threshold=6.495e+02, percent-clipped=0.0 2023-06-19 14:44:47,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=363990.0, ans=0.125 2023-06-19 14:45:49,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=364110.0, ans=0.0 2023-06-19 14:45:53,860 INFO [train.py:996] (3/4) Epoch 2, batch 30200, loss[loss=0.2783, simple_loss=0.3646, pruned_loss=0.09601, over 20757.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3464, pruned_loss=0.1063, over 4258982.23 frames. ], batch size: 607, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:46:45,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=364230.0, ans=0.035 2023-06-19 14:46:52,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=364290.0, ans=0.125 2023-06-19 14:47:34,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=364350.0, ans=0.2 2023-06-19 14:47:34,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=364350.0, ans=15.0 2023-06-19 14:48:15,229 INFO [train.py:996] (3/4) Epoch 2, batch 30250, loss[loss=0.3906, simple_loss=0.4659, pruned_loss=0.1577, over 21533.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3539, pruned_loss=0.109, over 4265470.13 frames. 
], batch size: 471, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:48:42,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=364530.0, ans=0.0 2023-06-19 14:48:53,826 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.937e+02 3.483e+02 4.410e+02 7.312e+02, threshold=6.966e+02, percent-clipped=2.0 2023-06-19 14:48:55,030 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-19 14:48:55,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=364590.0, ans=0.125 2023-06-19 14:49:26,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=364590.0, ans=0.125 2023-06-19 14:50:12,997 INFO [train.py:996] (3/4) Epoch 2, batch 30300, loss[loss=0.2428, simple_loss=0.2984, pruned_loss=0.0936, over 21755.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3511, pruned_loss=0.1091, over 4253601.94 frames. ], batch size: 318, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:50:33,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=364770.0, ans=0.125 2023-06-19 14:51:28,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-19 14:52:29,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=365010.0, ans=0.05 2023-06-19 14:52:32,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=365010.0, ans=0.125 2023-06-19 14:52:44,517 INFO [train.py:996] (3/4) Epoch 2, batch 30350, loss[loss=0.1665, simple_loss=0.2017, pruned_loss=0.06564, over 17262.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3526, pruned_loss=0.1109, over 4251504.10 frames. ], batch size: 65, lr: 1.44e-02, grad_scale: 16.0 2023-06-19 14:52:51,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=365070.0, ans=0.125 2023-06-19 14:52:56,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=365070.0, ans=0.2 2023-06-19 14:53:34,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.858e+02 3.357e+02 4.811e+02 8.525e+02, threshold=6.714e+02, percent-clipped=9.0 2023-06-19 14:55:16,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=365310.0, ans=0.025 2023-06-19 14:55:28,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=365370.0, ans=0.0 2023-06-19 14:55:31,011 INFO [train.py:996] (3/4) Epoch 2, batch 30400, loss[loss=0.2616, simple_loss=0.3006, pruned_loss=0.1113, over 20095.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3464, pruned_loss=0.1092, over 4245797.52 frames. 
], batch size: 702, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:57:33,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=365490.0, ans=0.0 2023-06-19 14:58:04,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=365490.0, ans=0.0 2023-06-19 14:58:05,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=365490.0, ans=0.0 2023-06-19 14:58:17,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=365550.0, ans=0.125 2023-06-19 15:00:10,541 INFO [train.py:996] (3/4) Epoch 2, batch 30450, loss[loss=0.359, simple_loss=0.4547, pruned_loss=0.1317, over 19819.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3493, pruned_loss=0.1111, over 4189880.66 frames. ], batch size: 702, lr: 1.43e-02, grad_scale: 32.0 2023-06-19 15:00:10,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=365670.0, ans=0.0 2023-06-19 15:01:46,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.753e+02 4.971e+02 7.783e+02 2.032e+03, threshold=9.942e+02, percent-clipped=30.0 2023-06-19 15:02:26,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.57 vs. limit=15.0 2023-06-19 15:02:59,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=365910.0, ans=0.0 2023-06-19 15:05:22,970 INFO [train.py:996] (3/4) Epoch 3, batch 0, loss[loss=0.2712, simple_loss=0.3156, pruned_loss=0.1134, over 21279.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3156, pruned_loss=0.1134, over 21279.00 frames. ], batch size: 551, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:05:22,971 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 15:06:09,289 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2643, simple_loss=0.3711, pruned_loss=0.07872, over 1796401.00 frames. 2023-06-19 15:06:09,294 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-19 15:06:13,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=365934.0, ans=0.125 2023-06-19 15:06:35,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-19 15:06:58,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=366054.0, ans=0.2 2023-06-19 15:07:04,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=366114.0, ans=0.0 2023-06-19 15:07:40,174 INFO [train.py:996] (3/4) Epoch 3, batch 50, loss[loss=0.3036, simple_loss=0.3699, pruned_loss=0.1186, over 21416.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3368, pruned_loss=0.1034, over 949405.61 frames. 
], batch size: 471, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:07:40,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=366234.0, ans=0.125 2023-06-19 15:08:45,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.896e+02 3.528e+02 5.821e+02 1.512e+03, threshold=7.056e+02, percent-clipped=7.0 2023-06-19 15:08:47,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=366354.0, ans=0.125 2023-06-19 15:09:08,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=366414.0, ans=0.0 2023-06-19 15:09:12,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=366474.0, ans=0.125 2023-06-19 15:09:14,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=366474.0, ans=0.95 2023-06-19 15:09:15,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=366474.0, ans=0.0 2023-06-19 15:09:46,257 INFO [train.py:996] (3/4) Epoch 3, batch 100, loss[loss=0.2859, simple_loss=0.3582, pruned_loss=0.1069, over 21733.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3605, pruned_loss=0.1096, over 1674879.44 frames. ], batch size: 298, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:10:01,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=22.5 2023-06-19 15:11:19,674 INFO [train.py:996] (3/4) Epoch 3, batch 150, loss[loss=0.3458, simple_loss=0.4226, pruned_loss=0.1345, over 21655.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3611, pruned_loss=0.1077, over 2255593.71 frames. ], batch size: 441, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:11:20,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=366834.0, ans=0.125 2023-06-19 15:11:22,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=366834.0, ans=0.0 2023-06-19 15:11:37,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=366834.0, ans=0.125 2023-06-19 15:12:12,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.572e+02 2.987e+02 3.801e+02 6.423e+02, threshold=5.974e+02, percent-clipped=0.0 2023-06-19 15:12:14,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-19 15:13:17,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.77 vs. limit=15.0 2023-06-19 15:13:25,864 INFO [train.py:996] (3/4) Epoch 3, batch 200, loss[loss=0.2826, simple_loss=0.3732, pruned_loss=0.09603, over 19844.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3567, pruned_loss=0.1058, over 2706236.65 frames. 
], batch size: 702, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:13:56,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=367194.0, ans=0.125 2023-06-19 15:15:25,184 INFO [train.py:996] (3/4) Epoch 3, batch 250, loss[loss=0.3136, simple_loss=0.3744, pruned_loss=0.1263, over 21567.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3554, pruned_loss=0.1071, over 3043103.98 frames. ], batch size: 414, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:15:28,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=367434.0, ans=0.125 2023-06-19 15:16:05,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=367494.0, ans=0.125 2023-06-19 15:16:20,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.820e+02 3.135e+02 3.949e+02 6.710e+02, threshold=6.270e+02, percent-clipped=4.0 2023-06-19 15:16:22,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=367554.0, ans=0.0 2023-06-19 15:17:07,722 INFO [train.py:996] (3/4) Epoch 3, batch 300, loss[loss=0.2526, simple_loss=0.3097, pruned_loss=0.09774, over 21355.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3472, pruned_loss=0.1051, over 3309841.87 frames. ], batch size: 176, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:17:43,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=367794.0, ans=0.125 2023-06-19 15:17:53,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.26 vs. limit=10.0 2023-06-19 15:18:11,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=367854.0, ans=0.2 2023-06-19 15:18:51,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=367914.0, ans=0.0 2023-06-19 15:19:27,142 INFO [train.py:996] (3/4) Epoch 3, batch 350, loss[loss=0.3133, simple_loss=0.3605, pruned_loss=0.133, over 21429.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3389, pruned_loss=0.1033, over 3527506.20 frames. ], batch size: 473, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:19:31,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=368034.0, ans=0.125 2023-06-19 15:19:47,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.10 vs. 
limit=15.0 2023-06-19 15:20:03,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=368094.0, ans=0.0 2023-06-19 15:20:27,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.616e+02 3.033e+02 3.600e+02 6.018e+02, threshold=6.066e+02, percent-clipped=0.0 2023-06-19 15:20:29,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=368154.0, ans=0.09899494936611666 2023-06-19 15:21:09,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=368274.0, ans=0.015 2023-06-19 15:21:13,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=368274.0, ans=0.0 2023-06-19 15:21:20,089 INFO [train.py:996] (3/4) Epoch 3, batch 400, loss[loss=0.2792, simple_loss=0.3707, pruned_loss=0.09387, over 21386.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3293, pruned_loss=0.1005, over 3685230.97 frames. ], batch size: 131, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:21:36,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=368334.0, ans=0.0 2023-06-19 15:21:57,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=368394.0, ans=0.035 2023-06-19 15:22:32,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=368454.0, ans=0.5 2023-06-19 15:22:32,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=368454.0, ans=0.125 2023-06-19 15:22:35,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=368454.0, ans=0.0 2023-06-19 15:22:57,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368514.0, ans=0.1 2023-06-19 15:23:14,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=368574.0, ans=0.0 2023-06-19 15:23:15,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=368574.0, ans=0.0 2023-06-19 15:23:44,682 INFO [train.py:996] (3/4) Epoch 3, batch 450, loss[loss=0.3128, simple_loss=0.331, pruned_loss=0.1473, over 21420.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3255, pruned_loss=0.09933, over 3822957.19 frames. ], batch size: 509, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:23:54,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368634.0, ans=0.1 2023-06-19 15:24:18,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368634.0, ans=0.1 2023-06-19 15:24:53,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.675e+02 2.675e+02 3.174e+02 4.141e+02 7.803e+02, threshold=6.347e+02, percent-clipped=3.0 2023-06-19 15:25:19,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. 
limit=10.0 2023-06-19 15:25:30,357 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:25:36,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=368874.0, ans=0.125 2023-06-19 15:25:38,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=368874.0, ans=0.0 2023-06-19 15:25:59,099 INFO [train.py:996] (3/4) Epoch 3, batch 500, loss[loss=0.2343, simple_loss=0.288, pruned_loss=0.09027, over 21465.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3291, pruned_loss=0.09819, over 3918755.98 frames. ], batch size: 212, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:26:59,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=369054.0, ans=0.0 2023-06-19 15:27:17,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=369054.0, ans=0.125 2023-06-19 15:27:24,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=369114.0, ans=0.125 2023-06-19 15:27:59,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=369174.0, ans=0.0 2023-06-19 15:28:12,602 INFO [train.py:996] (3/4) Epoch 3, batch 550, loss[loss=0.2965, simple_loss=0.3922, pruned_loss=0.1004, over 21654.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3353, pruned_loss=0.0993, over 3999663.24 frames. ], batch size: 414, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:29:01,866 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:29:06,950 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.826e+02 3.270e+02 4.070e+02 7.651e+02, threshold=6.541e+02, percent-clipped=1.0 2023-06-19 15:29:17,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2023-06-19 15:29:40,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=369414.0, ans=0.125 2023-06-19 15:29:40,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=369414.0, ans=0.0 2023-06-19 15:30:02,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=369474.0, ans=0.1 2023-06-19 15:30:15,623 INFO [train.py:996] (3/4) Epoch 3, batch 600, loss[loss=0.2229, simple_loss=0.2686, pruned_loss=0.08857, over 20805.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3364, pruned_loss=0.09911, over 4063105.40 frames. 
], batch size: 609, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:30:38,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=369534.0, ans=0.2 2023-06-19 15:31:33,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=369714.0, ans=0.05 2023-06-19 15:31:41,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369714.0, ans=0.1 2023-06-19 15:32:18,108 INFO [train.py:996] (3/4) Epoch 3, batch 650, loss[loss=0.2782, simple_loss=0.3364, pruned_loss=0.11, over 21889.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3378, pruned_loss=0.09975, over 4113034.58 frames. ], batch size: 332, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:32:38,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=369894.0, ans=0.125 2023-06-19 15:32:43,483 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-19 15:33:22,686 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.840e+02 3.848e+02 4.493e+02 8.755e+02, threshold=7.695e+02, percent-clipped=3.0 2023-06-19 15:33:47,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2023-06-19 15:34:26,364 INFO [train.py:996] (3/4) Epoch 3, batch 700, loss[loss=0.3592, simple_loss=0.4275, pruned_loss=0.1454, over 21689.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3426, pruned_loss=0.1014, over 4153355.65 frames. ], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:35:18,895 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:36:29,392 INFO [train.py:996] (3/4) Epoch 3, batch 750, loss[loss=0.2643, simple_loss=0.3156, pruned_loss=0.1065, over 21592.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3405, pruned_loss=0.1021, over 4190732.03 frames. ], batch size: 391, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:37:08,532 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-19 15:37:15,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=370494.0, ans=0.125 2023-06-19 15:37:26,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=370554.0, ans=0.1 2023-06-19 15:37:27,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=370554.0, ans=0.2 2023-06-19 15:37:31,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.860e+02 3.172e+02 4.099e+02 8.438e+02, threshold=6.343e+02, percent-clipped=1.0 2023-06-19 15:38:11,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=370674.0, ans=0.125 2023-06-19 15:38:37,261 INFO [train.py:996] (3/4) Epoch 3, batch 800, loss[loss=0.3233, simple_loss=0.3652, pruned_loss=0.1407, over 21584.00 frames. 
], tot_loss[loss=0.2693, simple_loss=0.3365, pruned_loss=0.1011, over 4210940.84 frames. ], batch size: 471, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:38:50,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=370794.0, ans=0.0 2023-06-19 15:39:52,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-19 15:40:36,938 INFO [train.py:996] (3/4) Epoch 3, batch 850, loss[loss=0.2653, simple_loss=0.319, pruned_loss=0.1058, over 21573.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3342, pruned_loss=0.1008, over 4225474.93 frames. ], batch size: 194, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:40:49,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-19 15:41:35,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.741e+02 3.074e+02 3.686e+02 5.946e+02, threshold=6.148e+02, percent-clipped=0.0 2023-06-19 15:41:35,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=371154.0, ans=0.125 2023-06-19 15:41:44,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=371214.0, ans=0.07 2023-06-19 15:42:11,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=371214.0, ans=0.1 2023-06-19 15:42:29,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=371274.0, ans=0.0 2023-06-19 15:42:37,788 INFO [train.py:996] (3/4) Epoch 3, batch 900, loss[loss=0.208, simple_loss=0.2828, pruned_loss=0.06663, over 21243.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3302, pruned_loss=0.09912, over 4245003.75 frames. ], batch size: 159, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:42:53,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=371334.0, ans=0.125 2023-06-19 15:42:55,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=371334.0, ans=0.2 2023-06-19 15:43:00,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=371394.0, ans=0.0 2023-06-19 15:43:19,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-19 15:43:23,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=371454.0, ans=0.125 2023-06-19 15:44:09,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=371514.0, ans=0.04949747468305833 2023-06-19 15:44:20,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=371574.0, ans=0.125 2023-06-19 15:44:40,530 INFO [train.py:996] (3/4) Epoch 3, batch 950, loss[loss=0.2459, simple_loss=0.3339, pruned_loss=0.07897, over 21746.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3296, pruned_loss=0.09869, over 4258229.86 frames. 
], batch size: 332, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:44:58,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=371634.0, ans=0.125 2023-06-19 15:45:36,729 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.723e+02 3.079e+02 3.859e+02 5.682e+02, threshold=6.158e+02, percent-clipped=0.0 2023-06-19 15:45:43,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=371754.0, ans=0.125 2023-06-19 15:45:48,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=371814.0, ans=0.125 2023-06-19 15:46:18,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=371814.0, ans=15.0 2023-06-19 15:46:46,774 INFO [train.py:996] (3/4) Epoch 3, batch 1000, loss[loss=0.2389, simple_loss=0.3091, pruned_loss=0.08429, over 21770.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3293, pruned_loss=0.09881, over 4272612.74 frames. ], batch size: 247, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:47:47,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-19 15:47:47,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=372054.0, ans=0.1 2023-06-19 15:48:39,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=372174.0, ans=0.125 2023-06-19 15:49:10,913 INFO [train.py:996] (3/4) Epoch 3, batch 1050, loss[loss=0.2847, simple_loss=0.3343, pruned_loss=0.1175, over 21570.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.329, pruned_loss=0.09882, over 4281410.39 frames. ], batch size: 471, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:49:18,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2023-06-19 15:49:35,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=372294.0, ans=0.0 2023-06-19 15:49:41,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2023-06-19 15:49:42,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=372294.0, ans=0.2 2023-06-19 15:49:49,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=372354.0, ans=0.125 2023-06-19 15:49:53,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.550e+02 3.026e+02 4.001e+02 6.814e+02, threshold=6.053e+02, percent-clipped=1.0 2023-06-19 15:50:19,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=11.63 vs. limit=15.0 2023-06-19 15:51:06,129 INFO [train.py:996] (3/4) Epoch 3, batch 1100, loss[loss=0.2765, simple_loss=0.346, pruned_loss=0.1035, over 21428.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3283, pruned_loss=0.09816, over 4274128.85 frames. 
], batch size: 194, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:51:52,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372594.0, ans=0.1 2023-06-19 15:51:53,766 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:53:23,724 INFO [train.py:996] (3/4) Epoch 3, batch 1150, loss[loss=0.231, simple_loss=0.3001, pruned_loss=0.08093, over 21367.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3279, pruned_loss=0.09866, over 4273222.34 frames. ], batch size: 131, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:54:36,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.547e+02 3.199e+02 3.726e+02 8.923e+02, threshold=6.397e+02, percent-clipped=6.0 2023-06-19 15:55:22,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=373074.0, ans=0.0 2023-06-19 15:55:35,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=373074.0, ans=0.1 2023-06-19 15:55:37,477 INFO [train.py:996] (3/4) Epoch 3, batch 1200, loss[loss=0.2325, simple_loss=0.3098, pruned_loss=0.07759, over 21272.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3293, pruned_loss=0.09899, over 4278424.10 frames. ], batch size: 176, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:56:43,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=373254.0, ans=0.025 2023-06-19 15:56:48,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.06 vs. limit=22.5 2023-06-19 15:56:50,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-19 15:56:59,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-19 15:57:45,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=12.0 2023-06-19 15:57:45,879 INFO [train.py:996] (3/4) Epoch 3, batch 1250, loss[loss=0.2597, simple_loss=0.3283, pruned_loss=0.09558, over 21848.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3314, pruned_loss=0.09937, over 4283719.14 frames. ], batch size: 107, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:58:01,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=373434.0, ans=0.125 2023-06-19 15:58:11,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. 
limit=22.5 2023-06-19 15:58:33,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=373494.0, ans=0.125 2023-06-19 15:58:56,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.686e+02 3.168e+02 3.917e+02 6.417e+02, threshold=6.337e+02, percent-clipped=1.0 2023-06-19 15:59:34,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=373674.0, ans=0.07 2023-06-19 15:59:44,116 INFO [train.py:996] (3/4) Epoch 3, batch 1300, loss[loss=0.2915, simple_loss=0.3535, pruned_loss=0.1147, over 21732.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.334, pruned_loss=0.1003, over 4282886.69 frames. ], batch size: 332, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:59:59,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=373794.0, ans=0.125 2023-06-19 16:01:52,289 INFO [train.py:996] (3/4) Epoch 3, batch 1350, loss[loss=0.2436, simple_loss=0.3117, pruned_loss=0.08771, over 21438.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.335, pruned_loss=0.101, over 4289270.80 frames. ], batch size: 194, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:02:01,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=374034.0, ans=0.125 2023-06-19 16:02:43,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=374154.0, ans=0.125 2023-06-19 16:02:47,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-19 16:02:52,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.921e+02 3.498e+02 4.356e+02 8.229e+02, threshold=6.996e+02, percent-clipped=3.0 2023-06-19 16:03:51,338 INFO [train.py:996] (3/4) Epoch 3, batch 1400, loss[loss=0.2479, simple_loss=0.3054, pruned_loss=0.09516, over 21753.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3348, pruned_loss=0.1007, over 4295080.50 frames. ], batch size: 124, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:04:24,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=374394.0, ans=0.1 2023-06-19 16:04:57,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=374454.0, ans=0.1 2023-06-19 16:05:26,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=374514.0, ans=0.125 2023-06-19 16:05:54,405 INFO [train.py:996] (3/4) Epoch 3, batch 1450, loss[loss=0.258, simple_loss=0.3062, pruned_loss=0.1049, over 21636.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3357, pruned_loss=0.1021, over 4291973.70 frames. ], batch size: 415, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:06:33,309 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. 
limit=22.5 2023-06-19 16:06:34,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=374694.0, ans=0.125 2023-06-19 16:06:54,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=374754.0, ans=0.0 2023-06-19 16:06:55,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.676e+02 2.954e+02 3.782e+02 5.807e+02, threshold=5.909e+02, percent-clipped=0.0 2023-06-19 16:07:18,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=374814.0, ans=0.2 2023-06-19 16:08:00,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.81 vs. limit=6.0 2023-06-19 16:08:02,919 INFO [train.py:996] (3/4) Epoch 3, batch 1500, loss[loss=0.2906, simple_loss=0.3517, pruned_loss=0.1148, over 21738.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3345, pruned_loss=0.1026, over 4299771.52 frames. ], batch size: 112, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:08:42,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.10 vs. limit=22.5 2023-06-19 16:09:00,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=375054.0, ans=0.125 2023-06-19 16:09:01,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=375054.0, ans=0.125 2023-06-19 16:10:02,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-19 16:10:02,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=375174.0, ans=0.125 2023-06-19 16:10:03,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=375174.0, ans=0.2 2023-06-19 16:10:12,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.00 vs. limit=10.0 2023-06-19 16:10:13,228 INFO [train.py:996] (3/4) Epoch 3, batch 1550, loss[loss=0.2004, simple_loss=0.281, pruned_loss=0.05989, over 21137.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3318, pruned_loss=0.1008, over 4300898.85 frames. ], batch size: 143, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:10:35,181 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:10:47,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=375294.0, ans=0.125 2023-06-19 16:11:10,283 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.586e+02 2.974e+02 3.700e+02 5.786e+02, threshold=5.949e+02, percent-clipped=0.0 2023-06-19 16:11:32,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=375414.0, ans=0.0 2023-06-19 16:12:22,182 INFO [train.py:996] (3/4) Epoch 3, batch 1600, loss[loss=0.3211, simple_loss=0.3905, pruned_loss=0.1259, over 21635.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3291, pruned_loss=0.09917, over 4296868.15 frames. 
], batch size: 414, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:12:34,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=375534.0, ans=0.125 2023-06-19 16:12:51,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=375594.0, ans=0.0 2023-06-19 16:13:04,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=375594.0, ans=0.2 2023-06-19 16:13:07,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=375654.0, ans=0.125 2023-06-19 16:13:28,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-19 16:13:31,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-19 16:14:36,775 INFO [train.py:996] (3/4) Epoch 3, batch 1650, loss[loss=0.2362, simple_loss=0.287, pruned_loss=0.09273, over 21472.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3281, pruned_loss=0.0987, over 4292279.80 frames. ], batch size: 212, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:15:44,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.739e+02 3.147e+02 3.944e+02 6.533e+02, threshold=6.293e+02, percent-clipped=5.0 2023-06-19 16:15:54,976 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:16:31,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-19 16:17:09,617 INFO [train.py:996] (3/4) Epoch 3, batch 1700, loss[loss=0.2827, simple_loss=0.3417, pruned_loss=0.1119, over 21593.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3345, pruned_loss=0.1011, over 4290361.24 frames. ], batch size: 263, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:18:00,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=376194.0, ans=0.09899494936611666 2023-06-19 16:18:13,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.55 vs. limit=6.0 2023-06-19 16:18:39,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=376314.0, ans=0.125 2023-06-19 16:19:35,964 INFO [train.py:996] (3/4) Epoch 3, batch 1750, loss[loss=0.2702, simple_loss=0.3216, pruned_loss=0.1094, over 20144.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3342, pruned_loss=0.09986, over 4289506.81 frames. 
], batch size: 704, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:19:52,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=376434.0, ans=0.1 2023-06-19 16:20:28,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=376554.0, ans=0.0 2023-06-19 16:20:31,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=376554.0, ans=0.2 2023-06-19 16:20:46,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.780e+02 3.176e+02 3.729e+02 7.377e+02, threshold=6.353e+02, percent-clipped=1.0 2023-06-19 16:21:59,648 INFO [train.py:996] (3/4) Epoch 3, batch 1800, loss[loss=0.2626, simple_loss=0.3312, pruned_loss=0.09694, over 21047.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3283, pruned_loss=0.0958, over 4277481.34 frames. ], batch size: 607, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:24:08,973 INFO [train.py:996] (3/4) Epoch 3, batch 1850, loss[loss=0.2732, simple_loss=0.3397, pruned_loss=0.1033, over 21425.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3289, pruned_loss=0.09331, over 4276231.99 frames. ], batch size: 144, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:24:10,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=377034.0, ans=0.0 2023-06-19 16:24:34,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=377034.0, ans=0.0 2023-06-19 16:25:25,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=377154.0, ans=0.125 2023-06-19 16:25:26,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=377154.0, ans=0.125 2023-06-19 16:25:27,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.637e+02 3.036e+02 3.803e+02 8.113e+02, threshold=6.071e+02, percent-clipped=1.0 2023-06-19 16:25:39,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=377214.0, ans=0.2 2023-06-19 16:25:40,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=377214.0, ans=0.0 2023-06-19 16:26:10,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=377274.0, ans=0.1 2023-06-19 16:26:13,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=22.5 2023-06-19 16:26:15,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=377274.0, ans=0.125 2023-06-19 16:26:15,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=377274.0, ans=0.0 2023-06-19 16:26:29,759 INFO [train.py:996] (3/4) Epoch 3, batch 1900, loss[loss=0.3271, simple_loss=0.356, pruned_loss=0.1491, over 21801.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3288, pruned_loss=0.094, over 4271480.24 frames. 
], batch size: 508, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:27:42,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=377514.0, ans=0.1 2023-06-19 16:28:36,276 INFO [train.py:996] (3/4) Epoch 3, batch 1950, loss[loss=0.21, simple_loss=0.2957, pruned_loss=0.06211, over 21611.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3249, pruned_loss=0.09451, over 4278170.08 frames. ], batch size: 263, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:29:05,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=377634.0, ans=0.015 2023-06-19 16:29:48,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=22.5 2023-06-19 16:29:55,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=377754.0, ans=0.1 2023-06-19 16:29:56,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.787e+02 3.264e+02 3.675e+02 5.955e+02, threshold=6.529e+02, percent-clipped=0.0 2023-06-19 16:30:00,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-19 16:30:14,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=377814.0, ans=0.125 2023-06-19 16:30:21,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=377814.0, ans=0.2 2023-06-19 16:30:51,317 INFO [train.py:996] (3/4) Epoch 3, batch 2000, loss[loss=0.1832, simple_loss=0.2508, pruned_loss=0.05782, over 21224.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3186, pruned_loss=0.09202, over 4265317.65 frames. ], batch size: 159, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:31:32,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=377994.0, ans=0.125 2023-06-19 16:32:58,797 INFO [train.py:996] (3/4) Epoch 3, batch 2050, loss[loss=0.29, simple_loss=0.3505, pruned_loss=0.1148, over 21716.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3214, pruned_loss=0.09317, over 4276204.34 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:33:50,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=378354.0, ans=0.025 2023-06-19 16:34:07,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.758e+02 3.171e+02 3.876e+02 8.323e+02, threshold=6.343e+02, percent-clipped=2.0 2023-06-19 16:34:54,472 INFO [train.py:996] (3/4) Epoch 3, batch 2100, loss[loss=0.2982, simple_loss=0.36, pruned_loss=0.1182, over 21865.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3255, pruned_loss=0.09525, over 4282275.35 frames. 
], batch size: 98, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:35:04,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=378534.0, ans=0.125 2023-06-19 16:35:05,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378534.0, ans=0.1 2023-06-19 16:35:28,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.86 vs. limit=15.0 2023-06-19 16:36:24,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378714.0, ans=0.1 2023-06-19 16:37:16,905 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.08 vs. limit=10.0 2023-06-19 16:37:17,378 INFO [train.py:996] (3/4) Epoch 3, batch 2150, loss[loss=0.2096, simple_loss=0.2961, pruned_loss=0.06155, over 20827.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3237, pruned_loss=0.09548, over 4274414.48 frames. ], batch size: 608, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:37:38,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=378894.0, ans=0.04949747468305833 2023-06-19 16:38:06,563 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-19 16:38:11,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=378954.0, ans=0.125 2023-06-19 16:38:23,712 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.947e+02 3.794e+02 4.834e+02 7.445e+02, threshold=7.587e+02, percent-clipped=4.0 2023-06-19 16:39:18,295 INFO [train.py:996] (3/4) Epoch 3, batch 2200, loss[loss=0.2472, simple_loss=0.3007, pruned_loss=0.09683, over 21446.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.328, pruned_loss=0.09584, over 4273595.85 frames. ], batch size: 177, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:39:23,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=379134.0, ans=0.125 2023-06-19 16:41:09,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=379374.0, ans=0.04949747468305833 2023-06-19 16:41:20,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=379374.0, ans=0.07 2023-06-19 16:41:38,422 INFO [train.py:996] (3/4) Epoch 3, batch 2250, loss[loss=0.2633, simple_loss=0.3153, pruned_loss=0.1057, over 21587.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3254, pruned_loss=0.09395, over 4266985.58 frames. 
], batch size: 414, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:42:27,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=379494.0, ans=0.0 2023-06-19 16:42:47,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.563e+02 3.119e+02 3.980e+02 5.506e+02, threshold=6.238e+02, percent-clipped=0.0 2023-06-19 16:43:17,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=379614.0, ans=0.2 2023-06-19 16:43:18,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=379674.0, ans=0.0 2023-06-19 16:43:46,264 INFO [train.py:996] (3/4) Epoch 3, batch 2300, loss[loss=0.2913, simple_loss=0.3634, pruned_loss=0.1097, over 20687.00 frames. ], tot_loss[loss=0.255, simple_loss=0.322, pruned_loss=0.09398, over 4273188.56 frames. ], batch size: 607, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:43:54,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=15.0 2023-06-19 16:45:02,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=379854.0, ans=0.125 2023-06-19 16:45:14,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=379914.0, ans=0.1 2023-06-19 16:45:31,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=379974.0, ans=0.1 2023-06-19 16:45:41,091 INFO [train.py:996] (3/4) Epoch 3, batch 2350, loss[loss=0.2786, simple_loss=0.335, pruned_loss=0.1111, over 21536.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3207, pruned_loss=0.09464, over 4262837.57 frames. ], batch size: 230, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:46:17,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=380034.0, ans=10.0 2023-06-19 16:46:31,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-19 16:46:32,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=380094.0, ans=0.125 2023-06-19 16:46:58,126 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.669e+02 3.076e+02 3.679e+02 5.519e+02, threshold=6.152e+02, percent-clipped=0.0 2023-06-19 16:48:11,842 INFO [train.py:996] (3/4) Epoch 3, batch 2400, loss[loss=0.268, simple_loss=0.3366, pruned_loss=0.09971, over 21470.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3252, pruned_loss=0.09791, over 4267387.44 frames. ], batch size: 131, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:48:27,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=380334.0, ans=0.0 2023-06-19 16:48:37,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=380394.0, ans=0.1 2023-06-19 16:50:00,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.92 vs. 
limit=15.0 2023-06-19 16:50:33,695 INFO [train.py:996] (3/4) Epoch 3, batch 2450, loss[loss=0.2872, simple_loss=0.3557, pruned_loss=0.1093, over 21190.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3308, pruned_loss=0.1, over 4268288.91 frames. ], batch size: 143, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:51:25,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=380754.0, ans=0.125 2023-06-19 16:51:32,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.697e+02 3.047e+02 3.544e+02 7.014e+02, threshold=6.094e+02, percent-clipped=1.0 2023-06-19 16:51:32,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=380754.0, ans=0.05 2023-06-19 16:51:34,402 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=22.5 2023-06-19 16:51:36,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=380814.0, ans=0.2 2023-06-19 16:52:22,741 INFO [train.py:996] (3/4) Epoch 3, batch 2500, loss[loss=0.2516, simple_loss=0.3409, pruned_loss=0.08119, over 21688.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3294, pruned_loss=0.09893, over 4266029.13 frames. ], batch size: 332, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:52:23,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=380934.0, ans=0.125 2023-06-19 16:53:40,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=381114.0, ans=0.125 2023-06-19 16:53:42,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=381114.0, ans=0.0 2023-06-19 16:54:19,207 INFO [train.py:996] (3/4) Epoch 3, batch 2550, loss[loss=0.2384, simple_loss=0.3505, pruned_loss=0.06316, over 19683.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3275, pruned_loss=0.0973, over 4263505.35 frames. ], batch size: 702, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:55:19,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-19 16:55:27,975 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.09 vs. limit=5.0 2023-06-19 16:55:33,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.648e+02 3.147e+02 3.811e+02 6.835e+02, threshold=6.294e+02, percent-clipped=1.0 2023-06-19 16:56:09,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=381474.0, ans=0.125 2023-06-19 16:56:34,138 INFO [train.py:996] (3/4) Epoch 3, batch 2600, loss[loss=0.3039, simple_loss=0.3572, pruned_loss=0.1253, over 21580.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3288, pruned_loss=0.09867, over 4262898.31 frames. ], batch size: 415, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:57:23,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.08 vs. 
limit=15.0 2023-06-19 16:57:28,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=381654.0, ans=0.1 2023-06-19 16:57:48,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=381714.0, ans=0.0 2023-06-19 16:57:58,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=381714.0, ans=0.125 2023-06-19 16:58:59,105 INFO [train.py:996] (3/4) Epoch 3, batch 2650, loss[loss=0.3102, simple_loss=0.3711, pruned_loss=0.1246, over 21593.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3312, pruned_loss=0.09983, over 4267725.83 frames. ], batch size: 389, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:59:25,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-06-19 17:00:02,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.896e+02 3.459e+02 3.981e+02 6.985e+02, threshold=6.919e+02, percent-clipped=4.0 2023-06-19 17:01:11,823 INFO [train.py:996] (3/4) Epoch 3, batch 2700, loss[loss=0.2652, simple_loss=0.3325, pruned_loss=0.09898, over 21803.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3272, pruned_loss=0.09781, over 4263228.80 frames. ], batch size: 351, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:02:24,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=382314.0, ans=0.0 2023-06-19 17:02:31,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=382314.0, ans=0.0 2023-06-19 17:03:03,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=382374.0, ans=0.2 2023-06-19 17:03:20,416 INFO [train.py:996] (3/4) Epoch 3, batch 2750, loss[loss=0.2823, simple_loss=0.3431, pruned_loss=0.1108, over 21719.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3263, pruned_loss=0.09839, over 4266806.70 frames. ], batch size: 389, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:03:56,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=382494.0, ans=0.125 2023-06-19 17:04:33,263 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.812e+02 3.458e+02 3.888e+02 7.269e+02, threshold=6.916e+02, percent-clipped=2.0 2023-06-19 17:05:43,916 INFO [train.py:996] (3/4) Epoch 3, batch 2800, loss[loss=0.3071, simple_loss=0.376, pruned_loss=0.1191, over 21849.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3306, pruned_loss=0.1004, over 4272940.32 frames. 
], batch size: 316, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:06:28,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=382794.0, ans=0.0 2023-06-19 17:06:45,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=382854.0, ans=0.125 2023-06-19 17:06:48,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=382914.0, ans=0.125 2023-06-19 17:06:49,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=382914.0, ans=0.2 2023-06-19 17:07:08,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=382914.0, ans=0.0 2023-06-19 17:07:10,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=382914.0, ans=0.125 2023-06-19 17:07:48,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=382974.0, ans=0.125 2023-06-19 17:07:51,275 INFO [train.py:996] (3/4) Epoch 3, batch 2850, loss[loss=0.2116, simple_loss=0.2844, pruned_loss=0.06935, over 21558.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3309, pruned_loss=0.1008, over 4276587.75 frames. ], batch size: 212, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:08:46,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=383154.0, ans=0.125 2023-06-19 17:08:55,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=383154.0, ans=0.125 2023-06-19 17:08:57,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.123e+02 2.952e+02 3.438e+02 4.041e+02 6.558e+02, threshold=6.876e+02, percent-clipped=0.0 2023-06-19 17:09:20,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=383214.0, ans=0.125 2023-06-19 17:09:45,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-19 17:10:03,353 INFO [train.py:996] (3/4) Epoch 3, batch 2900, loss[loss=0.2764, simple_loss=0.3287, pruned_loss=0.112, over 21891.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3291, pruned_loss=0.1006, over 4274322.02 frames. ], batch size: 351, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:11:29,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=383454.0, ans=0.125 2023-06-19 17:12:16,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383634.0, ans=0.1 2023-06-19 17:12:17,738 INFO [train.py:996] (3/4) Epoch 3, batch 2950, loss[loss=0.2573, simple_loss=0.3383, pruned_loss=0.08813, over 21799.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3315, pruned_loss=0.1016, over 4279095.31 frames. 
], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:12:18,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=383634.0, ans=0.2 2023-06-19 17:12:19,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=383634.0, ans=0.125 2023-06-19 17:13:27,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.666e+02 3.329e+02 3.985e+02 6.298e+02, threshold=6.658e+02, percent-clipped=0.0 2023-06-19 17:13:28,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=22.5 2023-06-19 17:14:27,048 INFO [train.py:996] (3/4) Epoch 3, batch 3000, loss[loss=0.2925, simple_loss=0.3515, pruned_loss=0.1168, over 20650.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.335, pruned_loss=0.1026, over 4281018.44 frames. ], batch size: 607, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:14:27,049 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 17:15:26,750 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2641, simple_loss=0.3582, pruned_loss=0.08497, over 1796401.00 frames. 2023-06-19 17:15:26,751 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-19 17:15:30,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=383934.0, ans=0.125 2023-06-19 17:17:32,336 INFO [train.py:996] (3/4) Epoch 3, batch 3050, loss[loss=0.2037, simple_loss=0.2742, pruned_loss=0.06663, over 21319.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3348, pruned_loss=0.1005, over 4280652.14 frames. ], batch size: 176, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:17:35,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=384234.0, ans=0.07 2023-06-19 17:17:46,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=384294.0, ans=0.125 2023-06-19 17:18:36,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 2.612e+02 3.068e+02 3.884e+02 6.954e+02, threshold=6.136e+02, percent-clipped=1.0 2023-06-19 17:19:18,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=384474.0, ans=0.125 2023-06-19 17:19:24,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=384534.0, ans=0.125 2023-06-19 17:19:24,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=384534.0, ans=0.025 2023-06-19 17:19:25,762 INFO [train.py:996] (3/4) Epoch 3, batch 3100, loss[loss=0.2293, simple_loss=0.3109, pruned_loss=0.0738, over 21694.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3338, pruned_loss=0.09852, over 4290938.93 frames. ], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:20:33,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=384654.0, ans=0.0 2023-06-19 17:21:13,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. 
limit=15.0 2023-06-19 17:21:32,968 INFO [train.py:996] (3/4) Epoch 3, batch 3150, loss[loss=0.2268, simple_loss=0.3132, pruned_loss=0.07023, over 21590.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3355, pruned_loss=0.09981, over 4284757.75 frames. ], batch size: 263, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:21:43,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=384834.0, ans=0.1 2023-06-19 17:22:54,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.666e+02 3.183e+02 4.146e+02 7.472e+02, threshold=6.366e+02, percent-clipped=3.0 2023-06-19 17:24:00,456 INFO [train.py:996] (3/4) Epoch 3, batch 3200, loss[loss=0.2407, simple_loss=0.3175, pruned_loss=0.08195, over 21673.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3354, pruned_loss=0.09949, over 4280905.16 frames. ], batch size: 298, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:24:34,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=385194.0, ans=0.1 2023-06-19 17:25:53,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=385374.0, ans=0.1 2023-06-19 17:25:56,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=385374.0, ans=0.125 2023-06-19 17:26:14,503 INFO [train.py:996] (3/4) Epoch 3, batch 3250, loss[loss=0.3015, simple_loss=0.3581, pruned_loss=0.1225, over 21466.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3359, pruned_loss=0.1016, over 4274040.66 frames. ], batch size: 211, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:26:51,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=385494.0, ans=0.125 2023-06-19 17:27:15,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.186e+02 3.700e+02 4.457e+02 5.967e+02, threshold=7.400e+02, percent-clipped=0.0 2023-06-19 17:28:25,314 INFO [train.py:996] (3/4) Epoch 3, batch 3300, loss[loss=0.263, simple_loss=0.3278, pruned_loss=0.09907, over 21490.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3305, pruned_loss=0.1007, over 4270372.43 frames. ], batch size: 389, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:28:52,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=385794.0, ans=0.0 2023-06-19 17:29:18,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=385794.0, ans=0.0 2023-06-19 17:29:33,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-19 17:29:50,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=385854.0, ans=0.0 2023-06-19 17:30:48,663 INFO [train.py:996] (3/4) Epoch 3, batch 3350, loss[loss=0.2581, simple_loss=0.3166, pruned_loss=0.09977, over 21829.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3312, pruned_loss=0.1003, over 4258806.75 frames. 
], batch size: 282, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:31:12,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=386094.0, ans=0.0 2023-06-19 17:32:02,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.950e+02 3.389e+02 4.289e+02 7.899e+02, threshold=6.778e+02, percent-clipped=1.0 2023-06-19 17:32:15,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=386214.0, ans=0.1 2023-06-19 17:33:05,259 INFO [train.py:996] (3/4) Epoch 3, batch 3400, loss[loss=0.2731, simple_loss=0.3551, pruned_loss=0.09553, over 21479.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3331, pruned_loss=0.1008, over 4265374.26 frames. ], batch size: 211, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:33:16,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=386334.0, ans=0.125 2023-06-19 17:33:34,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=386394.0, ans=0.125 2023-06-19 17:35:07,130 INFO [train.py:996] (3/4) Epoch 3, batch 3450, loss[loss=0.3406, simple_loss=0.3686, pruned_loss=0.1563, over 21381.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.33, pruned_loss=0.1006, over 4258030.01 frames. ], batch size: 507, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:36:23,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.906e+02 3.291e+02 4.087e+02 6.835e+02, threshold=6.581e+02, percent-clipped=1.0 2023-06-19 17:36:30,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-19 17:37:03,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=386874.0, ans=0.2 2023-06-19 17:37:07,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.10 vs. limit=15.0 2023-06-19 17:37:08,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=386874.0, ans=0.0 2023-06-19 17:37:16,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.48 vs. limit=10.0 2023-06-19 17:37:16,557 INFO [train.py:996] (3/4) Epoch 3, batch 3500, loss[loss=0.2426, simple_loss=0.3392, pruned_loss=0.07297, over 19823.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3387, pruned_loss=0.1035, over 4252115.55 frames. ], batch size: 703, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:38:16,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=387054.0, ans=0.2 2023-06-19 17:38:23,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=387054.0, ans=0.5 2023-06-19 17:39:15,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-19 17:39:34,542 INFO [train.py:996] (3/4) Epoch 3, batch 3550, loss[loss=0.2321, simple_loss=0.276, pruned_loss=0.09408, over 20221.00 frames. 
], tot_loss[loss=0.2749, simple_loss=0.3404, pruned_loss=0.1047, over 4251322.13 frames. ], batch size: 703, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:40:10,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=387294.0, ans=0.2 2023-06-19 17:40:23,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=387294.0, ans=0.1 2023-06-19 17:40:26,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=387354.0, ans=0.125 2023-06-19 17:40:51,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 2.929e+02 3.306e+02 4.196e+02 6.685e+02, threshold=6.611e+02, percent-clipped=1.0 2023-06-19 17:40:51,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=387354.0, ans=0.125 2023-06-19 17:41:52,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=387534.0, ans=0.2 2023-06-19 17:41:53,763 INFO [train.py:996] (3/4) Epoch 3, batch 3600, loss[loss=0.3035, simple_loss=0.3596, pruned_loss=0.1237, over 21602.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3374, pruned_loss=0.1043, over 4252506.25 frames. ], batch size: 263, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:43:47,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=387774.0, ans=0.125 2023-06-19 17:44:19,134 INFO [train.py:996] (3/4) Epoch 3, batch 3650, loss[loss=0.2364, simple_loss=0.3036, pruned_loss=0.08462, over 21466.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3398, pruned_loss=0.1042, over 4253817.19 frames. ], batch size: 194, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:45:31,523 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.846e+02 3.301e+02 4.049e+02 6.625e+02, threshold=6.601e+02, percent-clipped=2.0 2023-06-19 17:45:35,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=388014.0, ans=0.0 2023-06-19 17:46:25,667 INFO [train.py:996] (3/4) Epoch 3, batch 3700, loss[loss=0.2504, simple_loss=0.323, pruned_loss=0.08893, over 21877.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3385, pruned_loss=0.104, over 4257139.79 frames. ], batch size: 371, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:48:56,821 INFO [train.py:996] (3/4) Epoch 3, batch 3750, loss[loss=0.1955, simple_loss=0.2354, pruned_loss=0.0778, over 17107.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.337, pruned_loss=0.1034, over 4254569.91 frames. ], batch size: 63, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:49:06,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0 2023-06-19 17:50:07,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 3.161e+02 3.639e+02 4.340e+02 7.555e+02, threshold=7.277e+02, percent-clipped=2.0 2023-06-19 17:50:23,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=388614.0, ans=0.125 2023-06-19 17:50:56,768 INFO [train.py:996] (3/4) Epoch 3, batch 3800, loss[loss=0.2792, simple_loss=0.3413, pruned_loss=0.1085, over 21254.00 frames. 
], tot_loss[loss=0.2706, simple_loss=0.3358, pruned_loss=0.1027, over 4263087.26 frames. ], batch size: 159, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:51:44,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=388794.0, ans=0.2 2023-06-19 17:51:53,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=388794.0, ans=0.2 2023-06-19 17:51:54,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=388794.0, ans=0.0 2023-06-19 17:53:09,363 INFO [train.py:996] (3/4) Epoch 3, batch 3850, loss[loss=0.2174, simple_loss=0.2756, pruned_loss=0.07964, over 21668.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3315, pruned_loss=0.1028, over 4269274.51 frames. ], batch size: 282, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:53:13,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=22.5 2023-06-19 17:54:03,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=389154.0, ans=0.0 2023-06-19 17:54:12,932 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.931e+02 3.555e+02 4.316e+02 7.141e+02, threshold=7.110e+02, percent-clipped=0.0 2023-06-19 17:54:34,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=389214.0, ans=0.0 2023-06-19 17:54:47,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-19 17:55:21,648 INFO [train.py:996] (3/4) Epoch 3, batch 3900, loss[loss=0.2981, simple_loss=0.345, pruned_loss=0.1256, over 21746.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3284, pruned_loss=0.1029, over 4275606.38 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:56:43,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=389514.0, ans=0.0 2023-06-19 17:57:02,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=389574.0, ans=0.0 2023-06-19 17:57:28,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=389574.0, ans=0.07 2023-06-19 17:57:29,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=389574.0, ans=0.125 2023-06-19 17:57:32,063 INFO [train.py:996] (3/4) Epoch 3, batch 3950, loss[loss=0.184, simple_loss=0.2621, pruned_loss=0.05294, over 21761.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3274, pruned_loss=0.1012, over 4276281.61 frames. 
], batch size: 282, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:58:05,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389694.0, ans=0.1 2023-06-19 17:58:07,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=389694.0, ans=0.125 2023-06-19 17:58:36,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=389754.0, ans=0.0 2023-06-19 17:58:44,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-19 17:58:47,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.533e+02 2.923e+02 3.801e+02 7.027e+02, threshold=5.846e+02, percent-clipped=0.0 2023-06-19 17:58:51,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389814.0, ans=0.1 2023-06-19 17:59:41,385 INFO [train.py:996] (3/4) Epoch 3, batch 4000, loss[loss=0.2442, simple_loss=0.2998, pruned_loss=0.09425, over 21829.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3215, pruned_loss=0.09706, over 4271122.36 frames. ], batch size: 107, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:59:41,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389934.0, ans=0.1 2023-06-19 17:59:46,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=389934.0, ans=0.125 2023-06-19 18:00:21,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=389994.0, ans=0.0 2023-06-19 18:00:35,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390054.0, ans=0.1 2023-06-19 18:01:10,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=12.0 2023-06-19 18:01:23,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=390114.0, ans=0.125 2023-06-19 18:01:44,574 INFO [train.py:996] (3/4) Epoch 3, batch 4050, loss[loss=0.2502, simple_loss=0.3009, pruned_loss=0.0997, over 21474.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3208, pruned_loss=0.09451, over 4265788.24 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:01:57,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=390234.0, ans=0.125 2023-06-19 18:02:49,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5 2023-06-19 18:02:50,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=390354.0, ans=0.125 2023-06-19 18:03:07,896 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.511e+02 2.852e+02 3.605e+02 5.912e+02, threshold=5.705e+02, percent-clipped=1.0 2023-06-19 18:03:59,018 INFO [train.py:996] (3/4) Epoch 3, batch 4100, loss[loss=0.2268, simple_loss=0.3069, pruned_loss=0.07335, over 21627.00 frames. 
], tot_loss[loss=0.2582, simple_loss=0.325, pruned_loss=0.0957, over 4270089.06 frames. ], batch size: 263, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:04:40,115 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:05:33,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=390714.0, ans=0.2 2023-06-19 18:05:33,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=390714.0, ans=0.2 2023-06-19 18:05:55,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=390774.0, ans=0.125 2023-06-19 18:06:25,551 INFO [train.py:996] (3/4) Epoch 3, batch 4150, loss[loss=0.2649, simple_loss=0.3382, pruned_loss=0.09584, over 21580.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3234, pruned_loss=0.09194, over 4277625.13 frames. ], batch size: 389, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:06:49,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=390834.0, ans=0.2 2023-06-19 18:06:55,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390894.0, ans=0.1 2023-06-19 18:07:30,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 2.414e+02 2.926e+02 3.479e+02 5.759e+02, threshold=5.851e+02, percent-clipped=1.0 2023-06-19 18:07:32,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=391014.0, ans=0.2 2023-06-19 18:08:35,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=391134.0, ans=0.125 2023-06-19 18:08:36,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=391134.0, ans=0.0 2023-06-19 18:08:37,046 INFO [train.py:996] (3/4) Epoch 3, batch 4200, loss[loss=0.2476, simple_loss=0.3245, pruned_loss=0.08537, over 21684.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3241, pruned_loss=0.09263, over 4276553.45 frames. ], batch size: 247, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:09:46,480 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-19 18:10:18,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=391314.0, ans=0.0 2023-06-19 18:10:59,596 INFO [train.py:996] (3/4) Epoch 3, batch 4250, loss[loss=0.2533, simple_loss=0.3252, pruned_loss=0.09074, over 21520.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.332, pruned_loss=0.09575, over 4280939.80 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:12:15,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 3.056e+02 3.919e+02 5.866e+02 1.121e+03, threshold=7.838e+02, percent-clipped=25.0 2023-06-19 18:12:22,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. 
limit=22.5 2023-06-19 18:13:13,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=391674.0, ans=0.1 2023-06-19 18:13:17,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.13 vs. limit=10.0 2023-06-19 18:13:22,652 INFO [train.py:996] (3/4) Epoch 3, batch 4300, loss[loss=0.2536, simple_loss=0.3233, pruned_loss=0.09194, over 21216.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3367, pruned_loss=0.0977, over 4277117.85 frames. ], batch size: 548, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:13:27,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=391734.0, ans=0.0 2023-06-19 18:14:16,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=391854.0, ans=0.0 2023-06-19 18:15:12,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391914.0, ans=0.1 2023-06-19 18:15:38,830 INFO [train.py:996] (3/4) Epoch 3, batch 4350, loss[loss=0.2466, simple_loss=0.3121, pruned_loss=0.09057, over 21773.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3334, pruned_loss=0.09631, over 4272699.53 frames. ], batch size: 351, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:15:40,783 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:16:46,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=392154.0, ans=0.035 2023-06-19 18:16:50,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=392154.0, ans=0.0 2023-06-19 18:16:56,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.727e+02 3.088e+02 4.174e+02 7.759e+02, threshold=6.176e+02, percent-clipped=0.0 2023-06-19 18:17:06,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=392214.0, ans=0.0 2023-06-19 18:17:11,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.89 vs. limit=12.0 2023-06-19 18:17:39,772 INFO [train.py:996] (3/4) Epoch 3, batch 4400, loss[loss=0.266, simple_loss=0.3505, pruned_loss=0.09076, over 21619.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3307, pruned_loss=0.09638, over 4273976.22 frames. 
], batch size: 389, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:18:01,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=392334.0, ans=0.125 2023-06-19 18:18:01,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=392334.0, ans=0.125 2023-06-19 18:18:25,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=392394.0, ans=0.125 2023-06-19 18:18:27,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392394.0, ans=0.1 2023-06-19 18:18:38,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=392454.0, ans=0.125 2023-06-19 18:19:48,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=392574.0, ans=0.125 2023-06-19 18:20:02,749 INFO [train.py:996] (3/4) Epoch 3, batch 4450, loss[loss=0.2699, simple_loss=0.3443, pruned_loss=0.09778, over 21740.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3372, pruned_loss=0.0975, over 4276861.24 frames. ], batch size: 247, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:21:24,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 2.794e+02 3.143e+02 3.836e+02 6.823e+02, threshold=6.286e+02, percent-clipped=3.0 2023-06-19 18:22:00,537 INFO [train.py:996] (3/4) Epoch 3, batch 4500, loss[loss=0.2959, simple_loss=0.3731, pruned_loss=0.1093, over 21748.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3383, pruned_loss=0.09945, over 4283503.40 frames. ], batch size: 389, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:23:12,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=393054.0, ans=0.125 2023-06-19 18:23:40,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-19 18:23:56,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393114.0, ans=0.1 2023-06-19 18:23:56,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-19 18:24:31,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. limit=6.0 2023-06-19 18:24:34,826 INFO [train.py:996] (3/4) Epoch 3, batch 4550, loss[loss=0.2793, simple_loss=0.3525, pruned_loss=0.1031, over 21572.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.344, pruned_loss=0.1007, over 4280210.30 frames. ], batch size: 230, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:24:41,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=393234.0, ans=0.0 2023-06-19 18:24:42,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=393234.0, ans=0.125 2023-06-19 18:24:50,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.10 vs. 
limit=22.5 2023-06-19 18:25:13,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393294.0, ans=0.1 2023-06-19 18:25:53,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.924e+02 3.565e+02 4.370e+02 6.839e+02, threshold=7.130e+02, percent-clipped=5.0 2023-06-19 18:25:56,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=393414.0, ans=0.09899494936611666 2023-06-19 18:25:58,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=393414.0, ans=0.5 2023-06-19 18:26:03,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=393414.0, ans=15.0 2023-06-19 18:26:17,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=393414.0, ans=0.125 2023-06-19 18:26:57,125 INFO [train.py:996] (3/4) Epoch 3, batch 4600, loss[loss=0.2468, simple_loss=0.3153, pruned_loss=0.08908, over 21357.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3458, pruned_loss=0.1022, over 4276313.34 frames. ], batch size: 143, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:27:14,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=393534.0, ans=0.0 2023-06-19 18:27:49,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=393654.0, ans=0.125 2023-06-19 18:29:14,447 INFO [train.py:996] (3/4) Epoch 3, batch 4650, loss[loss=0.1804, simple_loss=0.2567, pruned_loss=0.05208, over 21754.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3371, pruned_loss=0.09858, over 4285178.73 frames. ], batch size: 298, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:29:17,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=393834.0, ans=0.125 2023-06-19 18:30:21,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.410e+02 2.813e+02 3.483e+02 6.632e+02, threshold=5.627e+02, percent-clipped=0.0 2023-06-19 18:30:33,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=394014.0, ans=0.05 2023-06-19 18:31:17,313 INFO [train.py:996] (3/4) Epoch 3, batch 4700, loss[loss=0.2164, simple_loss=0.2772, pruned_loss=0.07782, over 21551.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3282, pruned_loss=0.09654, over 4277199.44 frames. ], batch size: 263, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:31:33,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. 
limit=15.0 2023-06-19 18:32:04,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=394194.0, ans=0.125 2023-06-19 18:32:05,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=394194.0, ans=0.04949747468305833 2023-06-19 18:32:20,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=394254.0, ans=0.0 2023-06-19 18:33:07,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-19 18:33:35,280 INFO [train.py:996] (3/4) Epoch 3, batch 4750, loss[loss=0.3142, simple_loss=0.354, pruned_loss=0.1372, over 21623.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3234, pruned_loss=0.097, over 4280461.94 frames. ], batch size: 473, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:33:56,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=394434.0, ans=0.0 2023-06-19 18:34:40,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.729e+02 3.140e+02 4.884e+02 6.960e+02, threshold=6.280e+02, percent-clipped=14.0 2023-06-19 18:34:42,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=394614.0, ans=0.125 2023-06-19 18:35:53,971 INFO [train.py:996] (3/4) Epoch 3, batch 4800, loss[loss=0.2464, simple_loss=0.3285, pruned_loss=0.08216, over 21406.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3253, pruned_loss=0.09793, over 4285133.14 frames. ], batch size: 176, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:36:10,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=22.5 2023-06-19 18:36:34,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=394854.0, ans=0.125 2023-06-19 18:37:56,650 INFO [train.py:996] (3/4) Epoch 3, batch 4850, loss[loss=0.3173, simple_loss=0.3649, pruned_loss=0.1349, over 21531.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3248, pruned_loss=0.0975, over 4293050.76 frames. ], batch size: 471, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:38:21,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=395094.0, ans=0.125 2023-06-19 18:38:46,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=395154.0, ans=0.0 2023-06-19 18:39:00,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 2.909e+02 3.496e+02 4.405e+02 5.702e+02, threshold=6.991e+02, percent-clipped=0.0 2023-06-19 18:40:00,488 INFO [train.py:996] (3/4) Epoch 3, batch 4900, loss[loss=0.2655, simple_loss=0.3546, pruned_loss=0.08817, over 21740.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3258, pruned_loss=0.09826, over 4294074.93 frames. 
], batch size: 247, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:41:07,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=395454.0, ans=0.125 2023-06-19 18:42:07,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=395574.0, ans=0.0 2023-06-19 18:42:21,347 INFO [train.py:996] (3/4) Epoch 3, batch 4950, loss[loss=0.2798, simple_loss=0.3877, pruned_loss=0.08592, over 20744.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3306, pruned_loss=0.09747, over 4280428.40 frames. ], batch size: 608, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:42:45,275 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:42:51,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=395694.0, ans=0.2 2023-06-19 18:43:20,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-19 18:43:36,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.438e+02 2.884e+02 3.463e+02 6.412e+02, threshold=5.767e+02, percent-clipped=0.0 2023-06-19 18:43:50,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=395814.0, ans=0.125 2023-06-19 18:44:30,720 INFO [train.py:996] (3/4) Epoch 3, batch 5000, loss[loss=0.3162, simple_loss=0.3638, pruned_loss=0.1343, over 21633.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.329, pruned_loss=0.09391, over 4284753.33 frames. ], batch size: 471, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:44:57,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.49 vs. limit=22.5 2023-06-19 18:45:02,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=395994.0, ans=0.125 2023-06-19 18:45:18,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=396054.0, ans=0.125 2023-06-19 18:46:02,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=396114.0, ans=0.025 2023-06-19 18:46:29,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=396174.0, ans=0.1 2023-06-19 18:46:39,107 INFO [train.py:996] (3/4) Epoch 3, batch 5050, loss[loss=0.2793, simple_loss=0.3374, pruned_loss=0.1106, over 21849.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3306, pruned_loss=0.09622, over 4288933.45 frames. 
], batch size: 391, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:47:13,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=396294.0, ans=0.125 2023-06-19 18:47:46,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.997e+02 3.476e+02 4.224e+02 8.088e+02, threshold=6.952e+02, percent-clipped=5.0 2023-06-19 18:48:28,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=396474.0, ans=0.125 2023-06-19 18:48:32,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=396474.0, ans=0.125 2023-06-19 18:49:06,032 INFO [train.py:996] (3/4) Epoch 3, batch 5100, loss[loss=0.2359, simple_loss=0.2985, pruned_loss=0.08662, over 21930.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3294, pruned_loss=0.09629, over 4284568.89 frames. ], batch size: 316, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:49:21,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-19 18:49:22,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=396534.0, ans=10.0 2023-06-19 18:49:41,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=396594.0, ans=0.2 2023-06-19 18:49:49,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=396654.0, ans=0.2 2023-06-19 18:50:02,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=396714.0, ans=0.04949747468305833 2023-06-19 18:51:09,500 INFO [train.py:996] (3/4) Epoch 3, batch 5150, loss[loss=0.2463, simple_loss=0.3052, pruned_loss=0.09374, over 21367.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3276, pruned_loss=0.09669, over 4282958.35 frames. ], batch size: 176, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:51:09,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=396834.0, ans=0.09899494936611666 2023-06-19 18:51:21,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-19 18:52:25,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=396954.0, ans=0.125 2023-06-19 18:52:27,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.779e+02 3.148e+02 3.948e+02 7.558e+02, threshold=6.295e+02, percent-clipped=3.0 2023-06-19 18:52:38,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=397014.0, ans=0.04949747468305833 2023-06-19 18:52:38,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397014.0, ans=0.1 2023-06-19 18:53:36,797 INFO [train.py:996] (3/4) Epoch 3, batch 5200, loss[loss=0.2475, simple_loss=0.3329, pruned_loss=0.08109, over 21403.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.326, pruned_loss=0.0959, over 4278262.30 frames. 
], batch size: 211, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:53:37,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=397134.0, ans=0.0 2023-06-19 18:54:41,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=397254.0, ans=0.125 2023-06-19 18:55:43,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=22.5 2023-06-19 18:55:53,384 INFO [train.py:996] (3/4) Epoch 3, batch 5250, loss[loss=0.2966, simple_loss=0.4233, pruned_loss=0.08492, over 19640.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3307, pruned_loss=0.09475, over 4280825.13 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:56:07,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=397494.0, ans=0.2 2023-06-19 18:56:17,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=397494.0, ans=0.125 2023-06-19 18:56:21,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397554.0, ans=0.1 2023-06-19 18:56:53,725 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.637e+02 3.155e+02 4.003e+02 7.471e+02, threshold=6.309e+02, percent-clipped=2.0 2023-06-19 18:57:38,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=397674.0, ans=0.0 2023-06-19 18:57:49,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=397674.0, ans=0.125 2023-06-19 18:57:49,909 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.91 vs. limit=10.0 2023-06-19 18:57:56,747 INFO [train.py:996] (3/4) Epoch 3, batch 5300, loss[loss=0.2575, simple_loss=0.3118, pruned_loss=0.1016, over 21468.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3311, pruned_loss=0.09614, over 4289282.19 frames. ], batch size: 194, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 18:57:57,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=397734.0, ans=0.0 2023-06-19 18:58:09,719 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:58:25,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=397794.0, ans=0.2 2023-06-19 18:59:12,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=397914.0, ans=0.125 2023-06-19 18:59:45,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=397974.0, ans=0.2 2023-06-19 18:59:57,888 INFO [train.py:996] (3/4) Epoch 3, batch 5350, loss[loss=0.2583, simple_loss=0.3202, pruned_loss=0.09822, over 21990.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3316, pruned_loss=0.09796, over 4290328.46 frames. 
], batch size: 113, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:01:03,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=398154.0, ans=0.125 2023-06-19 19:01:15,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.703e+02 3.245e+02 4.023e+02 6.387e+02, threshold=6.490e+02, percent-clipped=1.0 2023-06-19 19:01:23,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=398214.0, ans=0.125 2023-06-19 19:02:04,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=398274.0, ans=0.125 2023-06-19 19:02:24,845 INFO [train.py:996] (3/4) Epoch 3, batch 5400, loss[loss=0.2895, simple_loss=0.3449, pruned_loss=0.1171, over 20003.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3307, pruned_loss=0.09875, over 4294425.87 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:03:17,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.00 vs. limit=10.0 2023-06-19 19:03:18,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-19 19:04:00,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=398514.0, ans=0.125 2023-06-19 19:04:42,267 INFO [train.py:996] (3/4) Epoch 3, batch 5450, loss[loss=0.271, simple_loss=0.3735, pruned_loss=0.0842, over 21693.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3306, pruned_loss=0.09645, over 4291072.74 frames. ], batch size: 247, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:04:57,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=398694.0, ans=0.0 2023-06-19 19:05:14,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=398694.0, ans=0.1 2023-06-19 19:05:31,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=398754.0, ans=0.2 2023-06-19 19:06:03,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 2.356e+02 2.919e+02 3.477e+02 6.016e+02, threshold=5.839e+02, percent-clipped=0.0 2023-06-19 19:06:19,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=398814.0, ans=0.0 2023-06-19 19:06:23,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-19 19:06:54,314 INFO [train.py:996] (3/4) Epoch 3, batch 5500, loss[loss=0.3034, simple_loss=0.3859, pruned_loss=0.1104, over 21630.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3349, pruned_loss=0.09349, over 4286146.25 frames. 
], batch size: 441, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:08:13,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=399054.0, ans=0.0 2023-06-19 19:08:26,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=399114.0, ans=0.125 2023-06-19 19:09:12,510 INFO [train.py:996] (3/4) Epoch 3, batch 5550, loss[loss=0.2262, simple_loss=0.3223, pruned_loss=0.06507, over 21574.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3337, pruned_loss=0.09072, over 4277397.82 frames. ], batch size: 441, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:09:23,508 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:09:23,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=399234.0, ans=0.125 2023-06-19 19:09:26,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=399234.0, ans=0.0 2023-06-19 19:09:51,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=399294.0, ans=0.95 2023-06-19 19:10:34,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=399354.0, ans=10.0 2023-06-19 19:10:42,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 2.351e+02 2.803e+02 3.299e+02 6.466e+02, threshold=5.606e+02, percent-clipped=1.0 2023-06-19 19:11:52,839 INFO [train.py:996] (3/4) Epoch 3, batch 5600, loss[loss=0.3398, simple_loss=0.4474, pruned_loss=0.1161, over 19789.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3314, pruned_loss=0.08786, over 4274946.41 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 19:12:39,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=399594.0, ans=0.125 2023-06-19 19:13:06,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=399654.0, ans=0.0 2023-06-19 19:13:15,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=399714.0, ans=0.2 2023-06-19 19:14:09,699 INFO [train.py:996] (3/4) Epoch 3, batch 5650, loss[loss=0.2879, simple_loss=0.351, pruned_loss=0.1124, over 20164.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3359, pruned_loss=0.09101, over 4283889.31 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 19:14:44,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=399894.0, ans=0.125 2023-06-19 19:14:45,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=399894.0, ans=0.0 2023-06-19 19:15:28,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.620e+02 2.997e+02 3.705e+02 7.555e+02, threshold=5.994e+02, percent-clipped=4.0 2023-06-19 19:16:35,377 INFO [train.py:996] (3/4) Epoch 3, batch 5700, loss[loss=0.2343, simple_loss=0.3063, pruned_loss=0.08119, over 21655.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3353, pruned_loss=0.09349, over 4291226.99 frames. 
], batch size: 263, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 19:17:08,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-19 19:17:30,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=400254.0, ans=0.2 2023-06-19 19:17:32,930 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-19 19:18:34,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=400374.0, ans=0.0 2023-06-19 19:18:47,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=400374.0, ans=0.125 2023-06-19 19:18:58,569 INFO [train.py:996] (3/4) Epoch 3, batch 5750, loss[loss=0.2164, simple_loss=0.3065, pruned_loss=0.06314, over 21747.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3321, pruned_loss=0.09066, over 4288120.00 frames. ], batch size: 298, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:19:40,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-19 19:20:11,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.423e+02 2.924e+02 3.462e+02 7.613e+02, threshold=5.849e+02, percent-clipped=6.0 2023-06-19 19:20:56,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=400674.0, ans=0.0 2023-06-19 19:21:04,747 INFO [train.py:996] (3/4) Epoch 3, batch 5800, loss[loss=0.2539, simple_loss=0.3387, pruned_loss=0.08453, over 21659.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3292, pruned_loss=0.08893, over 4283104.01 frames. ], batch size: 263, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:21:17,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-19 19:21:57,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400794.0, ans=0.1 2023-06-19 19:22:24,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=400854.0, ans=0.0 2023-06-19 19:22:54,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=400914.0, ans=0.125 2023-06-19 19:22:55,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=400914.0, ans=0.2 2023-06-19 19:23:13,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=400974.0, ans=0.0 2023-06-19 19:23:32,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=401034.0, ans=0.2 2023-06-19 19:23:33,787 INFO [train.py:996] (3/4) Epoch 3, batch 5850, loss[loss=0.201, simple_loss=0.3057, pruned_loss=0.04815, over 21591.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3263, pruned_loss=0.08391, over 4282807.54 frames. 
], batch size: 263, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:24:25,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=401094.0, ans=0.0 2023-06-19 19:24:32,905 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2023-06-19 19:24:44,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=401154.0, ans=0.125 2023-06-19 19:25:07,395 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.976e+02 2.345e+02 2.902e+02 5.016e+02, threshold=4.690e+02, percent-clipped=0.0 2023-06-19 19:25:21,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=401274.0, ans=0.2 2023-06-19 19:25:47,463 INFO [train.py:996] (3/4) Epoch 3, batch 5900, loss[loss=0.2879, simple_loss=0.3885, pruned_loss=0.0937, over 20774.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3177, pruned_loss=0.07768, over 4272916.32 frames. ], batch size: 607, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:25:56,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. limit=10.0 2023-06-19 19:26:25,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=401394.0, ans=0.125 2023-06-19 19:27:03,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=401454.0, ans=0.2 2023-06-19 19:27:04,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=401454.0, ans=0.125 2023-06-19 19:27:19,857 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-19 19:27:22,785 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-19 19:28:00,302 INFO [train.py:996] (3/4) Epoch 3, batch 5950, loss[loss=0.278, simple_loss=0.3248, pruned_loss=0.1156, over 21638.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3178, pruned_loss=0.08246, over 4278698.67 frames. ], batch size: 414, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:28:00,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=401634.0, ans=0.125 2023-06-19 19:29:02,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. 
limit=22.5 2023-06-19 19:29:06,743 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 2.649e+02 3.303e+02 4.502e+02 8.568e+02, threshold=6.607e+02, percent-clipped=21.0 2023-06-19 19:29:33,192 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:29:50,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=401874.0, ans=0.2 2023-06-19 19:29:50,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=401874.0, ans=0.0 2023-06-19 19:29:54,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=401934.0, ans=0.09899494936611666 2023-06-19 19:30:02,814 INFO [train.py:996] (3/4) Epoch 3, batch 6000, loss[loss=0.2322, simple_loss=0.2862, pruned_loss=0.08911, over 15041.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3137, pruned_loss=0.08587, over 4271020.34 frames. ], batch size: 60, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:30:02,815 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 19:30:49,408 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.5116, 2.2888, 3.9179, 2.0332], device='cuda:3') 2023-06-19 19:30:50,353 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.5609, 3.2108, 3.0515, 3.1490], device='cuda:3') 2023-06-19 19:30:52,517 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2725, simple_loss=0.3668, pruned_loss=0.0891, over 1796401.00 frames. 2023-06-19 19:30:52,518 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-19 19:30:54,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=401934.0, ans=0.125 2023-06-19 19:30:54,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=401934.0, ans=0.125 2023-06-19 19:30:59,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.66 vs. limit=15.0 2023-06-19 19:31:29,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=401994.0, ans=0.04949747468305833 2023-06-19 19:31:52,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.58 vs. limit=22.5 2023-06-19 19:32:05,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=402114.0, ans=0.125 2023-06-19 19:32:07,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-06-19 19:32:12,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=402174.0, ans=0.125 2023-06-19 19:32:45,297 INFO [train.py:996] (3/4) Epoch 3, batch 6050, loss[loss=0.2686, simple_loss=0.3208, pruned_loss=0.1082, over 21360.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3093, pruned_loss=0.08721, over 4272915.33 frames. 
], batch size: 507, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:33:21,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=402294.0, ans=0.125 2023-06-19 19:33:28,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=402294.0, ans=0.125 2023-06-19 19:34:07,189 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.314e+02 2.772e+02 3.160e+02 4.254e+02, threshold=5.544e+02, percent-clipped=0.0 2023-06-19 19:34:51,207 INFO [train.py:996] (3/4) Epoch 3, batch 6100, loss[loss=0.2281, simple_loss=0.2963, pruned_loss=0.07995, over 21706.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3077, pruned_loss=0.086, over 4278051.72 frames. ], batch size: 230, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:35:31,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=402594.0, ans=0.0 2023-06-19 19:35:40,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=12.0 2023-06-19 19:35:43,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=402654.0, ans=0.125 2023-06-19 19:37:00,838 INFO [train.py:996] (3/4) Epoch 3, batch 6150, loss[loss=0.2245, simple_loss=0.2972, pruned_loss=0.07592, over 21504.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3125, pruned_loss=0.08836, over 4277963.77 frames. ], batch size: 212, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:37:36,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0 2023-06-19 19:37:58,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402954.0, ans=0.1 2023-06-19 19:37:58,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=402954.0, ans=0.05 2023-06-19 19:38:08,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=402954.0, ans=0.125 2023-06-19 19:38:11,183 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.614e+02 3.017e+02 3.568e+02 5.916e+02, threshold=6.034e+02, percent-clipped=1.0 2023-06-19 19:38:52,131 INFO [train.py:996] (3/4) Epoch 3, batch 6200, loss[loss=0.2667, simple_loss=0.3371, pruned_loss=0.09812, over 21441.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3173, pruned_loss=0.08915, over 4287197.06 frames. ], batch size: 548, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:38:58,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=403134.0, ans=0.125 2023-06-19 19:38:58,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.36 vs. 
limit=15.0 2023-06-19 19:39:38,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=403194.0, ans=0.0 2023-06-19 19:40:12,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=403254.0, ans=0.125 2023-06-19 19:41:13,512 INFO [train.py:996] (3/4) Epoch 3, batch 6250, loss[loss=0.2598, simple_loss=0.3503, pruned_loss=0.08462, over 21791.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3215, pruned_loss=0.08896, over 4279893.75 frames. ], batch size: 332, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:42:08,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=403554.0, ans=0.0 2023-06-19 19:42:24,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=403554.0, ans=0.2 2023-06-19 19:42:32,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.931e+02 3.643e+02 4.697e+02 7.748e+02, threshold=7.286e+02, percent-clipped=9.0 2023-06-19 19:42:52,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=403614.0, ans=0.125 2023-06-19 19:43:33,700 INFO [train.py:996] (3/4) Epoch 3, batch 6300, loss[loss=0.2932, simple_loss=0.347, pruned_loss=0.1197, over 21878.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3241, pruned_loss=0.08792, over 4274343.28 frames. ], batch size: 414, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:44:03,291 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:44:26,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=403854.0, ans=0.0 2023-06-19 19:45:37,159 INFO [train.py:996] (3/4) Epoch 3, batch 6350, loss[loss=0.3172, simple_loss=0.3765, pruned_loss=0.129, over 21527.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3283, pruned_loss=0.09236, over 4280768.48 frames. ], batch size: 131, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:45:40,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=404034.0, ans=0.2 2023-06-19 19:46:24,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=404154.0, ans=0.125 2023-06-19 19:46:50,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.707e+02 3.149e+02 3.828e+02 5.758e+02, threshold=6.298e+02, percent-clipped=0.0 2023-06-19 19:47:44,507 INFO [train.py:996] (3/4) Epoch 3, batch 6400, loss[loss=0.2962, simple_loss=0.3602, pruned_loss=0.1161, over 21677.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3349, pruned_loss=0.09682, over 4282355.26 frames. ], batch size: 351, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:49:10,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=404514.0, ans=0.125 2023-06-19 19:49:24,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=404574.0, ans=0.125 2023-06-19 19:49:25,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.57 vs. 
limit=22.5 2023-06-19 19:49:39,741 INFO [train.py:996] (3/4) Epoch 3, batch 6450, loss[loss=0.2257, simple_loss=0.3043, pruned_loss=0.07352, over 21696.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3363, pruned_loss=0.0952, over 4284890.94 frames. ], batch size: 282, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:49:49,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=404634.0, ans=0.2 2023-06-19 19:50:00,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=404694.0, ans=0.04949747468305833 2023-06-19 19:50:42,064 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.507e+02 2.854e+02 3.745e+02 6.123e+02, threshold=5.708e+02, percent-clipped=0.0 2023-06-19 19:50:42,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=404814.0, ans=0.0 2023-06-19 19:51:36,617 INFO [train.py:996] (3/4) Epoch 3, batch 6500, loss[loss=0.2288, simple_loss=0.3125, pruned_loss=0.07261, over 21739.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3291, pruned_loss=0.09377, over 4281639.65 frames. ], batch size: 282, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:51:52,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-19 19:52:40,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-19 19:52:41,632 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-19 19:53:27,023 INFO [train.py:996] (3/4) Epoch 3, batch 6550, loss[loss=0.2893, simple_loss=0.3402, pruned_loss=0.1192, over 21649.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.328, pruned_loss=0.09322, over 4278696.44 frames. ], batch size: 441, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:54:08,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405354.0, ans=0.1 2023-06-19 19:54:42,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.670e+02 3.153e+02 4.224e+02 7.538e+02, threshold=6.306e+02, percent-clipped=8.0 2023-06-19 19:54:54,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. limit=10.0 2023-06-19 19:55:22,193 INFO [train.py:996] (3/4) Epoch 3, batch 6600, loss[loss=0.2222, simple_loss=0.2843, pruned_loss=0.08005, over 21661.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3215, pruned_loss=0.09257, over 4270214.68 frames. ], batch size: 282, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:56:34,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.95 vs. 
limit=10.0 2023-06-19 19:56:35,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=405654.0, ans=0.125 2023-06-19 19:56:35,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=405654.0, ans=0.125 2023-06-19 19:57:01,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=405774.0, ans=0.07 2023-06-19 19:57:13,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=405774.0, ans=0.0 2023-06-19 19:57:18,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=405774.0, ans=0.0 2023-06-19 19:57:21,383 INFO [train.py:996] (3/4) Epoch 3, batch 6650, loss[loss=0.2252, simple_loss=0.2879, pruned_loss=0.08123, over 21695.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3131, pruned_loss=0.09011, over 4272611.45 frames. ], batch size: 282, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:58:17,413 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.325e+02 2.612e+02 3.036e+02 4.161e+02, threshold=5.224e+02, percent-clipped=0.0 2023-06-19 19:58:57,727 INFO [train.py:996] (3/4) Epoch 3, batch 6700, loss[loss=0.2077, simple_loss=0.2688, pruned_loss=0.07329, over 21796.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3078, pruned_loss=0.08997, over 4281298.14 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:59:15,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=406194.0, ans=0.0 2023-06-19 20:00:44,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=406374.0, ans=0.0 2023-06-19 20:00:49,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=406374.0, ans=0.125 2023-06-19 20:00:51,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=406374.0, ans=0.125 2023-06-19 20:00:58,757 INFO [train.py:996] (3/4) Epoch 3, batch 6750, loss[loss=0.2567, simple_loss=0.3215, pruned_loss=0.09593, over 21424.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3061, pruned_loss=0.08996, over 4277846.61 frames. ], batch size: 194, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:01:12,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=406494.0, ans=0.0 2023-06-19 20:01:18,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=406494.0, ans=0.125 2023-06-19 20:01:19,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=406494.0, ans=0.07 2023-06-19 20:01:20,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.36 vs. 
limit=22.5 2023-06-19 20:02:00,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.669e+02 3.031e+02 3.481e+02 8.147e+02, threshold=6.062e+02, percent-clipped=3.0 2023-06-19 20:02:34,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=406674.0, ans=0.125 2023-06-19 20:02:34,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=406674.0, ans=0.125 2023-06-19 20:02:40,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-19 20:02:41,227 INFO [train.py:996] (3/4) Epoch 3, batch 6800, loss[loss=0.2529, simple_loss=0.3142, pruned_loss=0.09578, over 21612.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3092, pruned_loss=0.09307, over 4286175.19 frames. ], batch size: 389, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:02:47,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=406734.0, ans=0.125 2023-06-19 20:04:35,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-19 20:04:39,330 INFO [train.py:996] (3/4) Epoch 3, batch 6850, loss[loss=0.2425, simple_loss=0.3044, pruned_loss=0.09033, over 21847.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3082, pruned_loss=0.09428, over 4292717.83 frames. ], batch size: 98, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:04:46,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-19 20:04:49,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407034.0, ans=0.1 2023-06-19 20:05:33,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-19 20:05:53,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.761e+02 3.161e+02 3.726e+02 8.116e+02, threshold=6.323e+02, percent-clipped=2.0 2023-06-19 20:05:57,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=407214.0, ans=0.125 2023-06-19 20:06:11,181 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:06:36,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=407274.0, ans=0.125 2023-06-19 20:06:47,900 INFO [train.py:996] (3/4) Epoch 3, batch 6900, loss[loss=0.2283, simple_loss=0.323, pruned_loss=0.06683, over 21841.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3112, pruned_loss=0.09385, over 4284795.29 frames. 
], batch size: 371, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:07:29,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=407394.0, ans=0.125 2023-06-19 20:08:32,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=407574.0, ans=0.2 2023-06-19 20:08:35,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=407574.0, ans=15.0 2023-06-19 20:08:57,084 INFO [train.py:996] (3/4) Epoch 3, batch 6950, loss[loss=0.2735, simple_loss=0.353, pruned_loss=0.09701, over 21446.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3123, pruned_loss=0.09019, over 4285471.38 frames. ], batch size: 131, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:08:59,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.88 vs. limit=10.0 2023-06-19 20:09:45,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407694.0, ans=0.1 2023-06-19 20:10:12,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.540e+02 2.981e+02 3.670e+02 6.199e+02, threshold=5.963e+02, percent-clipped=0.0 2023-06-19 20:10:27,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=407814.0, ans=0.125 2023-06-19 20:10:33,459 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:10:58,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-19 20:10:59,036 INFO [train.py:996] (3/4) Epoch 3, batch 7000, loss[loss=0.2324, simple_loss=0.2906, pruned_loss=0.08706, over 21329.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3165, pruned_loss=0.09387, over 4286489.42 frames. ], batch size: 194, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:11:16,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=407934.0, ans=0.125 2023-06-19 20:11:38,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=407994.0, ans=0.125 2023-06-19 20:12:02,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=408054.0, ans=0.0 2023-06-19 20:12:09,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=408054.0, ans=0.1 2023-06-19 20:12:13,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0 2023-06-19 20:12:25,842 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:12:40,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408174.0, ans=0.1 2023-06-19 20:12:47,634 INFO [train.py:996] (3/4) Epoch 3, batch 7050, loss[loss=0.2536, simple_loss=0.3168, pruned_loss=0.09517, over 21478.00 frames. 
], tot_loss[loss=0.2509, simple_loss=0.3154, pruned_loss=0.09324, over 4284583.35 frames. ], batch size: 194, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:12:49,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=408234.0, ans=0.05 2023-06-19 20:13:15,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=408234.0, ans=0.125 2023-06-19 20:14:11,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=408354.0, ans=0.125 2023-06-19 20:14:15,669 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.619e+02 2.979e+02 3.619e+02 9.670e+02, threshold=5.957e+02, percent-clipped=3.0 2023-06-19 20:14:16,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=408414.0, ans=0.025 2023-06-19 20:14:16,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-19 20:14:24,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-19 20:14:26,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=408414.0, ans=0.125 2023-06-19 20:14:52,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=408474.0, ans=0.125 2023-06-19 20:14:59,111 INFO [train.py:996] (3/4) Epoch 3, batch 7100, loss[loss=0.3052, simple_loss=0.3605, pruned_loss=0.125, over 21426.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3187, pruned_loss=0.09465, over 4286206.93 frames. ], batch size: 507, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:15:08,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-19 20:16:49,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=408774.0, ans=0.125 2023-06-19 20:16:59,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-19 20:17:00,975 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.0 2023-06-19 20:17:16,297 INFO [train.py:996] (3/4) Epoch 3, batch 7150, loss[loss=0.3613, simple_loss=0.3986, pruned_loss=0.162, over 21455.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3151, pruned_loss=0.091, over 4278972.09 frames. 
], batch size: 510, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:17:31,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=408834.0, ans=0.125 2023-06-19 20:17:43,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=408894.0, ans=0.07 2023-06-19 20:18:32,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.361e+02 2.956e+02 3.378e+02 5.911e+02, threshold=5.912e+02, percent-clipped=0.0 2023-06-19 20:18:50,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=409074.0, ans=0.2 2023-06-19 20:19:18,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=409074.0, ans=0.1 2023-06-19 20:19:21,388 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-19 20:19:23,242 INFO [train.py:996] (3/4) Epoch 3, batch 7200, loss[loss=0.2206, simple_loss=0.2795, pruned_loss=0.08083, over 21377.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3188, pruned_loss=0.09424, over 4283242.28 frames. ], batch size: 211, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:19:29,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=409134.0, ans=0.0 2023-06-19 20:19:42,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=409194.0, ans=0.0 2023-06-19 20:20:05,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=409254.0, ans=0.5 2023-06-19 20:20:05,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=409254.0, ans=0.125 2023-06-19 20:20:15,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=409254.0, ans=0.04949747468305833 2023-06-19 20:20:15,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=409254.0, ans=0.125 2023-06-19 20:20:26,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=409314.0, ans=0.0 2023-06-19 20:20:54,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=409374.0, ans=0.125 2023-06-19 20:21:31,195 INFO [train.py:996] (3/4) Epoch 3, batch 7250, loss[loss=0.2317, simple_loss=0.2872, pruned_loss=0.08815, over 21768.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3159, pruned_loss=0.09422, over 4270085.47 frames. 
], batch size: 352, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:22:03,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=409494.0, ans=0.0 2023-06-19 20:22:43,331 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.618e+02 2.951e+02 3.716e+02 8.405e+02, threshold=5.903e+02, percent-clipped=1.0 2023-06-19 20:23:12,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=409674.0, ans=0.125 2023-06-19 20:23:13,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=409674.0, ans=0.0 2023-06-19 20:23:16,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=409674.0, ans=0.025 2023-06-19 20:23:19,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=409674.0, ans=0.125 2023-06-19 20:23:23,304 INFO [train.py:996] (3/4) Epoch 3, batch 7300, loss[loss=0.2145, simple_loss=0.2703, pruned_loss=0.07939, over 21207.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3088, pruned_loss=0.0928, over 4267830.64 frames. ], batch size: 159, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:23:24,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2023-06-19 20:23:40,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=409734.0, ans=0.125 2023-06-19 20:24:35,272 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.59 vs. limit=22.5 2023-06-19 20:24:54,229 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:25:26,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=410034.0, ans=0.0 2023-06-19 20:25:27,393 INFO [train.py:996] (3/4) Epoch 3, batch 7350, loss[loss=0.3225, simple_loss=0.3865, pruned_loss=0.1293, over 21802.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3054, pruned_loss=0.09248, over 4271149.92 frames. ], batch size: 124, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:25:32,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=410034.0, ans=0.1 2023-06-19 20:26:11,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=410154.0, ans=0.04949747468305833 2023-06-19 20:26:46,544 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.697e+02 3.166e+02 3.579e+02 5.616e+02, threshold=6.332e+02, percent-clipped=0.0 2023-06-19 20:27:34,135 INFO [train.py:996] (3/4) Epoch 3, batch 7400, loss[loss=0.2412, simple_loss=0.3183, pruned_loss=0.08208, over 21688.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3116, pruned_loss=0.09498, over 4268850.57 frames. ], batch size: 247, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:27:42,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.43 vs. 
limit=15.0 2023-06-19 20:28:45,870 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.28 vs. limit=15.0 2023-06-19 20:29:18,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=410574.0, ans=0.125 2023-06-19 20:29:32,618 INFO [train.py:996] (3/4) Epoch 3, batch 7450, loss[loss=0.2531, simple_loss=0.3072, pruned_loss=0.09951, over 21589.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3116, pruned_loss=0.09461, over 4260455.88 frames. ], batch size: 247, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:29:32,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=410634.0, ans=0.2 2023-06-19 20:29:41,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=410634.0, ans=0.125 2023-06-19 20:29:44,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=410634.0, ans=0.0 2023-06-19 20:29:54,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=410634.0, ans=0.0 2023-06-19 20:30:55,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.751e+02 3.404e+02 4.260e+02 8.554e+02, threshold=6.809e+02, percent-clipped=4.0 2023-06-19 20:31:51,009 INFO [train.py:996] (3/4) Epoch 3, batch 7500, loss[loss=0.2915, simple_loss=0.3851, pruned_loss=0.09892, over 21786.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3173, pruned_loss=0.0957, over 4270349.86 frames. ], batch size: 282, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:32:13,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=410994.0, ans=0.1 2023-06-19 20:33:42,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=411174.0, ans=0.125 2023-06-19 20:33:53,284 INFO [train.py:996] (3/4) Epoch 3, batch 7550, loss[loss=0.3293, simple_loss=0.3987, pruned_loss=0.1299, over 21469.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.325, pruned_loss=0.09457, over 4269244.62 frames. ], batch size: 507, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:33:56,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=411234.0, ans=0.125 2023-06-19 20:34:28,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=411354.0, ans=0.125 2023-06-19 20:34:54,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.594e+02 3.226e+02 4.168e+02 6.750e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-19 20:34:54,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=411414.0, ans=0.5 2023-06-19 20:35:39,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=411474.0, ans=0.125 2023-06-19 20:35:53,977 INFO [train.py:996] (3/4) Epoch 3, batch 7600, loss[loss=0.2461, simple_loss=0.3106, pruned_loss=0.09083, over 21309.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3243, pruned_loss=0.09352, over 4271495.73 frames. 
], batch size: 159, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:36:02,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5 2023-06-19 20:36:37,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=411654.0, ans=0.125 2023-06-19 20:36:39,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=411654.0, ans=0.125 2023-06-19 20:36:58,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=411714.0, ans=15.0 2023-06-19 20:37:05,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.99 vs. limit=12.0 2023-06-19 20:37:39,156 INFO [train.py:996] (3/4) Epoch 3, batch 7650, loss[loss=0.261, simple_loss=0.3242, pruned_loss=0.09889, over 21844.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3236, pruned_loss=0.09505, over 4279111.68 frames. ], batch size: 124, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:37:52,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=411834.0, ans=0.2 2023-06-19 20:38:48,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=411954.0, ans=0.125 2023-06-19 20:38:54,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.672e+02 3.022e+02 3.544e+02 5.089e+02, threshold=6.045e+02, percent-clipped=0.0 2023-06-19 20:39:48,295 INFO [train.py:996] (3/4) Epoch 3, batch 7700, loss[loss=0.3085, simple_loss=0.366, pruned_loss=0.1255, over 21868.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3279, pruned_loss=0.09919, over 4285016.52 frames. ], batch size: 371, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:40:20,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-19 20:40:45,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412194.0, ans=0.1 2023-06-19 20:40:54,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-19 20:41:47,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.21 vs. limit=15.0 2023-06-19 20:42:14,289 INFO [train.py:996] (3/4) Epoch 3, batch 7750, loss[loss=0.297, simple_loss=0.3587, pruned_loss=0.1177, over 21421.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3345, pruned_loss=0.09959, over 4280234.27 frames. ], batch size: 548, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:42:30,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.27 vs. 
limit=12.0 2023-06-19 20:43:14,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=412554.0, ans=0.125 2023-06-19 20:43:25,544 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 3.175e+02 3.701e+02 4.532e+02 8.848e+02, threshold=7.402e+02, percent-clipped=6.0 2023-06-19 20:43:58,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=412674.0, ans=0.0 2023-06-19 20:44:17,432 INFO [train.py:996] (3/4) Epoch 3, batch 7800, loss[loss=0.2857, simple_loss=0.3509, pruned_loss=0.1102, over 21833.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3335, pruned_loss=0.09898, over 4276871.17 frames. ], batch size: 372, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:44:17,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=412734.0, ans=0.02 2023-06-19 20:44:54,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=412854.0, ans=0.0 2023-06-19 20:45:12,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=412854.0, ans=0.0 2023-06-19 20:45:12,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=412854.0, ans=0.125 2023-06-19 20:45:22,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=412914.0, ans=0.125 2023-06-19 20:45:35,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=412974.0, ans=0.125 2023-06-19 20:45:38,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=412974.0, ans=0.125 2023-06-19 20:45:40,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-19 20:45:46,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-19 20:46:01,773 INFO [train.py:996] (3/4) Epoch 3, batch 7850, loss[loss=0.2956, simple_loss=0.3286, pruned_loss=0.1312, over 21341.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3269, pruned_loss=0.09809, over 4270849.66 frames. 
], batch size: 473, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:46:03,692 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:46:13,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=413034.0, ans=0.07 2023-06-19 20:46:42,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=413154.0, ans=0.0 2023-06-19 20:47:05,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.925e+02 3.666e+02 4.492e+02 8.258e+02, threshold=7.332e+02, percent-clipped=3.0 2023-06-19 20:47:21,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=413274.0, ans=0.0 2023-06-19 20:47:58,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.29 vs. limit=6.0 2023-06-19 20:48:04,381 INFO [train.py:996] (3/4) Epoch 3, batch 7900, loss[loss=0.2699, simple_loss=0.3567, pruned_loss=0.09154, over 21785.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3227, pruned_loss=0.09684, over 4264284.34 frames. ], batch size: 371, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:48:39,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-19 20:49:35,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=413514.0, ans=0.125 2023-06-19 20:50:03,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=413574.0, ans=0.0 2023-06-19 20:50:12,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-06-19 20:50:12,452 INFO [train.py:996] (3/4) Epoch 3, batch 7950, loss[loss=0.2875, simple_loss=0.3956, pruned_loss=0.08968, over 19793.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3285, pruned_loss=0.09634, over 4265353.72 frames. ], batch size: 702, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 20:50:17,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=413634.0, ans=0.0 2023-06-19 20:50:27,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=413634.0, ans=0.0 2023-06-19 20:51:09,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=413754.0, ans=0.125 2023-06-19 20:51:18,355 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.993e+02 3.629e+02 4.291e+02 8.050e+02, threshold=7.259e+02, percent-clipped=3.0 2023-06-19 20:51:47,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=413814.0, ans=0.125 2023-06-19 20:51:55,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=413874.0, ans=0.0 2023-06-19 20:52:18,832 INFO [train.py:996] (3/4) Epoch 3, batch 8000, loss[loss=0.2894, simple_loss=0.4086, pruned_loss=0.08508, over 20798.00 frames. 
], tot_loss[loss=0.2661, simple_loss=0.3334, pruned_loss=0.09942, over 4270554.91 frames. ], batch size: 607, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:53:14,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=414054.0, ans=0.0 2023-06-19 20:53:19,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=414054.0, ans=0.09899494936611666 2023-06-19 20:53:28,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=414114.0, ans=0.125 2023-06-19 20:53:53,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=414174.0, ans=0.125 2023-06-19 20:54:09,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=414174.0, ans=0.07 2023-06-19 20:54:25,795 INFO [train.py:996] (3/4) Epoch 3, batch 8050, loss[loss=0.2457, simple_loss=0.3245, pruned_loss=0.08344, over 21817.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3362, pruned_loss=0.0993, over 4268140.07 frames. ], batch size: 282, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:55:04,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=414294.0, ans=0.0 2023-06-19 20:55:10,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=414354.0, ans=0.0 2023-06-19 20:55:40,679 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.904e+02 3.441e+02 4.463e+02 1.081e+03, threshold=6.883e+02, percent-clipped=4.0 2023-06-19 20:55:48,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=414414.0, ans=0.0 2023-06-19 20:56:10,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=414474.0, ans=0.1 2023-06-19 20:56:13,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=414474.0, ans=0.125 2023-06-19 20:56:18,524 INFO [train.py:996] (3/4) Epoch 3, batch 8100, loss[loss=0.2839, simple_loss=0.3411, pruned_loss=0.1134, over 20882.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3322, pruned_loss=0.09868, over 4264032.81 frames. ], batch size: 608, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:56:20,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=414534.0, ans=0.125 2023-06-19 20:57:10,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=414654.0, ans=0.125 2023-06-19 20:58:51,122 INFO [train.py:996] (3/4) Epoch 3, batch 8150, loss[loss=0.2601, simple_loss=0.3485, pruned_loss=0.08587, over 21704.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3433, pruned_loss=0.101, over 4269514.83 frames. 
], batch size: 298, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 20:59:54,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414954.0, ans=0.1 2023-06-19 21:00:00,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=415014.0, ans=0.125 2023-06-19 21:00:01,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.727e+02 3.206e+02 3.958e+02 8.712e+02, threshold=6.412e+02, percent-clipped=8.0 2023-06-19 21:00:08,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=415014.0, ans=0.125 2023-06-19 21:00:28,964 INFO [train.py:996] (3/4) Epoch 3, batch 8200, loss[loss=0.2442, simple_loss=0.2977, pruned_loss=0.09539, over 20801.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3337, pruned_loss=0.0978, over 4268584.24 frames. ], batch size: 609, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:00:44,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-19 21:01:57,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=415314.0, ans=0.2 2023-06-19 21:02:11,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415374.0, ans=0.1 2023-06-19 21:02:26,219 INFO [train.py:996] (3/4) Epoch 3, batch 8250, loss[loss=0.2316, simple_loss=0.3144, pruned_loss=0.07442, over 21673.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3313, pruned_loss=0.09714, over 4263701.66 frames. ], batch size: 247, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:02:42,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=415434.0, ans=0.125 2023-06-19 21:02:47,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=415494.0, ans=0.125 2023-06-19 21:03:33,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=415554.0, ans=0.0 2023-06-19 21:03:36,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=415554.0, ans=0.0 2023-06-19 21:03:47,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.708e+02 3.103e+02 3.539e+02 5.585e+02, threshold=6.206e+02, percent-clipped=0.0 2023-06-19 21:04:23,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=415674.0, ans=6.0 2023-06-19 21:04:29,217 INFO [train.py:996] (3/4) Epoch 3, batch 8300, loss[loss=0.2853, simple_loss=0.4187, pruned_loss=0.07595, over 20737.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3293, pruned_loss=0.09358, over 4273184.28 frames. 
], batch size: 607, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:04:29,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=415734.0, ans=0.0 2023-06-19 21:04:48,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=415734.0, ans=0.125 2023-06-19 21:05:21,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=415854.0, ans=0.125 2023-06-19 21:05:25,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=415854.0, ans=0.0 2023-06-19 21:05:47,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=415914.0, ans=0.02 2023-06-19 21:06:12,433 INFO [train.py:996] (3/4) Epoch 3, batch 8350, loss[loss=0.2497, simple_loss=0.318, pruned_loss=0.09066, over 21726.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3266, pruned_loss=0.09072, over 4274715.28 frames. ], batch size: 351, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:06:57,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=416094.0, ans=0.125 2023-06-19 21:07:19,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=416154.0, ans=0.0 2023-06-19 21:07:24,673 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.460e+02 2.877e+02 3.560e+02 6.454e+02, threshold=5.755e+02, percent-clipped=1.0 2023-06-19 21:08:24,297 INFO [train.py:996] (3/4) Epoch 3, batch 8400, loss[loss=0.1972, simple_loss=0.275, pruned_loss=0.05974, over 21199.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3239, pruned_loss=0.08869, over 4270414.48 frames. ], batch size: 143, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:08:37,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=416334.0, ans=0.05 2023-06-19 21:09:12,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=416454.0, ans=0.125 2023-06-19 21:09:23,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=416454.0, ans=0.0 2023-06-19 21:09:34,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=416514.0, ans=0.0 2023-06-19 21:09:39,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=416574.0, ans=0.125 2023-06-19 21:09:55,429 INFO [train.py:996] (3/4) Epoch 3, batch 8450, loss[loss=0.2385, simple_loss=0.3004, pruned_loss=0.08832, over 21852.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.321, pruned_loss=0.08703, over 4279423.52 frames. ], batch size: 282, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:10:05,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.28 vs. 
limit=12.0 2023-06-19 21:10:14,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=416634.0, ans=0.2 2023-06-19 21:10:28,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-19 21:11:06,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.528e+02 3.268e+02 4.022e+02 7.297e+02, threshold=6.535e+02, percent-clipped=4.0 2023-06-19 21:11:22,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416874.0, ans=0.1 2023-06-19 21:11:53,085 INFO [train.py:996] (3/4) Epoch 3, batch 8500, loss[loss=0.2269, simple_loss=0.275, pruned_loss=0.08944, over 21245.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3189, pruned_loss=0.08999, over 4272386.21 frames. ], batch size: 548, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:12:38,594 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:13:56,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=417174.0, ans=0.0 2023-06-19 21:13:56,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=417174.0, ans=0.125 2023-06-19 21:13:58,920 INFO [train.py:996] (3/4) Epoch 3, batch 8550, loss[loss=0.2439, simple_loss=0.3354, pruned_loss=0.07623, over 21810.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3236, pruned_loss=0.09358, over 4282346.10 frames. ], batch size: 282, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:14:34,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=417294.0, ans=0.0 2023-06-19 21:15:14,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.754e+02 3.232e+02 3.747e+02 6.984e+02, threshold=6.464e+02, percent-clipped=1.0 2023-06-19 21:15:38,597 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.35 vs. limit=10.0 2023-06-19 21:16:08,809 INFO [train.py:996] (3/4) Epoch 3, batch 8600, loss[loss=0.2768, simple_loss=0.3418, pruned_loss=0.1059, over 21705.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3326, pruned_loss=0.09673, over 4279900.46 frames. ], batch size: 298, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:16:24,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=417534.0, ans=0.125 2023-06-19 21:16:56,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=417654.0, ans=0.125 2023-06-19 21:17:13,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.65 vs. 
limit=6.0 2023-06-19 21:17:14,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=417714.0, ans=0.0 2023-06-19 21:17:54,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=417774.0, ans=0.125 2023-06-19 21:18:02,541 INFO [train.py:996] (3/4) Epoch 3, batch 8650, loss[loss=0.2042, simple_loss=0.3017, pruned_loss=0.05331, over 21839.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3396, pruned_loss=0.09859, over 4281947.26 frames. ], batch size: 316, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:18:04,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=417834.0, ans=0.125 2023-06-19 21:18:05,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=417834.0, ans=0.125 2023-06-19 21:18:12,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-19 21:18:31,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=417894.0, ans=0.2 2023-06-19 21:18:37,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=417894.0, ans=0.0 2023-06-19 21:18:49,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-19 21:18:57,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=417954.0, ans=0.125 2023-06-19 21:19:20,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 2.624e+02 3.051e+02 3.895e+02 8.480e+02, threshold=6.103e+02, percent-clipped=4.0 2023-06-19 21:19:24,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418014.0, ans=0.1 2023-06-19 21:19:24,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-19 21:19:30,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=418014.0, ans=0.0 2023-06-19 21:19:50,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=418074.0, ans=0.2 2023-06-19 21:19:52,981 INFO [train.py:996] (3/4) Epoch 3, batch 8700, loss[loss=0.2195, simple_loss=0.283, pruned_loss=0.07799, over 21535.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3301, pruned_loss=0.09397, over 4282795.66 frames. ], batch size: 196, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:20:15,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418134.0, ans=0.1 2023-06-19 21:20:18,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418194.0, ans=0.1 2023-06-19 21:21:55,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. 
limit=15.0 2023-06-19 21:22:05,624 INFO [train.py:996] (3/4) Epoch 3, batch 8750, loss[loss=0.2545, simple_loss=0.3166, pruned_loss=0.0962, over 21820.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3265, pruned_loss=0.09472, over 4286724.75 frames. ], batch size: 298, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:22:07,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2023-06-19 21:22:35,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=418554.0, ans=0.125 2023-06-19 21:22:59,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.653e+02 3.166e+02 3.937e+02 6.791e+02, threshold=6.332e+02, percent-clipped=3.0 2023-06-19 21:23:32,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-19 21:23:43,287 INFO [train.py:996] (3/4) Epoch 3, batch 8800, loss[loss=0.2202, simple_loss=0.3259, pruned_loss=0.05723, over 20827.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3349, pruned_loss=0.09793, over 4285905.39 frames. ], batch size: 608, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:24:08,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.06 vs. limit=12.0 2023-06-19 21:24:12,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-19 21:25:29,645 INFO [train.py:996] (3/4) Epoch 3, batch 8850, loss[loss=0.2623, simple_loss=0.3478, pruned_loss=0.08836, over 21718.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3418, pruned_loss=0.1003, over 4288540.95 frames. ], batch size: 332, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:25:38,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-19 21:26:04,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=419154.0, ans=0.0 2023-06-19 21:26:43,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.885e+02 3.451e+02 3.948e+02 6.880e+02, threshold=6.902e+02, percent-clipped=2.0 2023-06-19 21:26:50,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419214.0, ans=0.1 2023-06-19 21:27:20,930 INFO [train.py:996] (3/4) Epoch 3, batch 8900, loss[loss=0.2884, simple_loss=0.3641, pruned_loss=0.1064, over 21602.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3377, pruned_loss=0.1, over 4274336.69 frames. 
], batch size: 414, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:27:24,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=419334.0, ans=0.2 2023-06-19 21:27:31,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=419334.0, ans=0.0 2023-06-19 21:27:36,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=419394.0, ans=0.125 2023-06-19 21:28:53,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=419574.0, ans=0.2 2023-06-19 21:29:00,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=419574.0, ans=0.125 2023-06-19 21:29:26,018 INFO [train.py:996] (3/4) Epoch 3, batch 8950, loss[loss=0.2496, simple_loss=0.3131, pruned_loss=0.09305, over 21644.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3371, pruned_loss=0.09876, over 4278822.93 frames. ], batch size: 263, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:30:12,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=419694.0, ans=0.125 2023-06-19 21:30:25,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-19 21:30:46,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.693e+02 3.248e+02 3.972e+02 8.193e+02, threshold=6.496e+02, percent-clipped=3.0 2023-06-19 21:30:48,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=419814.0, ans=0.125 2023-06-19 21:31:05,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=419874.0, ans=0.0 2023-06-19 21:31:07,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.82 vs. limit=10.0 2023-06-19 21:31:08,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.21 vs. limit=5.0 2023-06-19 21:31:13,411 INFO [train.py:996] (3/4) Epoch 3, batch 9000, loss[loss=0.2224, simple_loss=0.2796, pruned_loss=0.0826, over 21831.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.332, pruned_loss=0.09854, over 4279919.07 frames. ], batch size: 118, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:31:13,411 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 21:31:59,488 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2739, simple_loss=0.372, pruned_loss=0.08794, over 1796401.00 frames. 
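The loss[...] / tot_loss[...] figures throughout this log report three numbers: the simple (linear-lattice) transducer loss, the pruned transducer loss, and their weighted combination. Past warm-up the combination is simply simple_loss_scale * simple_loss + pruned_loss, which the validation entry just above reproduces: 0.5 * 0.372 + 0.08794 = 0.2739. A minimal sketch, assuming simple_loss_scale=0.5 and warm_step=2000 (the values this run logs at startup), and assuming the usual icefall warm-up ramp; the ramp coefficients below are an assumption, not read from this log:

    def combined_loss(simple_loss, pruned_loss, batch_idx_train,
                      simple_loss_scale=0.5, warm_step=2000):
        # After warm_step batches the weights settle at (simple_loss_scale, 1.0);
        # before that the simple loss is emphasised and the pruned loss damped.
        if batch_idx_train >= warm_step:
            s, p = simple_loss_scale, 1.0
        else:
            frac = batch_idx_train / warm_step
            s = 1.0 - frac * (1.0 - simple_loss_scale)
            p = 0.1 + 0.9 * frac
        return s * simple_loss + p * pruned_loss

    # Reproduces the validation entry above (far past warm-up):
    print(combined_loss(0.372, 0.08794, batch_idx_train=10**6))  # 0.27394 -> logged as 0.2739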
2023-06-19 21:31:59,490 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-19 21:32:53,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=420054.0, ans=0.125 2023-06-19 21:33:02,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=420114.0, ans=0.0 2023-06-19 21:33:04,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=420114.0, ans=0.07 2023-06-19 21:33:37,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=420174.0, ans=0.125 2023-06-19 21:33:45,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=420174.0, ans=0.2 2023-06-19 21:33:46,165 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-19 21:33:49,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-19 21:33:49,578 INFO [train.py:996] (3/4) Epoch 3, batch 9050, loss[loss=0.267, simple_loss=0.3352, pruned_loss=0.0994, over 21698.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3268, pruned_loss=0.09533, over 4278367.77 frames. ], batch size: 351, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:34:14,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=420234.0, ans=0.09899494936611666 2023-06-19 21:34:26,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-19 21:34:27,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=420294.0, ans=0.0 2023-06-19 21:34:49,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=420354.0, ans=0.0 2023-06-19 21:35:09,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.569e+02 2.874e+02 3.466e+02 6.244e+02, threshold=5.748e+02, percent-clipped=0.0 2023-06-19 21:35:27,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=420414.0, ans=0.0 2023-06-19 21:36:03,323 INFO [train.py:996] (3/4) Epoch 3, batch 9100, loss[loss=0.2768, simple_loss=0.3486, pruned_loss=0.1025, over 21777.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3333, pruned_loss=0.0979, over 4275279.48 frames. ], batch size: 118, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:36:10,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=420534.0, ans=0.125 2023-06-19 21:36:12,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=420534.0, ans=0.07 2023-06-19 21:38:17,220 INFO [train.py:996] (3/4) Epoch 3, batch 9150, loss[loss=0.315, simple_loss=0.3968, pruned_loss=0.1166, over 21512.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3343, pruned_loss=0.09434, over 4275971.25 frames. 
], batch size: 471, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:38:33,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=420834.0, ans=0.125 2023-06-19 21:38:51,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=420894.0, ans=0.125 2023-06-19 21:38:55,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=420894.0, ans=0.125 2023-06-19 21:39:41,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=420954.0, ans=0.2 2023-06-19 21:39:46,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.484e+02 2.819e+02 3.300e+02 4.761e+02, threshold=5.639e+02, percent-clipped=0.0 2023-06-19 21:39:50,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=421014.0, ans=0.2 2023-06-19 21:40:06,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=421074.0, ans=0.0 2023-06-19 21:40:22,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-19 21:40:30,502 INFO [train.py:996] (3/4) Epoch 3, batch 9200, loss[loss=0.276, simple_loss=0.351, pruned_loss=0.1005, over 21751.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3353, pruned_loss=0.09238, over 4275313.02 frames. ], batch size: 332, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:40:45,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=421134.0, ans=0.04949747468305833 2023-06-19 21:40:58,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=421194.0, ans=0.0 2023-06-19 21:41:04,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-19 21:41:48,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=421314.0, ans=0.0 2023-06-19 21:42:36,623 INFO [train.py:996] (3/4) Epoch 3, batch 9250, loss[loss=0.2643, simple_loss=0.3172, pruned_loss=0.1056, over 21258.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3384, pruned_loss=0.09618, over 4277966.23 frames. 
], batch size: 548, lr: 1.14e-02, grad_scale: 32.0
2023-06-19 21:42:41,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=421434.0, ans=0.0
2023-06-19 21:42:55,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=421494.0, ans=0.125
2023-06-19 21:43:30,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.625e+02 3.038e+02 3.484e+02 5.422e+02, threshold=6.077e+02, percent-clipped=0.0
2023-06-19 21:43:35,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=421614.0, ans=0.1
2023-06-19 21:43:50,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=421674.0, ans=0.2
2023-06-19 21:44:17,070 INFO [train.py:996] (3/4) Epoch 3, batch 9300, loss[loss=0.2559, simple_loss=0.303, pruned_loss=0.1044, over 21873.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3321, pruned_loss=0.09543, over 4267265.87 frames. ], batch size: 98, lr: 1.14e-02, grad_scale: 32.0
2023-06-19 21:45:03,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=421794.0, ans=0.125
2023-06-19 21:45:30,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=421914.0, ans=0.0
2023-06-19 21:46:29,840 INFO [train.py:996] (3/4) Epoch 3, batch 9350, loss[loss=0.2796, simple_loss=0.3549, pruned_loss=0.1022, over 21611.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3386, pruned_loss=0.0971, over 4270567.27 frames. ], batch size: 230, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 21:47:13,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=422094.0, ans=0.1
2023-06-19 21:47:50,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.663e+02 3.172e+02 3.860e+02 6.008e+02, threshold=6.345e+02, percent-clipped=0.0
2023-06-19 21:47:59,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=422214.0, ans=0.125
2023-06-19 21:48:15,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=422274.0, ans=0.125
2023-06-19 21:48:22,862 INFO [train.py:996] (3/4) Epoch 3, batch 9400, loss[loss=0.245, simple_loss=0.3018, pruned_loss=0.09412, over 21740.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3404, pruned_loss=0.09817, over 4278373.84 frames. ], batch size: 112, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 21:48:24,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=422334.0, ans=0.2
2023-06-19 21:48:27,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=422334.0, ans=0.125
2023-06-19 21:48:37,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=422334.0, ans=0.035
2023-06-19 21:48:42,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=422394.0, ans=0.04949747468305833
2023-06-19 21:49:50,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=422514.0, ans=0.125
2023-06-19 21:50:11,529 INFO [train.py:996] (3/4) Epoch 3, batch 9450, loss[loss=0.2081, simple_loss=0.2625, pruned_loss=0.07683, over 21300.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3308, pruned_loss=0.09596, over 4268650.16 frames. ], batch size: 551, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 21:50:14,224 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=8.0
2023-06-19 21:50:16,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=422634.0, ans=0.125
2023-06-19 21:50:47,124 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 21:50:59,344 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 21:51:28,769 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.824e+02 3.255e+02 3.921e+02 7.411e+02, threshold=6.510e+02, percent-clipped=5.0
2023-06-19 21:51:50,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.33 vs. limit=6.0
2023-06-19 21:52:04,913 INFO [train.py:996] (3/4) Epoch 3, batch 9500, loss[loss=0.2364, simple_loss=0.2981, pruned_loss=0.08735, over 21509.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3229, pruned_loss=0.09388, over 4251503.75 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 21:52:05,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=422934.0, ans=0.125
2023-06-19 21:52:24,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=422994.0, ans=0.0
2023-06-19 21:52:24,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=422994.0, ans=0.125
2023-06-19 21:52:30,133 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 21:53:29,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=423114.0, ans=0.125
2023-06-19 21:53:35,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0
2023-06-19 21:53:36,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=423174.0, ans=0.0
2023-06-19 21:53:40,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=423174.0, ans=0.125
2023-06-19 21:53:48,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=423174.0, ans=0.125
2023-06-19 21:53:50,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=423234.0, ans=0.0
2023-06-19 21:53:51,866 INFO [train.py:996] (3/4) Epoch 3, batch 9550, loss[loss=0.2879, simple_loss=0.3719, pruned_loss=0.102, over 21477.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3275, pruned_loss=0.09611, over 4259384.51 frames. ], batch size: 211, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 21:55:14,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.688e+02 3.199e+02 3.613e+02 5.972e+02, threshold=6.398e+02, percent-clipped=0.0
2023-06-19 21:55:24,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0
2023-06-19 21:55:28,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=423474.0, ans=0.125
2023-06-19 21:55:41,570 INFO [train.py:996] (3/4) Epoch 3, batch 9600, loss[loss=0.2741, simple_loss=0.3263, pruned_loss=0.1109, over 21396.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3303, pruned_loss=0.09827, over 4262863.69 frames. ], batch size: 159, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 21:55:44,035 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0
2023-06-19 21:56:25,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423594.0, ans=0.1
2023-06-19 21:57:31,851 INFO [train.py:996] (3/4) Epoch 3, batch 9650, loss[loss=0.2569, simple_loss=0.3287, pruned_loss=0.09253, over 21761.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3322, pruned_loss=0.09926, over 4268335.73 frames. ], batch size: 332, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 21:57:34,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=12.0
2023-06-19 21:59:07,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0
2023-06-19 21:59:07,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.684e+02 3.158e+02 3.769e+02 7.574e+02, threshold=6.315e+02, percent-clipped=3.0
2023-06-19 21:59:29,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=424074.0, ans=0.02
2023-06-19 21:59:41,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=424074.0, ans=0.125
2023-06-19 21:59:53,775 INFO [train.py:996] (3/4) Epoch 3, batch 9700, loss[loss=0.2782, simple_loss=0.3705, pruned_loss=0.09298, over 20728.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3352, pruned_loss=0.09938, over 4273159.90 frames. ], batch size: 607, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:00:22,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424194.0, ans=0.1
2023-06-19 22:00:51,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=424254.0, ans=0.125
2023-06-19 22:01:38,961 INFO [train.py:996] (3/4) Epoch 3, batch 9750, loss[loss=0.2644, simple_loss=0.3032, pruned_loss=0.1128, over 21502.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3299, pruned_loss=0.09838, over 4268772.69 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:02:04,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424494.0, ans=0.1
2023-06-19 22:02:35,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=424554.0, ans=0.05
2023-06-19 22:02:37,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.99 vs. limit=15.0
2023-06-19 22:02:44,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.697e+02 3.201e+02 3.859e+02 6.121e+02, threshold=6.401e+02, percent-clipped=0.0
2023-06-19 22:02:50,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=424614.0, ans=0.125
2023-06-19 22:02:52,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=424614.0, ans=0.125
2023-06-19 22:03:13,840 INFO [train.py:996] (3/4) Epoch 3, batch 9800, loss[loss=0.2706, simple_loss=0.339, pruned_loss=0.1011, over 21845.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3291, pruned_loss=0.09874, over 4267393.93 frames. ], batch size: 124, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:03:30,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=424734.0, ans=0.125
2023-06-19 22:04:00,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=424794.0, ans=0.125
2023-06-19 22:04:04,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=424854.0, ans=0.125
2023-06-19 22:04:24,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=424914.0, ans=0.125
2023-06-19 22:04:37,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=424914.0, ans=0.0
2023-06-19 22:04:59,797 INFO [train.py:996] (3/4) Epoch 3, batch 9850, loss[loss=0.2406, simple_loss=0.3152, pruned_loss=0.08305, over 20744.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.326, pruned_loss=0.09818, over 4249773.01 frames. ], batch size: 607, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:05:30,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0
2023-06-19 22:05:34,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=425094.0, ans=0.95
2023-06-19 22:06:21,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=425214.0, ans=0.1
2023-06-19 22:06:23,733 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.545e+02 2.826e+02 3.275e+02 4.693e+02, threshold=5.651e+02, percent-clipped=0.0
2023-06-19 22:07:14,429 INFO [train.py:996] (3/4) Epoch 3, batch 9900, loss[loss=0.2509, simple_loss=0.328, pruned_loss=0.08688, over 21444.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.321, pruned_loss=0.09679, over 4244445.88 frames. ], batch size: 194, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:07:18,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=425334.0, ans=0.125
2023-06-19 22:07:29,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0
2023-06-19 22:08:47,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=425574.0, ans=0.125
2023-06-19 22:08:47,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=425574.0, ans=0.125
2023-06-19 22:08:51,745 INFO [train.py:996] (3/4) Epoch 3, batch 9950, loss[loss=0.2443, simple_loss=0.302, pruned_loss=0.09325, over 21601.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3221, pruned_loss=0.09837, over 4255190.91 frames. ], batch size: 263, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:08:53,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=425634.0, ans=0.0
2023-06-19 22:09:15,617 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 22:09:47,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=425754.0, ans=0.125
2023-06-19 22:10:05,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=425814.0, ans=0.125
2023-06-19 22:10:11,500 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.786e+02 3.203e+02 4.575e+02 9.878e+02, threshold=6.406e+02, percent-clipped=15.0
2023-06-19 22:10:54,983 INFO [train.py:996] (3/4) Epoch 3, batch 10000, loss[loss=0.1879, simple_loss=0.2489, pruned_loss=0.06341, over 21240.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3155, pruned_loss=0.09614, over 4252791.54 frames. ], batch size: 159, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:11:06,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=425934.0, ans=0.125
2023-06-19 22:13:06,295 INFO [train.py:996] (3/4) Epoch 3, batch 10050, loss[loss=0.2413, simple_loss=0.3126, pruned_loss=0.085, over 21773.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3192, pruned_loss=0.09773, over 4255796.35 frames. ], batch size: 124, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:13:35,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=426294.0, ans=0.125
2023-06-19 22:14:03,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=426354.0, ans=0.1
2023-06-19 22:14:14,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=22.5
2023-06-19 22:14:21,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0
2023-06-19 22:14:22,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.478e+02 2.837e+02 3.285e+02 4.855e+02, threshold=5.673e+02, percent-clipped=0.0
2023-06-19 22:15:05,151 INFO [train.py:996] (3/4) Epoch 3, batch 10100, loss[loss=0.2947, simple_loss=0.3596, pruned_loss=0.1149, over 19886.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3178, pruned_loss=0.0953, over 4238857.09 frames. ], batch size: 702, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:16:30,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=426654.0, ans=0.125
2023-06-19 22:16:37,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0
2023-06-19 22:16:42,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=426714.0, ans=0.0
2023-06-19 22:16:42,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0
2023-06-19 22:17:02,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0
2023-06-19 22:17:13,931 INFO [train.py:996] (3/4) Epoch 3, batch 10150, loss[loss=0.2338, simple_loss=0.2962, pruned_loss=0.08575, over 15906.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3238, pruned_loss=0.09781, over 4237422.34 frames. ], batch size: 60, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:17:14,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=426834.0, ans=0.2
2023-06-19 22:17:30,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426834.0, ans=0.1
2023-06-19 22:17:42,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=426894.0, ans=0.0
2023-06-19 22:17:59,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=426954.0, ans=0.125
2023-06-19 22:18:15,038 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.616e+02 3.111e+02 3.928e+02 6.144e+02, threshold=6.222e+02, percent-clipped=2.0
2023-06-19 22:18:25,437 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 22:18:54,598 INFO [train.py:996] (3/4) Epoch 3, batch 10200, loss[loss=0.202, simple_loss=0.2703, pruned_loss=0.06688, over 21172.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3207, pruned_loss=0.09469, over 4241112.64 frames. ], batch size: 143, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:19:54,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=427314.0, ans=0.0
2023-06-19 22:20:43,644 INFO [train.py:996] (3/4) Epoch 3, batch 10250, loss[loss=0.1733, simple_loss=0.2602, pruned_loss=0.04314, over 21391.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3145, pruned_loss=0.08845, over 4235176.16 frames. ], batch size: 211, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:20:52,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0
2023-06-19 22:20:52,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0
2023-06-19 22:20:57,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=427494.0, ans=0.0
2023-06-19 22:21:57,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.458e+02 2.783e+02 3.278e+02 6.025e+02, threshold=5.566e+02, percent-clipped=0.0
2023-06-19 22:22:29,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=427674.0, ans=0.04949747468305833
2023-06-19 22:22:30,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=427674.0, ans=10.0
2023-06-19 22:22:34,959 INFO [train.py:996] (3/4) Epoch 3, batch 10300, loss[loss=0.2983, simple_loss=0.3614, pruned_loss=0.1176, over 20045.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.32, pruned_loss=0.09152, over 4238509.43 frames. ], batch size: 702, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:22:38,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427734.0, ans=0.1
2023-06-19 22:23:13,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=427794.0, ans=0.95
2023-06-19 22:23:16,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=427854.0, ans=0.1
2023-06-19 22:24:07,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=427914.0, ans=0.125
2023-06-19 22:24:16,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=427914.0, ans=0.2
2023-06-19 22:24:35,486 INFO [train.py:996] (3/4) Epoch 3, batch 10350, loss[loss=0.2037, simple_loss=0.2663, pruned_loss=0.07051, over 21643.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3194, pruned_loss=0.09049, over 4244112.74 frames. ], batch size: 247, lr: 1.13e-02, grad_scale: 16.0
2023-06-19 22:25:03,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=22.5
2023-06-19 22:25:05,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0
2023-06-19 22:26:05,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.731e+02 3.156e+02 3.865e+02 6.319e+02, threshold=6.313e+02, percent-clipped=4.0
2023-06-19 22:26:12,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=428274.0, ans=0.125
2023-06-19 22:26:38,494 INFO [train.py:996] (3/4) Epoch 3, batch 10400, loss[loss=0.2445, simple_loss=0.321, pruned_loss=0.084, over 21916.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.312, pruned_loss=0.0884, over 4247944.28 frames. ], batch size: 373, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:27:20,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.76 vs. limit=6.0
2023-06-19 22:27:20,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=428394.0, ans=0.125
2023-06-19 22:27:29,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=428454.0, ans=0.07
2023-06-19 22:27:38,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0
2023-06-19 22:27:56,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=428454.0, ans=0.09899494936611666
2023-06-19 22:28:06,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=428514.0, ans=0.125
2023-06-19 22:28:43,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=428574.0, ans=0.0
2023-06-19 22:28:45,450 INFO [train.py:996] (3/4) Epoch 3, batch 10450, loss[loss=0.2604, simple_loss=0.3274, pruned_loss=0.09674, over 21409.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3181, pruned_loss=0.09229, over 4249773.22 frames. ], batch size: 131, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:29:24,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=428694.0, ans=0.0
2023-06-19 22:29:55,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.51 vs. limit=8.0
2023-06-19 22:29:56,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.33 vs. limit=10.0
2023-06-19 22:30:13,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=428814.0, ans=0.05
2023-06-19 22:30:16,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.823e+02 3.547e+02 4.478e+02 9.217e+02, threshold=7.094e+02, percent-clipped=7.0
2023-06-19 22:30:39,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=428874.0, ans=0.5
2023-06-19 22:30:42,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428874.0, ans=0.1
2023-06-19 22:30:55,059 INFO [train.py:996] (3/4) Epoch 3, batch 10500, loss[loss=0.2417, simple_loss=0.3032, pruned_loss=0.09007, over 21717.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3168, pruned_loss=0.0907, over 4251515.25 frames. ], batch size: 351, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:31:22,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=428994.0, ans=0.05
2023-06-19 22:31:46,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=429054.0, ans=0.0
2023-06-19 22:31:55,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=429054.0, ans=0.2
2023-06-19 22:32:22,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=429174.0, ans=0.0
2023-06-19 22:32:45,447 INFO [train.py:996] (3/4) Epoch 3, batch 10550, loss[loss=0.2365, simple_loss=0.2954, pruned_loss=0.08875, over 21849.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.311, pruned_loss=0.08994, over 4251008.68 frames. ], batch size: 373, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:33:41,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=429354.0, ans=0.125
2023-06-19 22:33:58,344 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 22:33:59,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=429414.0, ans=0.0
2023-06-19 22:34:03,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.386e+02 2.842e+02 3.360e+02 5.942e+02, threshold=5.684e+02, percent-clipped=0.0
2023-06-19 22:34:49,235 INFO [train.py:996] (3/4) Epoch 3, batch 10600, loss[loss=0.2245, simple_loss=0.3025, pruned_loss=0.0732, over 21758.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3074, pruned_loss=0.08855, over 4248726.40 frames. ], batch size: 282, lr: 1.13e-02, grad_scale: 32.0
2023-06-19 22:35:30,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=429594.0, ans=0.0
2023-06-19 22:35:37,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=429654.0, ans=0.1
2023-06-19 22:35:50,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=429714.0, ans=0.125
2023-06-19 22:35:55,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=429714.0, ans=0.125
2023-06-19 22:36:34,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=429774.0, ans=0.1
2023-06-19 22:36:52,427 INFO [train.py:996] (3/4) Epoch 3, batch 10650, loss[loss=0.1883, simple_loss=0.2669, pruned_loss=0.05485, over 21583.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3126, pruned_loss=0.08802, over 4252606.89 frames. ], batch size: 263, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:37:20,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=429894.0, ans=0.125
2023-06-19 22:37:21,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=429894.0, ans=0.125
2023-06-19 22:37:26,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=429894.0, ans=0.1
2023-06-19 22:37:30,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=429894.0, ans=0.125
2023-06-19 22:38:19,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.760e+02 3.791e+02 4.826e+02 9.694e+02, threshold=7.582e+02, percent-clipped=13.0
2023-06-19 22:38:56,295 INFO [train.py:996] (3/4) Epoch 3, batch 10700, loss[loss=0.2363, simple_loss=0.3041, pruned_loss=0.08427, over 21324.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.312, pruned_loss=0.08845, over 4254664.69 frames. ], batch size: 159, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:39:15,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=430134.0, ans=0.0
2023-06-19 22:40:25,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=22.5
2023-06-19 22:40:42,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=430374.0, ans=0.0
2023-06-19 22:40:44,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=22.5
2023-06-19 22:40:55,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=430434.0, ans=0.125
2023-06-19 22:41:02,108 INFO [train.py:996] (3/4) Epoch 3, batch 10750, loss[loss=0.3091, simple_loss=0.3975, pruned_loss=0.1103, over 21672.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3236, pruned_loss=0.09375, over 4254982.98 frames. ], batch size: 414, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:42:07,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.75 vs. limit=10.0
2023-06-19 22:42:22,447 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.840e+02 3.445e+02 4.011e+02 6.659e+02, threshold=6.891e+02, percent-clipped=0.0
2023-06-19 22:43:02,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=430734.0, ans=0.125
2023-06-19 22:43:03,921 INFO [train.py:996] (3/4) Epoch 3, batch 10800, loss[loss=0.3003, simple_loss=0.3624, pruned_loss=0.1191, over 21568.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3301, pruned_loss=0.09474, over 4257420.14 frames. ], batch size: 389, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:43:06,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0
2023-06-19 22:43:26,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=430734.0, ans=0.2
2023-06-19 22:44:12,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=430854.0, ans=0.2
2023-06-19 22:44:31,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=430914.0, ans=0.1
2023-06-19 22:45:01,151 INFO [train.py:996] (3/4) Epoch 3, batch 10850, loss[loss=0.2235, simple_loss=0.2862, pruned_loss=0.08035, over 21919.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3314, pruned_loss=0.09529, over 4261928.01 frames. ], batch size: 373, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:45:08,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=431034.0, ans=0.125
2023-06-19 22:46:22,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.020e+02 2.541e+02 3.011e+02 3.427e+02 5.318e+02, threshold=6.021e+02, percent-clipped=0.0
2023-06-19 22:46:23,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=431214.0, ans=0.0
2023-06-19 22:46:24,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0
2023-06-19 22:46:26,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=431214.0, ans=0.1
2023-06-19 22:46:37,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0
2023-06-19 22:46:46,209 INFO [train.py:996] (3/4) Epoch 3, batch 10900, loss[loss=0.2149, simple_loss=0.2758, pruned_loss=0.07697, over 21293.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3237, pruned_loss=0.0926, over 4264429.16 frames. ], batch size: 177, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:48:30,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431574.0, ans=0.1
2023-06-19 22:48:42,865 INFO [train.py:996] (3/4) Epoch 3, batch 10950, loss[loss=0.2209, simple_loss=0.2851, pruned_loss=0.07831, over 21551.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3172, pruned_loss=0.08998, over 4265890.47 frames. ], batch size: 263, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:49:11,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0
2023-06-19 22:49:41,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=431754.0, ans=0.07
2023-06-19 22:49:51,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431814.0, ans=0.1
2023-06-19 22:50:09,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.597e+02 3.255e+02 3.821e+02 6.519e+02, threshold=6.510e+02, percent-clipped=2.0
2023-06-19 22:50:12,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=431814.0, ans=0.125
2023-06-19 22:50:45,267 INFO [train.py:996] (3/4) Epoch 3, batch 11000, loss[loss=0.2676, simple_loss=0.3248, pruned_loss=0.1052, over 21808.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3178, pruned_loss=0.09146, over 4270794.99 frames. ], batch size: 282, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:51:16,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0
2023-06-19 22:51:18,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=431994.0, ans=0.0
2023-06-19 22:52:05,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=432114.0, ans=0.0
2023-06-19 22:52:14,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=432114.0, ans=0.09899494936611666
2023-06-19 22:52:28,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432174.0, ans=0.1
2023-06-19 22:52:32,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=432234.0, ans=0.0
2023-06-19 22:52:33,731 INFO [train.py:996] (3/4) Epoch 3, batch 11050, loss[loss=0.2382, simple_loss=0.3, pruned_loss=0.08821, over 21864.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3156, pruned_loss=0.09283, over 4276492.26 frames. ], batch size: 98, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:53:23,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=432354.0, ans=0.125
2023-06-19 22:53:38,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=432414.0, ans=0.125
2023-06-19 22:53:41,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.34 vs. limit=15.0
2023-06-19 22:53:46,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.920e+02 3.396e+02 4.201e+02 8.716e+02, threshold=6.791e+02, percent-clipped=3.0
2023-06-19 22:53:52,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=432414.0, ans=0.0
2023-06-19 22:54:10,318 INFO [train.py:996] (3/4) Epoch 3, batch 11100, loss[loss=0.2742, simple_loss=0.3308, pruned_loss=0.1088, over 21524.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3138, pruned_loss=0.09251, over 4276012.91 frames. ], batch size: 441, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:55:28,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=432654.0, ans=0.0
2023-06-19 22:55:38,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=432714.0, ans=0.125
2023-06-19 22:55:50,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=432714.0, ans=0.0
2023-06-19 22:56:14,884 INFO [train.py:996] (3/4) Epoch 3, batch 11150, loss[loss=0.2216, simple_loss=0.2784, pruned_loss=0.08238, over 21475.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3114, pruned_loss=0.09211, over 4275684.71 frames. ], batch size: 195, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:56:19,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=432834.0, ans=0.04949747468305833
2023-06-19 22:57:14,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.03 vs. limit=15.0
2023-06-19 22:57:23,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=432954.0, ans=0.95
2023-06-19 22:57:41,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.651e+02 2.885e+02 3.515e+02 7.934e+02, threshold=5.769e+02, percent-clipped=2.0
2023-06-19 22:58:06,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=433074.0, ans=0.0
2023-06-19 22:58:09,300 INFO [train.py:996] (3/4) Epoch 3, batch 11200, loss[loss=0.2098, simple_loss=0.2888, pruned_loss=0.06538, over 21353.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3102, pruned_loss=0.09095, over 4266578.92 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 22:58:23,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=433134.0, ans=0.0
2023-06-19 22:58:25,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0
2023-06-19 22:58:29,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=433134.0, ans=0.0
2023-06-19 22:58:52,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0
2023-06-19 22:59:13,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=433254.0, ans=0.0
2023-06-19 22:59:18,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=433254.0, ans=0.125
2023-06-19 23:00:02,714 INFO [train.py:996] (3/4) Epoch 3, batch 11250, loss[loss=0.2389, simple_loss=0.313, pruned_loss=0.08239, over 21724.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3097, pruned_loss=0.09123, over 4254582.00 frames. ], batch size: 333, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:01:13,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=433554.0, ans=0.0
2023-06-19 23:01:40,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.431e+02 2.742e+02 3.352e+02 9.461e+02, threshold=5.484e+02, percent-clipped=5.0
2023-06-19 23:01:43,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=433614.0, ans=0.125
2023-06-19 23:02:04,705 INFO [train.py:996] (3/4) Epoch 3, batch 11300, loss[loss=0.239, simple_loss=0.2963, pruned_loss=0.09082, over 21519.00 frames. ], tot_loss[loss=0.247, simple_loss=0.311, pruned_loss=0.09151, over 4260239.55 frames. ], batch size: 211, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:02:18,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=433794.0, ans=0.125
2023-06-19 23:03:57,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.75 vs. limit=15.0
2023-06-19 23:04:00,915 INFO [train.py:996] (3/4) Epoch 3, batch 11350, loss[loss=0.3039, simple_loss=0.3738, pruned_loss=0.117, over 21810.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3129, pruned_loss=0.09024, over 4255749.43 frames. ], batch size: 118, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:04:32,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.75 vs. limit=10.0
2023-06-19 23:04:50,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=434094.0, ans=0.125
2023-06-19 23:05:19,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=434214.0, ans=0.0
2023-06-19 23:05:25,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.877e+02 3.365e+02 4.461e+02 8.303e+02, threshold=6.730e+02, percent-clipped=12.0
2023-06-19 23:06:05,030 INFO [train.py:996] (3/4) Epoch 3, batch 11400, loss[loss=0.284, simple_loss=0.3552, pruned_loss=0.1064, over 21279.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3189, pruned_loss=0.09375, over 4262453.66 frames. ], batch size: 549, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:06:34,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=434334.0, ans=12.0
2023-06-19 23:06:45,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=434394.0, ans=0.2
2023-06-19 23:07:18,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=434454.0, ans=10.0
2023-06-19 23:07:37,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434574.0, ans=0.1
2023-06-19 23:08:04,087 INFO [train.py:996] (3/4) Epoch 3, batch 11450, loss[loss=0.2301, simple_loss=0.2977, pruned_loss=0.08123, over 21304.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3189, pruned_loss=0.09219, over 4265371.86 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:08:04,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=434634.0, ans=0.0
2023-06-19 23:08:53,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=434694.0, ans=0.1
2023-06-19 23:09:32,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.596e+02 3.109e+02 3.660e+02 5.880e+02, threshold=6.218e+02, percent-clipped=0.0
2023-06-19 23:10:02,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=434874.0, ans=0.2
2023-06-19 23:10:08,102 INFO [train.py:996] (3/4) Epoch 3, batch 11500, loss[loss=0.3267, simple_loss=0.3892, pruned_loss=0.1321, over 21492.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3234, pruned_loss=0.09461, over 4270386.93 frames. ], batch size: 508, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:10:34,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=434934.0, ans=0.125
2023-06-19 23:11:20,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0
2023-06-19 23:11:24,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=435054.0, ans=0.125
2023-06-19 23:11:48,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=435114.0, ans=0.0
2023-06-19 23:12:11,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=435174.0, ans=0.125
2023-06-19 23:12:15,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.43 vs. limit=6.0
2023-06-19 23:12:15,586 INFO [train.py:996] (3/4) Epoch 3, batch 11550, loss[loss=0.3374, simple_loss=0.4514, pruned_loss=0.1117, over 21204.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3304, pruned_loss=0.09512, over 4270545.13 frames. ], batch size: 548, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:12:22,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435234.0, ans=0.1
2023-06-19 23:12:56,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=435294.0, ans=0.2
2023-06-19 23:13:46,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=435354.0, ans=0.0
2023-06-19 23:13:47,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=435414.0, ans=0.0
2023-06-19 23:13:54,258 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.665e+02 3.265e+02 4.118e+02 8.231e+02, threshold=6.531e+02, percent-clipped=5.0
2023-06-19 23:14:01,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=435414.0, ans=0.015
2023-06-19 23:14:29,436 INFO [train.py:996] (3/4) Epoch 3, batch 11600, loss[loss=0.3689, simple_loss=0.459, pruned_loss=0.1394, over 21492.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3453, pruned_loss=0.09752, over 4263345.51 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:14:30,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=435534.0, ans=0.2
2023-06-19 23:14:39,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=435534.0, ans=0.125
2023-06-19 23:14:56,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=435594.0, ans=0.2
2023-06-19 23:15:05,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=435654.0, ans=0.025
2023-06-19 23:16:09,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=435774.0, ans=0.125
2023-06-19 23:16:14,738 INFO [train.py:996] (3/4) Epoch 3, batch 11650, loss[loss=0.279, simple_loss=0.34, pruned_loss=0.109, over 21723.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3497, pruned_loss=0.0977, over 4261476.62 frames. ], batch size: 351, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:16:35,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=435834.0, ans=0.05
2023-06-19 23:16:38,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=435834.0, ans=0.0
2023-06-19 23:16:53,016 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.04 vs. limit=10.0
2023-06-19 23:16:55,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0
2023-06-19 23:17:47,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.535e+02 2.971e+02 3.665e+02 6.378e+02, threshold=5.943e+02, percent-clipped=0.0
2023-06-19 23:17:54,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436014.0, ans=0.1
2023-06-19 23:18:18,004 INFO [train.py:996] (3/4) Epoch 3, batch 11700, loss[loss=0.2032, simple_loss=0.2824, pruned_loss=0.06202, over 15200.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3405, pruned_loss=0.09679, over 4256353.85 frames. ], batch size: 61, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:18:22,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=436134.0, ans=0.125
2023-06-19 23:18:23,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=436134.0, ans=0.125
2023-06-19 23:18:27,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=436134.0, ans=0.0
2023-06-19 23:18:56,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=436254.0, ans=0.125
2023-06-19 23:19:24,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0
2023-06-19 23:19:50,331 INFO [train.py:996] (3/4) Epoch 3, batch 11750, loss[loss=0.2167, simple_loss=0.2745, pruned_loss=0.0794, over 21597.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3303, pruned_loss=0.0961, over 4265003.99 frames. ], batch size: 247, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:20:06,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=436434.0, ans=0.125
2023-06-19 23:20:25,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=436554.0, ans=0.0
2023-06-19 23:20:53,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.589e+02 3.123e+02 3.553e+02 5.230e+02, threshold=6.245e+02, percent-clipped=0.0
2023-06-19 23:21:00,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0
2023-06-19 23:21:11,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5
2023-06-19 23:21:17,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=436674.0, ans=0.125
2023-06-19 23:21:23,189 INFO [train.py:996] (3/4) Epoch 3, batch 11800, loss[loss=0.2385, simple_loss=0.3271, pruned_loss=0.07496, over 21574.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3315, pruned_loss=0.09779, over 4263941.20 frames. ], batch size: 230, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:21:46,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.07 vs. limit=15.0
2023-06-19 23:22:56,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=436914.0, ans=0.2
2023-06-19 23:23:19,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.95 vs. limit=15.0
2023-06-19 23:23:32,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=437034.0, ans=0.2
2023-06-19 23:23:33,153 INFO [train.py:996] (3/4) Epoch 3, batch 11850, loss[loss=0.2798, simple_loss=0.3716, pruned_loss=0.09399, over 20855.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3331, pruned_loss=0.09685, over 4265199.95 frames. ], batch size: 609, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:24:54,516 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.604e+02 3.025e+02 3.520e+02 5.085e+02, threshold=6.049e+02, percent-clipped=0.0
2023-06-19 23:25:06,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=437214.0, ans=0.125
2023-06-19 23:25:30,664 INFO [train.py:996] (3/4) Epoch 3, batch 11900, loss[loss=0.2167, simple_loss=0.3005, pruned_loss=0.06643, over 21582.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3335, pruned_loss=0.09408, over 4264332.14 frames. ], batch size: 263, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:25:41,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0
2023-06-19 23:25:42,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437334.0, ans=0.1
2023-06-19 23:27:00,975 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.78 vs. limit=6.0
2023-06-19 23:27:08,737 INFO [train.py:996] (3/4) Epoch 3, batch 11950, loss[loss=0.201, simple_loss=0.2857, pruned_loss=0.05812, over 21408.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3333, pruned_loss=0.09065, over 4269562.53 frames. ], batch size: 211, lr: 1.11e-02, grad_scale: 32.0
2023-06-19 23:27:47,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=437694.0, ans=0.0
2023-06-19 23:27:59,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0
2023-06-19 23:28:28,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.557e+02 3.205e+02 3.971e+02 7.967e+02, threshold=6.411e+02, percent-clipped=3.0
2023-06-19 23:28:48,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=437874.0, ans=0.0
2023-06-19 23:28:58,288 INFO [train.py:996] (3/4) Epoch 3, batch 12000, loss[loss=0.2561, simple_loss=0.3113, pruned_loss=0.1005, over 21545.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3294, pruned_loss=0.08951, over 4261831.37 frames. ], batch size: 414, lr: 1.11e-02, grad_scale: 32.0
2023-06-19 23:28:58,289 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-19 23:29:56,190 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2725, simple_loss=0.3684, pruned_loss=0.08831, over 1796401.00 frames.
2023-06-19 23:29:56,192 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-19 23:30:31,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437994.0, ans=0.1
2023-06-19 23:30:31,558 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 23:30:55,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438114.0, ans=0.1
2023-06-19 23:31:01,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=438114.0, ans=0.125
2023-06-19 23:31:32,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=438174.0, ans=0.125
2023-06-19 23:31:52,699 INFO [train.py:996] (3/4) Epoch 3, batch 12050, loss[loss=0.2491, simple_loss=0.3146, pruned_loss=0.0918, over 21296.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3256, pruned_loss=0.09156, over 4264237.96 frames. ], batch size: 176, lr: 1.11e-02, grad_scale: 32.0
2023-06-19 23:33:06,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.716e+02 3.222e+02 3.934e+02 6.549e+02, threshold=6.444e+02, percent-clipped=1.0
2023-06-19 23:33:13,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=438474.0, ans=0.125
2023-06-19 23:33:45,635 INFO [train.py:996] (3/4) Epoch 3, batch 12100, loss[loss=0.2604, simple_loss=0.3305, pruned_loss=0.09511, over 21628.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3318, pruned_loss=0.09703, over 4271823.91 frames. ], batch size: 230, lr: 1.11e-02, grad_scale: 32.0
2023-06-19 23:33:54,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=438534.0, ans=0.02
2023-06-19 23:33:58,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=438534.0, ans=0.025
2023-06-19 23:34:25,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=438594.0, ans=0.125
2023-06-19 23:34:25,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=12.0
2023-06-19 23:36:08,828 INFO [train.py:996] (3/4) Epoch 3, batch 12150, loss[loss=0.2412, simple_loss=0.3386, pruned_loss=0.07188, over 21853.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3331, pruned_loss=0.09598, over 4267164.62 frames. ], batch size: 316, lr: 1.11e-02, grad_scale: 16.0
2023-06-19 23:36:09,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0
2023-06-19 23:36:40,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=438894.0, ans=0.125
2023-06-19 23:36:43,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=438894.0, ans=0.1
2023-06-19 23:37:01,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=438954.0, ans=0.0
2023-06-19 23:37:37,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.132e+02 3.694e+02 4.434e+02 6.753e+02, threshold=7.387e+02, percent-clipped=1.0
2023-06-19 23:37:40,704 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.37 vs. limit=5.0
2023-06-19 23:37:42,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0
2023-06-19 23:38:11,589 INFO [train.py:996] (3/4) Epoch 3, batch 12200, loss[loss=0.2307, simple_loss=0.2817, pruned_loss=0.08984, over 21496.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3294, pruned_loss=0.09552, over 4261687.28 frames. ], batch size: 230, lr: 1.11e-02, grad_scale: 16.0
2023-06-19 23:38:18,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=439134.0, ans=0.0
2023-06-19 23:38:27,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=439194.0, ans=0.125
2023-06-19 23:38:51,693 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0
2023-06-19 23:38:59,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=439254.0, ans=0.025
2023-06-19 23:39:08,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=439314.0, ans=0.05
2023-06-19 23:39:28,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439314.0, ans=0.1
2023-06-19 23:40:09,937 INFO [train.py:996] (3/4) Epoch 3, batch 12250, loss[loss=0.2227, simple_loss=0.3137, pruned_loss=0.06587, over 21216.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3218, pruned_loss=0.09185, over 4249139.35 frames. ], batch size: 548, lr: 1.11e-02, grad_scale: 16.0
2023-06-19 23:40:43,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=439554.0, ans=0.0
2023-06-19 23:41:12,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 2.432e+02 2.824e+02 3.455e+02 5.989e+02, threshold=5.649e+02, percent-clipped=0.0
2023-06-19 23:41:35,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=439674.0, ans=0.0
2023-06-19 23:41:39,430 INFO [train.py:996] (3/4) Epoch 3, batch 12300, loss[loss=0.2464, simple_loss=0.3345, pruned_loss=0.0792, over 21790.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3128, pruned_loss=0.0845, over 4259780.19 frames. ], batch size: 351, lr: 1.11e-02, grad_scale: 16.0
2023-06-19 23:41:48,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=439734.0, ans=0.125
2023-06-19 23:42:46,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=439854.0, ans=0.0
2023-06-19 23:42:47,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=439854.0, ans=0.0
2023-06-19 23:43:47,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=439974.0, ans=0.0
2023-06-19 23:43:49,785 INFO [train.py:996] (3/4) Epoch 3, batch 12350, loss[loss=0.3006, simple_loss=0.3686, pruned_loss=0.1163, over 21774.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3188, pruned_loss=0.08652, over 4259543.47 frames. ], batch size: 441, lr: 1.11e-02, grad_scale: 16.0
2023-06-19 23:44:00,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0
2023-06-19 23:44:08,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=440094.0, ans=0.125
2023-06-19 23:44:51,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.628e+02 3.014e+02 3.788e+02 7.072e+02, threshold=6.028e+02, percent-clipped=4.0
2023-06-19 23:44:52,102 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 23:45:31,967 INFO [train.py:996] (3/4) Epoch 3, batch 12400, loss[loss=0.2728, simple_loss=0.3365, pruned_loss=0.1045, over 21508.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3233, pruned_loss=0.09213, over 4275773.38 frames. ], batch size: 131, lr: 1.11e-02, grad_scale: 32.0
2023-06-19 23:46:22,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=440454.0, ans=0.125
2023-06-19 23:46:36,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.53 vs. limit=22.5
2023-06-19 23:47:39,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=440574.0, ans=10.0
2023-06-19 23:47:47,502 INFO [train.py:996] (3/4) Epoch 3, batch 12450, loss[loss=0.2592, simple_loss=0.3329, pruned_loss=0.09277, over 21792.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3278, pruned_loss=0.09576, over 4281981.04 frames. ], batch size: 247, lr: 1.11e-02, grad_scale: 32.0
2023-06-19 23:48:02,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=440634.0, ans=0.125
2023-06-19 23:48:04,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=440634.0, ans=0.125
2023-06-19 23:48:21,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=440694.0, ans=0.5
2023-06-19 23:48:33,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=440754.0, ans=0.95
2023-06-19 23:48:39,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=440754.0, ans=0.0
2023-06-19 23:48:55,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0
2023-06-19 23:48:57,892 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.789e+02 3.094e+02 3.461e+02 5.452e+02, threshold=6.188e+02, percent-clipped=0.0
2023-06-19 23:49:31,696 INFO [train.py:996] (3/4) Epoch 3, batch 12500, loss[loss=0.3567, simple_loss=0.4377, pruned_loss=0.1378, over 21509.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3376, pruned_loss=0.09933, over 4281290.04 frames. ], batch size: 471, lr: 1.11e-02, grad_scale: 32.0
2023-06-19 23:49:52,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0
limit=12.0 2023-06-19 23:50:39,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=441054.0, ans=0.04949747468305833 2023-06-19 23:50:44,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=441054.0, ans=0.1 2023-06-19 23:50:54,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=441114.0, ans=0.1 2023-06-19 23:51:00,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=441114.0, ans=0.125 2023-06-19 23:51:16,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=441174.0, ans=0.025 2023-06-19 23:51:48,448 INFO [train.py:996] (3/4) Epoch 3, batch 12550, loss[loss=0.2645, simple_loss=0.3385, pruned_loss=0.09526, over 21716.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3421, pruned_loss=0.1017, over 4278201.79 frames. ], batch size: 298, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:52:03,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=441234.0, ans=0.125 2023-06-19 23:52:29,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=441294.0, ans=0.0 2023-06-19 23:52:39,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=441354.0, ans=0.05 2023-06-19 23:53:15,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.900e+02 3.527e+02 4.002e+02 7.299e+02, threshold=7.054e+02, percent-clipped=3.0 2023-06-19 23:53:34,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=441474.0, ans=10.0 2023-06-19 23:53:39,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-19 23:53:56,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=441474.0, ans=0.2 2023-06-19 23:53:58,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=441474.0, ans=0.0 2023-06-19 23:54:00,520 INFO [train.py:996] (3/4) Epoch 3, batch 12600, loss[loss=0.1994, simple_loss=0.2853, pruned_loss=0.05674, over 21771.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3378, pruned_loss=0.09815, over 4268769.44 frames. ], batch size: 282, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:54:21,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=441594.0, ans=0.125 2023-06-19 23:54:27,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=441594.0, ans=0.1 2023-06-19 23:55:49,629 INFO [train.py:996] (3/4) Epoch 3, batch 12650, loss[loss=0.261, simple_loss=0.3239, pruned_loss=0.09901, over 21527.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3287, pruned_loss=0.09284, over 4276539.62 frames. 
], batch size: 131, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:56:38,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=441954.0, ans=0.1 2023-06-19 23:57:05,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 2.355e+02 2.861e+02 3.372e+02 5.218e+02, threshold=5.723e+02, percent-clipped=0.0 2023-06-19 23:57:37,731 INFO [train.py:996] (3/4) Epoch 3, batch 12700, loss[loss=0.3112, simple_loss=0.4158, pruned_loss=0.1033, over 20809.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3292, pruned_loss=0.0953, over 4281655.77 frames. ], batch size: 608, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:58:41,329 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:58:56,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=442314.0, ans=0.125 2023-06-19 23:59:17,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=442374.0, ans=0.0 2023-06-19 23:59:35,039 INFO [train.py:996] (3/4) Epoch 3, batch 12750, loss[loss=0.2138, simple_loss=0.2836, pruned_loss=0.07194, over 16482.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3311, pruned_loss=0.09516, over 4274864.40 frames. ], batch size: 60, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:59:49,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=442434.0, ans=0.05 2023-06-20 00:01:08,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.647e+02 3.084e+02 3.611e+02 5.455e+02, threshold=6.168e+02, percent-clipped=0.0 2023-06-20 00:01:09,863 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=12.0 2023-06-20 00:01:19,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=15.0 2023-06-20 00:01:36,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=442674.0, ans=0.125 2023-06-20 00:01:43,928 INFO [train.py:996] (3/4) Epoch 3, batch 12800, loss[loss=0.2969, simple_loss=0.3813, pruned_loss=0.1063, over 19877.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3313, pruned_loss=0.09635, over 4277634.67 frames. ], batch size: 704, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:01:48,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=442734.0, ans=0.0 2023-06-20 00:02:48,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=442914.0, ans=0.0 2023-06-20 00:03:01,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=442914.0, ans=0.0 2023-06-20 00:03:41,813 INFO [train.py:996] (3/4) Epoch 3, batch 12850, loss[loss=0.2696, simple_loss=0.3181, pruned_loss=0.1105, over 19979.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3343, pruned_loss=0.09812, over 4276091.88 frames. 
], batch size: 703, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:03:57,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=443034.0, ans=0.1 2023-06-20 00:04:53,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=443154.0, ans=0.0 2023-06-20 00:04:54,416 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-20 00:05:19,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.657e+02 3.038e+02 3.599e+02 6.295e+02, threshold=6.075e+02, percent-clipped=1.0 2023-06-20 00:05:47,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-20 00:05:48,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=443334.0, ans=0.0 2023-06-20 00:05:49,487 INFO [train.py:996] (3/4) Epoch 3, batch 12900, loss[loss=0.266, simple_loss=0.3461, pruned_loss=0.09298, over 21694.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3334, pruned_loss=0.09424, over 4273132.11 frames. ], batch size: 351, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:06:57,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=443454.0, ans=0.0 2023-06-20 00:07:13,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=443514.0, ans=0.125 2023-06-20 00:07:54,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=443574.0, ans=0.0 2023-06-20 00:07:58,475 INFO [train.py:996] (3/4) Epoch 3, batch 12950, loss[loss=0.3382, simple_loss=0.3925, pruned_loss=0.142, over 21416.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3317, pruned_loss=0.09274, over 4272199.74 frames. ], batch size: 471, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:08:06,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=443634.0, ans=0.0 2023-06-20 00:08:12,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=443694.0, ans=0.125 2023-06-20 00:08:54,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=443754.0, ans=0.0 2023-06-20 00:09:30,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.524e+02 2.816e+02 3.145e+02 4.812e+02, threshold=5.631e+02, percent-clipped=0.0 2023-06-20 00:09:50,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-20 00:09:57,391 INFO [train.py:996] (3/4) Epoch 3, batch 13000, loss[loss=0.1671, simple_loss=0.2321, pruned_loss=0.051, over 17078.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3315, pruned_loss=0.09168, over 4263852.94 frames. ], batch size: 63, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:10:37,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. 
limit=15.0 2023-06-20 00:10:40,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-06-20 00:11:13,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=444114.0, ans=0.1 2023-06-20 00:11:24,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=444114.0, ans=0.2 2023-06-20 00:12:05,457 INFO [train.py:996] (3/4) Epoch 3, batch 13050, loss[loss=0.2554, simple_loss=0.3139, pruned_loss=0.09844, over 21562.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.328, pruned_loss=0.08981, over 4267540.04 frames. ], batch size: 548, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:12:42,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=444354.0, ans=0.125 2023-06-20 00:13:01,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=444414.0, ans=0.125 2023-06-20 00:13:18,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 2.600e+02 3.229e+02 4.013e+02 6.675e+02, threshold=6.458e+02, percent-clipped=7.0 2023-06-20 00:13:48,104 INFO [train.py:996] (3/4) Epoch 3, batch 13100, loss[loss=0.2867, simple_loss=0.3431, pruned_loss=0.1151, over 21877.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3283, pruned_loss=0.0903, over 4276579.63 frames. ], batch size: 107, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:14:34,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=444594.0, ans=0.035 2023-06-20 00:15:44,213 INFO [train.py:996] (3/4) Epoch 3, batch 13150, loss[loss=0.249, simple_loss=0.3069, pruned_loss=0.09553, over 21773.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3302, pruned_loss=0.09427, over 4274664.54 frames. ], batch size: 124, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:15:56,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.02 vs. limit=22.5 2023-06-20 00:16:22,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=444894.0, ans=15.0 2023-06-20 00:17:04,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=444954.0, ans=0.0 2023-06-20 00:17:17,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=445014.0, ans=0.125 2023-06-20 00:17:23,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.624e+02 3.071e+02 3.652e+02 5.306e+02, threshold=6.141e+02, percent-clipped=0.0 2023-06-20 00:17:48,427 INFO [train.py:996] (3/4) Epoch 3, batch 13200, loss[loss=0.2613, simple_loss=0.3298, pruned_loss=0.0964, over 21974.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3279, pruned_loss=0.09401, over 4280465.92 frames. ], batch size: 317, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:19:16,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=445314.0, ans=0.0 2023-06-20 00:19:56,213 INFO [train.py:996] (3/4) Epoch 3, batch 13250, loss[loss=0.2562, simple_loss=0.3101, pruned_loss=0.1012, over 21450.00 frames. 
], tot_loss[loss=0.2596, simple_loss=0.3284, pruned_loss=0.09544, over 4280473.65 frames. ], batch size: 548, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:21:40,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.563e+02 2.914e+02 3.301e+02 4.888e+02, threshold=5.829e+02, percent-clipped=0.0 2023-06-20 00:21:50,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-20 00:22:03,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445674.0, ans=0.1 2023-06-20 00:22:12,924 INFO [train.py:996] (3/4) Epoch 3, batch 13300, loss[loss=0.2843, simple_loss=0.3511, pruned_loss=0.1087, over 21313.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3323, pruned_loss=0.0962, over 4286468.46 frames. ], batch size: 143, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:22:32,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=445794.0, ans=0.125 2023-06-20 00:22:36,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=445794.0, ans=0.0 2023-06-20 00:22:51,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=445794.0, ans=0.0 2023-06-20 00:22:51,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=445794.0, ans=0.125 2023-06-20 00:24:11,650 INFO [train.py:996] (3/4) Epoch 3, batch 13350, loss[loss=0.2915, simple_loss=0.3601, pruned_loss=0.1115, over 21701.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3365, pruned_loss=0.0987, over 4281768.18 frames. ], batch size: 351, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:24:19,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-20 00:24:20,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446034.0, ans=0.1 2023-06-20 00:25:19,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=446154.0, ans=0.2 2023-06-20 00:25:30,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=446154.0, ans=0.0 2023-06-20 00:25:50,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.814e+02 3.239e+02 3.845e+02 6.415e+02, threshold=6.478e+02, percent-clipped=2.0 2023-06-20 00:26:29,091 INFO [train.py:996] (3/4) Epoch 3, batch 13400, loss[loss=0.2771, simple_loss=0.333, pruned_loss=0.1106, over 21451.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3372, pruned_loss=0.09994, over 4280041.94 frames. ], batch size: 194, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:26:45,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=446394.0, ans=0.125 2023-06-20 00:28:25,119 INFO [train.py:996] (3/4) Epoch 3, batch 13450, loss[loss=0.2468, simple_loss=0.2982, pruned_loss=0.09771, over 21682.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3397, pruned_loss=0.1034, over 4287542.43 frames. 
], batch size: 247, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:29:00,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-20 00:29:18,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=446754.0, ans=0.125 2023-06-20 00:29:20,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=446754.0, ans=0.0 2023-06-20 00:29:58,506 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-20 00:29:59,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.743e+02 3.193e+02 4.040e+02 8.828e+02, threshold=6.385e+02, percent-clipped=3.0 2023-06-20 00:30:34,586 INFO [train.py:996] (3/4) Epoch 3, batch 13500, loss[loss=0.2567, simple_loss=0.3129, pruned_loss=0.1002, over 21253.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3294, pruned_loss=0.09986, over 4284067.26 frames. ], batch size: 159, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:31:38,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447054.0, ans=0.1 2023-06-20 00:32:46,182 INFO [train.py:996] (3/4) Epoch 3, batch 13550, loss[loss=0.2431, simple_loss=0.3171, pruned_loss=0.08454, over 21808.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3331, pruned_loss=0.09828, over 4284408.68 frames. ], batch size: 124, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:33:19,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=447294.0, ans=0.125 2023-06-20 00:33:21,859 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-20 00:33:43,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=447354.0, ans=0.1 2023-06-20 00:34:07,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-06-20 00:34:23,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.814e+02 3.268e+02 3.929e+02 6.587e+02, threshold=6.537e+02, percent-clipped=1.0 2023-06-20 00:34:49,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=447474.0, ans=0.09899494936611666 2023-06-20 00:34:52,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=447474.0, ans=0.125 2023-06-20 00:34:52,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-20 00:34:58,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=447534.0, ans=0.125 2023-06-20 00:34:59,314 INFO [train.py:996] (3/4) Epoch 3, batch 13600, loss[loss=0.305, simple_loss=0.3835, pruned_loss=0.1132, over 20727.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3362, pruned_loss=0.1001, over 4290416.24 frames. 
], batch size: 607, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:35:21,483 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-20 00:35:33,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=447594.0, ans=0.2 2023-06-20 00:36:24,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.31 vs. limit=22.5 2023-06-20 00:36:52,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. limit=6.0 2023-06-20 00:36:56,425 INFO [train.py:996] (3/4) Epoch 3, batch 13650, loss[loss=0.2412, simple_loss=0.2958, pruned_loss=0.09332, over 15460.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.331, pruned_loss=0.09616, over 4270930.64 frames. ], batch size: 62, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:37:37,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=447954.0, ans=0.125 2023-06-20 00:37:37,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447954.0, ans=0.1 2023-06-20 00:37:47,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=447954.0, ans=0.125 2023-06-20 00:38:11,438 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.579e+02 2.975e+02 3.634e+02 4.818e+02, threshold=5.950e+02, percent-clipped=0.0 2023-06-20 00:38:43,055 INFO [train.py:996] (3/4) Epoch 3, batch 13700, loss[loss=0.2417, simple_loss=0.3125, pruned_loss=0.08544, over 21853.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3242, pruned_loss=0.09506, over 4267419.39 frames. ], batch size: 316, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:40:13,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=448314.0, ans=0.0 2023-06-20 00:40:39,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-20 00:40:53,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=448434.0, ans=0.125 2023-06-20 00:40:54,362 INFO [train.py:996] (3/4) Epoch 3, batch 13750, loss[loss=0.2322, simple_loss=0.2982, pruned_loss=0.08314, over 21436.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3193, pruned_loss=0.09289, over 4260821.46 frames. ], batch size: 212, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:41:20,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.61 vs. limit=10.0 2023-06-20 00:42:29,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=448614.0, ans=0.125 2023-06-20 00:42:35,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.755e+02 3.135e+02 3.702e+02 5.925e+02, threshold=6.269e+02, percent-clipped=0.0 2023-06-20 00:42:59,111 INFO [train.py:996] (3/4) Epoch 3, batch 13800, loss[loss=0.2811, simple_loss=0.3802, pruned_loss=0.091, over 21767.00 frames. 
], tot_loss[loss=0.2559, simple_loss=0.3266, pruned_loss=0.09262, over 4258237.96 frames. ], batch size: 332, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:43:28,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-20 00:44:09,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=448914.0, ans=0.125 2023-06-20 00:44:20,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=448914.0, ans=0.07 2023-06-20 00:45:01,402 INFO [train.py:996] (3/4) Epoch 3, batch 13850, loss[loss=0.3798, simple_loss=0.4261, pruned_loss=0.1667, over 21491.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3328, pruned_loss=0.09463, over 4262547.14 frames. ], batch size: 508, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:46:35,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=449214.0, ans=0.0 2023-06-20 00:46:40,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.873e+02 3.393e+02 4.011e+02 6.669e+02, threshold=6.786e+02, percent-clipped=1.0 2023-06-20 00:47:04,960 INFO [train.py:996] (3/4) Epoch 3, batch 13900, loss[loss=0.2886, simple_loss=0.3541, pruned_loss=0.1116, over 21824.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3353, pruned_loss=0.09789, over 4266475.17 frames. ], batch size: 112, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:47:42,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=449394.0, ans=0.125 2023-06-20 00:48:42,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=449574.0, ans=0.125 2023-06-20 00:48:56,882 INFO [train.py:996] (3/4) Epoch 3, batch 13950, loss[loss=0.2896, simple_loss=0.3457, pruned_loss=0.1168, over 21484.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3373, pruned_loss=0.1008, over 4270287.76 frames. ], batch size: 548, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:48:57,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=449634.0, ans=0.125 2023-06-20 00:49:09,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=449634.0, ans=0.125 2023-06-20 00:49:14,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=449634.0, ans=0.0 2023-06-20 00:49:23,204 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.98 vs. limit=6.0 2023-06-20 00:49:44,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=449694.0, ans=0.0 2023-06-20 00:50:39,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.609e+02 3.095e+02 3.648e+02 5.597e+02, threshold=6.190e+02, percent-clipped=0.0 2023-06-20 00:50:56,553 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:50:57,084 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. 
limit=12.0 2023-06-20 00:50:59,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=449874.0, ans=0.0 2023-06-20 00:51:07,767 INFO [train.py:996] (3/4) Epoch 3, batch 14000, loss[loss=0.2116, simple_loss=0.3031, pruned_loss=0.06004, over 21381.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.333, pruned_loss=0.09733, over 4268069.15 frames. ], batch size: 211, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:51:16,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=449934.0, ans=0.125 2023-06-20 00:51:33,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=449994.0, ans=0.05 2023-06-20 00:51:52,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=450054.0, ans=0.125 2023-06-20 00:52:16,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=450054.0, ans=0.125 2023-06-20 00:52:30,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=450114.0, ans=0.125 2023-06-20 00:52:51,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=450174.0, ans=0.125 2023-06-20 00:53:00,095 INFO [train.py:996] (3/4) Epoch 3, batch 14050, loss[loss=0.2423, simple_loss=0.3486, pruned_loss=0.06805, over 19735.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3267, pruned_loss=0.09194, over 4269841.07 frames. ], batch size: 702, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:53:03,431 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:53:05,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.42 vs. limit=10.0 2023-06-20 00:53:34,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=450294.0, ans=0.125 2023-06-20 00:54:17,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-20 00:54:25,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-20 00:54:25,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.76 vs. limit=22.5 2023-06-20 00:54:25,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=450414.0, ans=0.025 2023-06-20 00:54:41,189 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 2.482e+02 3.063e+02 3.822e+02 8.036e+02, threshold=6.126e+02, percent-clipped=3.0 2023-06-20 00:55:01,138 INFO [train.py:996] (3/4) Epoch 3, batch 14100, loss[loss=0.2499, simple_loss=0.3033, pruned_loss=0.09823, over 21557.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3214, pruned_loss=0.09204, over 4272523.35 frames. 
], batch size: 391, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:55:02,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=450534.0, ans=0.125 2023-06-20 00:55:05,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=450534.0, ans=0.05 2023-06-20 00:55:08,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=450534.0, ans=0.0 2023-06-20 00:55:57,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=450654.0, ans=0.125 2023-06-20 00:56:24,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=450714.0, ans=0.125 2023-06-20 00:56:50,638 INFO [train.py:996] (3/4) Epoch 3, batch 14150, loss[loss=0.2707, simple_loss=0.3532, pruned_loss=0.09414, over 21261.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3258, pruned_loss=0.0936, over 4273349.16 frames. ], batch size: 549, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:56:57,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-20 00:57:07,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450894.0, ans=0.1 2023-06-20 00:57:18,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=450894.0, ans=0.125 2023-06-20 00:57:58,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.371e+02 2.924e+02 3.889e+02 6.817e+02, threshold=5.848e+02, percent-clipped=2.0 2023-06-20 00:58:20,972 INFO [train.py:996] (3/4) Epoch 3, batch 14200, loss[loss=0.2473, simple_loss=0.3146, pruned_loss=0.09003, over 21860.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.324, pruned_loss=0.09222, over 4276900.76 frames. ], batch size: 371, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:58:35,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451134.0, ans=0.1 2023-06-20 00:59:04,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=451254.0, ans=0.125 2023-06-20 00:59:17,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=451314.0, ans=0.1 2023-06-20 00:59:24,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=451314.0, ans=0.0 2023-06-20 00:59:59,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-20 01:00:03,131 INFO [train.py:996] (3/4) Epoch 3, batch 14250, loss[loss=0.2039, simple_loss=0.2846, pruned_loss=0.06159, over 21708.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3187, pruned_loss=0.09217, over 4272556.52 frames. ], batch size: 298, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:00:39,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.93 vs. 
limit=22.5 2023-06-20 01:01:36,130 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.661e+02 3.144e+02 3.972e+02 7.330e+02, threshold=6.288e+02, percent-clipped=4.0 2023-06-20 01:01:46,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=451674.0, ans=0.125 2023-06-20 01:01:57,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0 2023-06-20 01:02:01,301 INFO [train.py:996] (3/4) Epoch 3, batch 14300, loss[loss=0.3096, simple_loss=0.397, pruned_loss=0.1111, over 21827.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3201, pruned_loss=0.09198, over 4273144.30 frames. ], batch size: 316, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:02:16,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=451734.0, ans=0.0 2023-06-20 01:03:21,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.35 vs. limit=15.0 2023-06-20 01:03:25,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=451914.0, ans=0.125 2023-06-20 01:03:28,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=451914.0, ans=0.0 2023-06-20 01:03:57,061 INFO [train.py:996] (3/4) Epoch 3, batch 14350, loss[loss=0.2275, simple_loss=0.2957, pruned_loss=0.07963, over 21413.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3238, pruned_loss=0.09141, over 4266964.79 frames. ], batch size: 131, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:03:57,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=452034.0, ans=0.0 2023-06-20 01:05:10,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=452154.0, ans=0.125 2023-06-20 01:05:27,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.766e+02 3.278e+02 4.344e+02 6.485e+02, threshold=6.555e+02, percent-clipped=1.0 2023-06-20 01:05:49,277 INFO [train.py:996] (3/4) Epoch 3, batch 14400, loss[loss=0.2463, simple_loss=0.2998, pruned_loss=0.09643, over 21518.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3216, pruned_loss=0.09175, over 4272867.79 frames. ], batch size: 195, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 01:06:00,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=452334.0, ans=0.95 2023-06-20 01:06:11,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=452394.0, ans=0.1 2023-06-20 01:06:52,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=452454.0, ans=0.2 2023-06-20 01:07:48,690 INFO [train.py:996] (3/4) Epoch 3, batch 14450, loss[loss=0.2394, simple_loss=0.2994, pruned_loss=0.0897, over 21755.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3158, pruned_loss=0.09161, over 4272804.20 frames. 
], batch size: 316, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 01:07:57,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.41 vs. limit=6.0 2023-06-20 01:08:16,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=452694.0, ans=0.125 2023-06-20 01:09:02,998 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.591e+02 2.982e+02 3.373e+02 5.553e+02, threshold=5.963e+02, percent-clipped=0.0 2023-06-20 01:09:04,769 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:09:29,467 INFO [train.py:996] (3/4) Epoch 3, batch 14500, loss[loss=0.2376, simple_loss=0.3143, pruned_loss=0.08045, over 21233.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.314, pruned_loss=0.09138, over 4270045.94 frames. ], batch size: 548, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 01:10:03,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=453054.0, ans=0.0 2023-06-20 01:10:23,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=453054.0, ans=0.0 2023-06-20 01:11:10,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=453174.0, ans=0.0 2023-06-20 01:11:21,638 INFO [train.py:996] (3/4) Epoch 3, batch 14550, loss[loss=0.2669, simple_loss=0.3375, pruned_loss=0.09811, over 21919.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.321, pruned_loss=0.09394, over 4262973.91 frames. ], batch size: 316, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:11:24,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=453234.0, ans=0.125 2023-06-20 01:13:04,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=453414.0, ans=0.0 2023-06-20 01:13:07,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 2.890e+02 3.229e+02 4.188e+02 6.870e+02, threshold=6.458e+02, percent-clipped=5.0 2023-06-20 01:13:39,666 INFO [train.py:996] (3/4) Epoch 3, batch 14600, loss[loss=0.3127, simple_loss=0.3949, pruned_loss=0.1152, over 21666.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3301, pruned_loss=0.09834, over 4268951.19 frames. ], batch size: 414, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:14:55,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=453714.0, ans=0.0 2023-06-20 01:15:07,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=453714.0, ans=0.125 2023-06-20 01:15:23,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=453774.0, ans=0.125 2023-06-20 01:15:27,602 INFO [train.py:996] (3/4) Epoch 3, batch 14650, loss[loss=0.2068, simple_loss=0.2665, pruned_loss=0.07354, over 16148.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.33, pruned_loss=0.09708, over 4270323.02 frames. ], batch size: 61, lr: 1.09e-02, grad_scale: 16.0 2023-06-20 01:16:09,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.31 vs. 
limit=22.5 2023-06-20 01:16:39,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=15.0 2023-06-20 01:17:12,668 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 2.350e+02 2.833e+02 3.404e+02 5.520e+02, threshold=5.666e+02, percent-clipped=0.0 2023-06-20 01:17:17,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=454074.0, ans=0.0 2023-06-20 01:17:26,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=454134.0, ans=0.07 2023-06-20 01:17:27,804 INFO [train.py:996] (3/4) Epoch 3, batch 14700, loss[loss=0.2746, simple_loss=0.3655, pruned_loss=0.09187, over 21310.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.326, pruned_loss=0.09207, over 4262724.34 frames. ], batch size: 548, lr: 1.09e-02, grad_scale: 16.0 2023-06-20 01:17:46,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.86 vs. limit=15.0 2023-06-20 01:18:35,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-20 01:19:29,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.49 vs. limit=15.0 2023-06-20 01:19:29,826 INFO [train.py:996] (3/4) Epoch 3, batch 14750, loss[loss=0.294, simple_loss=0.365, pruned_loss=0.1115, over 21763.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3306, pruned_loss=0.09485, over 4246377.07 frames. ], batch size: 124, lr: 1.09e-02, grad_scale: 16.0 2023-06-20 01:19:48,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=454434.0, ans=0.125 2023-06-20 01:20:19,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=454494.0, ans=0.125 2023-06-20 01:20:23,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=454554.0, ans=0.125 2023-06-20 01:20:27,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-20 01:20:59,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=454614.0, ans=0.1 2023-06-20 01:21:21,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.805e+02 3.277e+02 4.115e+02 8.120e+02, threshold=6.554e+02, percent-clipped=5.0 2023-06-20 01:21:52,218 INFO [train.py:996] (3/4) Epoch 3, batch 14800, loss[loss=0.2533, simple_loss=0.3074, pruned_loss=0.09959, over 21852.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3404, pruned_loss=0.1004, over 4251043.67 frames. ], batch size: 107, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:22:14,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=454734.0, ans=0.0 2023-06-20 01:23:48,743 INFO [train.py:996] (3/4) Epoch 3, batch 14850, loss[loss=0.3607, simple_loss=0.4182, pruned_loss=0.1516, over 21608.00 frames. 
], tot_loss[loss=0.2693, simple_loss=0.3365, pruned_loss=0.101, over 4257481.63 frames. ], batch size: 441, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:24:40,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-20 01:25:05,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455214.0, ans=0.1 2023-06-20 01:25:33,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 2.933e+02 3.352e+02 4.211e+02 7.293e+02, threshold=6.704e+02, percent-clipped=2.0 2023-06-20 01:26:01,556 INFO [train.py:996] (3/4) Epoch 3, batch 14900, loss[loss=0.2764, simple_loss=0.336, pruned_loss=0.1084, over 21444.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3395, pruned_loss=0.1014, over 4256235.75 frames. ], batch size: 194, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:26:19,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=455334.0, ans=0.125 2023-06-20 01:27:37,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=455514.0, ans=22.5 2023-06-20 01:27:51,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=455574.0, ans=0.125 2023-06-20 01:28:13,435 INFO [train.py:996] (3/4) Epoch 3, batch 14950, loss[loss=0.2502, simple_loss=0.3294, pruned_loss=0.08546, over 21628.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3397, pruned_loss=0.1006, over 4261863.04 frames. ], batch size: 263, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:28:27,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=455694.0, ans=0.2 2023-06-20 01:29:29,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=455754.0, ans=0.0 2023-06-20 01:29:52,500 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.902e+02 3.334e+02 3.960e+02 6.522e+02, threshold=6.669e+02, percent-clipped=0.0 2023-06-20 01:30:00,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=455874.0, ans=0.0 2023-06-20 01:30:13,463 INFO [train.py:996] (3/4) Epoch 3, batch 15000, loss[loss=0.2573, simple_loss=0.3221, pruned_loss=0.09631, over 21452.00 frames. ], tot_loss[loss=0.274, simple_loss=0.3424, pruned_loss=0.1028, over 4268033.23 frames. ], batch size: 211, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:30:13,464 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 01:31:04,543 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2678, simple_loss=0.368, pruned_loss=0.08383, over 1796401.00 frames. 2023-06-20 01:31:04,545 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-20 01:31:13,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.61 vs. 
limit=8.0 2023-06-20 01:31:46,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=455994.0, ans=0.125 2023-06-20 01:32:16,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=456114.0, ans=0.2 2023-06-20 01:32:46,008 INFO [train.py:996] (3/4) Epoch 3, batch 15050, loss[loss=0.2672, simple_loss=0.3484, pruned_loss=0.09301, over 19790.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3418, pruned_loss=0.1035, over 4264400.49 frames. ], batch size: 703, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:32:52,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=456234.0, ans=0.125 2023-06-20 01:32:56,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=456234.0, ans=0.0 2023-06-20 01:33:14,087 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:33:28,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456294.0, ans=0.1 2023-06-20 01:34:12,175 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.015e+02 3.596e+02 4.330e+02 8.348e+02, threshold=7.192e+02, percent-clipped=9.0 2023-06-20 01:34:36,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=456474.0, ans=0.0 2023-06-20 01:34:48,982 INFO [train.py:996] (3/4) Epoch 3, batch 15100, loss[loss=0.3043, simple_loss=0.3687, pruned_loss=0.12, over 21590.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3436, pruned_loss=0.1028, over 4264539.39 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:34:56,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=456534.0, ans=0.2 2023-06-20 01:35:39,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=456594.0, ans=0.0 2023-06-20 01:36:10,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=456714.0, ans=0.125 2023-06-20 01:36:19,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=456774.0, ans=0.125 2023-06-20 01:36:19,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-20 01:36:34,668 INFO [train.py:996] (3/4) Epoch 3, batch 15150, loss[loss=0.338, simple_loss=0.4527, pruned_loss=0.1116, over 19754.00 frames. ], tot_loss[loss=0.273, simple_loss=0.34, pruned_loss=0.1029, over 4271039.88 frames. ], batch size: 702, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:37:09,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=456894.0, ans=15.0 2023-06-20 01:37:43,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.56 vs. 
limit=15.0 2023-06-20 01:37:50,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.762e+02 3.255e+02 4.111e+02 7.201e+02, threshold=6.510e+02, percent-clipped=1.0 2023-06-20 01:37:55,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=457074.0, ans=0.125 2023-06-20 01:38:05,365 INFO [train.py:996] (3/4) Epoch 3, batch 15200, loss[loss=0.2128, simple_loss=0.3049, pruned_loss=0.06033, over 21313.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3311, pruned_loss=0.09807, over 4264497.18 frames. ], batch size: 551, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:38:51,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=457194.0, ans=0.125 2023-06-20 01:39:41,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=12.0 2023-06-20 01:40:23,942 INFO [train.py:996] (3/4) Epoch 3, batch 15250, loss[loss=0.2664, simple_loss=0.3183, pruned_loss=0.1073, over 21566.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3259, pruned_loss=0.09685, over 4265074.53 frames. ], batch size: 415, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:40:41,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-20 01:41:23,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=457554.0, ans=0.0 2023-06-20 01:41:50,049 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.712e+02 3.294e+02 3.861e+02 5.748e+02, threshold=6.588e+02, percent-clipped=0.0 2023-06-20 01:42:02,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=457674.0, ans=0.125 2023-06-20 01:42:17,464 INFO [train.py:996] (3/4) Epoch 3, batch 15300, loss[loss=0.3594, simple_loss=0.3903, pruned_loss=0.1643, over 21446.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3301, pruned_loss=0.1006, over 4271348.44 frames. ], batch size: 510, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:42:58,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=457794.0, ans=0.125 2023-06-20 01:43:15,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457854.0, ans=0.1 2023-06-20 01:44:02,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=457974.0, ans=0.125 2023-06-20 01:44:21,161 INFO [train.py:996] (3/4) Epoch 3, batch 15350, loss[loss=0.3118, simple_loss=0.3784, pruned_loss=0.1226, over 21453.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3343, pruned_loss=0.1023, over 4270739.22 frames. 
2023-06-20 01:44:35,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458034.0, ans=0.1
2023-06-20 01:45:31,291 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.699e+02 3.240e+02 4.013e+02 6.691e+02, threshold=6.480e+02, percent-clipped=1.0
2023-06-20 01:45:36,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=458274.0, ans=0.125
2023-06-20 01:45:36,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=458274.0, ans=0.125
2023-06-20 01:45:46,122 INFO [train.py:996] (3/4) Epoch 3, batch 15400, loss[loss=0.2535, simple_loss=0.3231, pruned_loss=0.09193, over 21500.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3332, pruned_loss=0.09967, over 4270916.78 frames. ], batch size: 211, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 01:45:49,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=458334.0, ans=0.125
2023-06-20 01:47:04,773 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 01:47:18,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=458574.0, ans=0.125
2023-06-20 01:47:33,357 INFO [train.py:996] (3/4) Epoch 3, batch 15450, loss[loss=0.2975, simple_loss=0.3874, pruned_loss=0.1037, over 19623.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3312, pruned_loss=0.09861, over 4259886.70 frames. ], batch size: 703, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 01:47:41,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0
2023-06-20 01:47:52,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458634.0, ans=0.1
2023-06-20 01:47:53,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=458634.0, ans=0.125
2023-06-20 01:47:58,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=458694.0, ans=0.2
2023-06-20 01:48:05,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0
2023-06-20 01:48:07,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=458694.0, ans=0.0
2023-06-20 01:48:30,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=458754.0, ans=0.125
2023-06-20 01:48:33,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0
2023-06-20 01:48:35,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=458754.0, ans=0.125
2023-06-20 01:48:37,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=458754.0, ans=0.125
2023-06-20 01:48:37,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=458754.0, ans=0.2
2023-06-20 01:48:49,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458814.0, ans=0.1
2023-06-20 01:48:54,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.616e+02 3.368e+02 4.050e+02 6.110e+02, threshold=6.736e+02, percent-clipped=0.0
2023-06-20 01:49:12,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=458874.0, ans=0.125
2023-06-20 01:49:32,918 INFO [train.py:996] (3/4) Epoch 3, batch 15500, loss[loss=0.3759, simple_loss=0.407, pruned_loss=0.1724, over 21338.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3337, pruned_loss=0.0981, over 4257227.88 frames. ], batch size: 507, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 01:51:26,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=459174.0, ans=0.125
2023-06-20 01:51:29,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5
2023-06-20 01:51:48,342 INFO [train.py:996] (3/4) Epoch 3, batch 15550, loss[loss=0.3033, simple_loss=0.3562, pruned_loss=0.1252, over 21474.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3328, pruned_loss=0.09547, over 4259056.20 frames. ], batch size: 508, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 01:51:55,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0
2023-06-20 01:52:08,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=459294.0, ans=0.125
2023-06-20 01:52:27,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0
2023-06-20 01:53:14,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.749e+02 2.425e+02 2.682e+02 3.354e+02 5.431e+02, threshold=5.364e+02, percent-clipped=0.0
2023-06-20 01:53:29,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=459474.0, ans=0.125
2023-06-20 01:53:35,285 INFO [train.py:996] (3/4) Epoch 3, batch 15600, loss[loss=0.2309, simple_loss=0.2949, pruned_loss=0.08349, over 21311.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3279, pruned_loss=0.09412, over 4237003.24 frames. ], batch size: 160, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 01:53:39,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5
2023-06-20 01:54:01,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.15 vs. limit=15.0
2023-06-20 01:54:13,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=459654.0, ans=0.0
2023-06-20 01:54:53,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=459714.0, ans=0.125
2023-06-20 01:55:34,237 INFO [train.py:996] (3/4) Epoch 3, batch 15650, loss[loss=0.2616, simple_loss=0.3247, pruned_loss=0.09925, over 21439.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3244, pruned_loss=0.09325, over 4244536.82 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 01:55:43,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=459834.0, ans=0.0
2023-06-20 01:55:57,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=459894.0, ans=0.125
2023-06-20 01:56:27,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=459954.0, ans=0.025
2023-06-20 01:56:53,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0
2023-06-20 01:56:54,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=460014.0, ans=0.0
2023-06-20 01:56:57,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.476e+02 2.884e+02 3.400e+02 5.919e+02, threshold=5.768e+02, percent-clipped=1.0
2023-06-20 01:57:12,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=460074.0, ans=0.0
2023-06-20 01:57:13,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=460074.0, ans=0.125
2023-06-20 01:57:17,718 INFO [train.py:996] (3/4) Epoch 3, batch 15700, loss[loss=0.2215, simple_loss=0.2826, pruned_loss=0.08021, over 21808.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3198, pruned_loss=0.09208, over 4247231.64 frames. ], batch size: 112, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 01:57:32,870 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.23 vs. limit=22.5
2023-06-20 01:57:55,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=460194.0, ans=0.0
2023-06-20 01:58:10,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=460254.0, ans=0.0
2023-06-20 01:58:34,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=460314.0, ans=0.125
2023-06-20 01:59:07,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=460374.0, ans=0.125
2023-06-20 01:59:18,034 INFO [train.py:996] (3/4) Epoch 3, batch 15750, loss[loss=0.2465, simple_loss=0.2939, pruned_loss=0.09951, over 15957.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3149, pruned_loss=0.09125, over 4253462.43 frames. ], batch size: 64, lr: 1.09e-02, grad_scale: 32.0
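Each [optim.py:471] entry reports five order statistics (min, 25%, median, 75%, max) of recent gradient norms, plus the clipping threshold in force and how many of the recent batches were clipped. In these records the threshold consistently equals Clipping_scale times the median (e.g. 5.768e+02 = 2.0 * 2.884e+02 just above). A minimal sketch of tracking such statistics; the window size and update cadence are assumptions, not the actual optim.py logic.

    import statistics

    # Sketch of grad-norm statistics like the [optim.py:471] records above.
    # The threshold follows the relation visible in the log,
    # threshold = Clipping_scale * median; the window is an assumption.
    class GradNormStats:
        def __init__(self, clipping_scale=2.0, window=200):
            self.clipping_scale = clipping_scale
            self.window = window
            self.norms = []
            self.clipped = 0

        def update(self, grad_norm):
            """Record one batch's gradient norm; return the clip threshold."""
            self.norms.append(grad_norm)
            del self.norms[:-self.window]
            if len(self.norms) < 2:
                return float("inf")  # not enough history to clip yet
            q1, median, q3 = statistics.quantiles(self.norms, n=4)
            threshold = self.clipping_scale * median
            if grad_norm > threshold:
                self.clipped += 1
            return threshold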
2023-06-20 01:59:41,306 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 02:00:31,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=460614.0, ans=0.125
2023-06-20 02:00:54,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.350e+02 2.605e+02 2.992e+02 4.357e+02, threshold=5.211e+02, percent-clipped=0.0
2023-06-20 02:01:21,481 INFO [train.py:996] (3/4) Epoch 3, batch 15800, loss[loss=0.237, simple_loss=0.293, pruned_loss=0.09048, over 21433.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3105, pruned_loss=0.0913, over 4257451.67 frames. ], batch size: 194, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 02:01:29,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=460734.0, ans=0.0
2023-06-20 02:01:45,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460794.0, ans=0.1
2023-06-20 02:01:57,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=460854.0, ans=0.125
2023-06-20 02:02:03,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=460854.0, ans=0.2
2023-06-20 02:02:53,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=460974.0, ans=0.125
2023-06-20 02:03:13,424 INFO [train.py:996] (3/4) Epoch 3, batch 15850, loss[loss=0.2466, simple_loss=0.3104, pruned_loss=0.09139, over 21796.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3127, pruned_loss=0.09406, over 4254861.75 frames. ], batch size: 124, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 02:03:22,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.10 vs. limit=15.0
2023-06-20 02:03:24,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=461034.0, ans=0.125
2023-06-20 02:03:44,667 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5
2023-06-20 02:04:23,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=461214.0, ans=15.0
2023-06-20 02:04:33,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.753e+02 3.333e+02 4.254e+02 6.443e+02, threshold=6.666e+02, percent-clipped=4.0
2023-06-20 02:04:33,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=461274.0, ans=0.125
2023-06-20 02:04:45,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=461274.0, ans=0.125
2023-06-20 02:04:59,059 INFO [train.py:996] (3/4) Epoch 3, batch 15900, loss[loss=0.2618, simple_loss=0.3413, pruned_loss=0.09108, over 21539.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3104, pruned_loss=0.0935, over 4263555.66 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 02:05:02,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=461334.0, ans=0.125
2023-06-20 02:06:12,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=461514.0, ans=0.95
2023-06-20 02:06:16,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=461514.0, ans=0.0
2023-06-20 02:06:52,743 INFO [train.py:996] (3/4) Epoch 3, batch 15950, loss[loss=0.2086, simple_loss=0.2903, pruned_loss=0.06346, over 21738.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3109, pruned_loss=0.09053, over 4261135.10 frames. ], batch size: 247, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 02:06:57,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=461634.0, ans=0.0
2023-06-20 02:07:00,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=461634.0, ans=0.125
2023-06-20 02:07:42,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=461694.0, ans=0.05
2023-06-20 02:08:30,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.370e+02 2.785e+02 3.587e+02 6.147e+02, threshold=5.571e+02, percent-clipped=0.0
2023-06-20 02:08:51,762 INFO [train.py:996] (3/4) Epoch 3, batch 16000, loss[loss=0.2474, simple_loss=0.3352, pruned_loss=0.07976, over 21672.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3124, pruned_loss=0.08862, over 4255437.04 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 02:09:20,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=461994.0, ans=0.0
2023-06-20 02:09:45,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=462054.0, ans=0.125
2023-06-20 02:09:49,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0
2023-06-20 02:10:21,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=462114.0, ans=0.125
2023-06-20 02:10:51,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=462174.0, ans=0.125
2023-06-20 02:11:01,270 INFO [train.py:996] (3/4) Epoch 3, batch 16050, loss[loss=0.3457, simple_loss=0.426, pruned_loss=0.1328, over 21513.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3162, pruned_loss=0.08635, over 4249749.07 frames. ], batch size: 471, lr: 1.09e-02, grad_scale: 32.0
2023-06-20 02:11:03,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=462234.0, ans=0.0
2023-06-20 02:11:26,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=462294.0, ans=0.1
2023-06-20 02:12:28,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.489e+02 3.002e+02 4.044e+02 8.008e+02, threshold=6.003e+02, percent-clipped=4.0
2023-06-20 02:12:42,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=462474.0, ans=0.0
2023-06-20 02:12:49,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=8.0
2023-06-20 02:12:50,215 INFO [train.py:996] (3/4) Epoch 3, batch 16100, loss[loss=0.293, simple_loss=0.3448, pruned_loss=0.1206, over 21634.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3205, pruned_loss=0.0878, over 4255040.52 frames. ], batch size: 471, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:12:50,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=462534.0, ans=0.2
2023-06-20 02:13:19,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=462594.0, ans=0.1
2023-06-20 02:13:30,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5
2023-06-20 02:13:35,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=462654.0, ans=0.125
2023-06-20 02:14:48,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=462834.0, ans=0.0
2023-06-20 02:14:49,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=462834.0, ans=0.05
2023-06-20 02:14:49,955 INFO [train.py:996] (3/4) Epoch 3, batch 16150, loss[loss=0.2483, simple_loss=0.3343, pruned_loss=0.08116, over 19904.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.322, pruned_loss=0.09108, over 4272184.11 frames. ], batch size: 703, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:15:06,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0
2023-06-20 02:15:16,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0
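The Whitening entries compare a per-module statistic of the activation covariance against a limit ("metric=X vs. limit=Y"). As an illustration only (the actual scaling.py metric may be computed differently), one natural "whiteness" measure is the ratio of the mean squared eigenvalue of the covariance to its squared mean eigenvalue, which equals 1.0 when the covariance is a multiple of the identity and grows as energy concentrates in a few directions:

    import torch

    def whiteness_metric(feats: torch.Tensor) -> float:
        """Hypothetical whiteness proxy for activations of shape (N, C).
        Returns mean(eig^2) / mean(eig)^2 of the covariance: 1.0 for
        perfectly white features, larger when a few directions dominate."""
        feats = feats - feats.mean(dim=0, keepdim=True)
        cov = feats.t() @ feats / feats.shape[0]   # (C, C) covariance
        eigs = torch.linalg.eigvalsh(cov)          # real eigenvalues
        return float((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20))

    # whiteness_metric(torch.randn(10000, 256)) is close to 1.0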
2023-06-20 02:15:29,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=462954.0, ans=0.0
2023-06-20 02:15:56,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=463014.0, ans=0.125
2023-06-20 02:16:12,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463014.0, ans=0.1
2023-06-20 02:16:17,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.540e+02 2.925e+02 3.494e+02 4.930e+02, threshold=5.850e+02, percent-clipped=0.0
2023-06-20 02:16:48,933 INFO [train.py:996] (3/4) Epoch 3, batch 16200, loss[loss=0.3171, simple_loss=0.385, pruned_loss=0.1246, over 21435.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3267, pruned_loss=0.0929, over 4276111.89 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:16:56,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=463134.0, ans=0.0
2023-06-20 02:18:13,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=463314.0, ans=0.125
2023-06-20 02:18:15,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=12.0
2023-06-20 02:18:39,601 INFO [train.py:996] (3/4) Epoch 3, batch 16250, loss[loss=0.2039, simple_loss=0.2728, pruned_loss=0.06752, over 21286.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3247, pruned_loss=0.09239, over 4281091.37 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:18:53,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=463434.0, ans=0.125
2023-06-20 02:19:31,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=463554.0, ans=0.125
2023-06-20 02:19:31,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=463554.0, ans=0.125
2023-06-20 02:19:39,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=463554.0, ans=10.0
2023-06-20 02:20:01,033 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 02:20:01,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=463614.0, ans=0.0
2023-06-20 02:20:02,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=463614.0, ans=0.125
2023-06-20 02:20:10,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.534e+02 3.093e+02 3.878e+02 6.087e+02, threshold=6.186e+02, percent-clipped=1.0
2023-06-20 02:20:30,344 INFO [train.py:996] (3/4) Epoch 3, batch 16300, loss[loss=0.2068, simple_loss=0.2971, pruned_loss=0.0583, over 21716.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3173, pruned_loss=0.08796, over 4276655.60 frames. ], batch size: 351, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:20:37,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=463734.0, ans=0.2
2023-06-20 02:20:38,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=463734.0, ans=15.0
2023-06-20 02:20:46,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=463734.0, ans=0.07
2023-06-20 02:21:01,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=463794.0, ans=0.125
2023-06-20 02:21:06,004 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 02:21:18,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=463854.0, ans=0.125
2023-06-20 02:22:26,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=464034.0, ans=0.125
2023-06-20 02:22:27,464 INFO [train.py:996] (3/4) Epoch 3, batch 16350, loss[loss=0.2867, simple_loss=0.3448, pruned_loss=0.1142, over 21941.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3189, pruned_loss=0.0899, over 4277485.80 frames. ], batch size: 372, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:22:57,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=464034.0, ans=0.125
2023-06-20 02:23:09,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=464094.0, ans=0.05
2023-06-20 02:23:20,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.25 vs. limit=15.0
2023-06-20 02:23:45,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0
2023-06-20 02:24:01,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=464214.0, ans=0.2
2023-06-20 02:24:13,371 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.570e+02 3.121e+02 3.955e+02 6.816e+02, threshold=6.242e+02, percent-clipped=2.0
2023-06-20 02:24:40,017 INFO [train.py:996] (3/4) Epoch 3, batch 16400, loss[loss=0.2166, simple_loss=0.2863, pruned_loss=0.07346, over 21803.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3243, pruned_loss=0.09322, over 4277503.89 frames. ], batch size: 247, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:24:42,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=464334.0, ans=0.0
2023-06-20 02:25:17,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0
2023-06-20 02:26:03,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=464454.0, ans=0.125
2023-06-20 02:26:12,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=464514.0, ans=0.5
2023-06-20 02:26:46,839 INFO [train.py:996] (3/4) Epoch 3, batch 16450, loss[loss=0.2505, simple_loss=0.3146, pruned_loss=0.09313, over 21654.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3233, pruned_loss=0.09327, over 4281746.74 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:28:32,307 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.737e+02 3.058e+02 3.975e+02 7.376e+02, threshold=6.116e+02, percent-clipped=5.0
2023-06-20 02:28:49,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0
2023-06-20 02:28:52,850 INFO [train.py:996] (3/4) Epoch 3, batch 16500, loss[loss=0.1878, simple_loss=0.2496, pruned_loss=0.06307, over 21649.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3216, pruned_loss=0.09353, over 4287185.55 frames. ], batch size: 230, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:29:00,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=464934.0, ans=0.2
2023-06-20 02:29:06,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.15 vs. limit=12.0
2023-06-20 02:29:32,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=465054.0, ans=0.0
2023-06-20 02:30:51,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5
2023-06-20 02:30:51,346 INFO [train.py:996] (3/4) Epoch 3, batch 16550, loss[loss=0.2691, simple_loss=0.3439, pruned_loss=0.09716, over 21843.00 frames. ], tot_loss[loss=0.251, simple_loss=0.32, pruned_loss=0.09098, over 4276669.94 frames. ], batch size: 371, lr: 1.08e-02, grad_scale: 64.0
2023-06-20 02:31:05,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465234.0, ans=0.1
2023-06-20 02:31:24,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=465294.0, ans=0.125
2023-06-20 02:31:24,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.09 vs. limit=22.5
2023-06-20 02:31:48,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0
2023-06-20 02:32:19,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=465414.0, ans=0.2
2023-06-20 02:32:38,357 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.553e+02 2.978e+02 3.771e+02 6.936e+02, threshold=5.956e+02, percent-clipped=3.0
2023-06-20 02:33:16,487 INFO [train.py:996] (3/4) Epoch 3, batch 16600, loss[loss=0.3182, simple_loss=0.4089, pruned_loss=0.1137, over 21729.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3294, pruned_loss=0.09453, over 4267323.73 frames. ], batch size: 351, lr: 1.08e-02, grad_scale: 32.0
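The loss[...] and tot_loss[...] fields split the pruned-transducer objective into its simple (linear-lattice) and pruned (pruned-lattice) terms. The logged totals are consistent with loss = 0.5 * simple_loss + pruned_loss; e.g. 0.5 * 0.2863 + 0.07346 = 0.21661 for the batch 16400 record above, matching the logged loss=0.2166. A sketch of that combination; the scale names and values are assumptions inferred from the logged numbers, not quoted from the training code.

    def combined_loss(simple_loss, pruned_loss,
                      simple_scale=0.5, pruned_scale=1.0):
        # Reproduces the relation visible in the records above:
        # loss = 0.5 * simple_loss + 1.0 * pruned_loss.
        return simple_scale * simple_loss + pruned_scale * pruned_loss

    # batch 16400 record: loss=0.2166, simple_loss=0.2863, pruned_loss=0.07346
    assert abs(combined_loss(0.2863, 0.07346) - 0.2166) < 1e-4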
2023-06-20 02:33:30,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=465534.0, ans=0.0
2023-06-20 02:33:44,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=465594.0, ans=0.0
2023-06-20 02:34:00,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0
2023-06-20 02:34:08,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465654.0, ans=0.1
2023-06-20 02:35:24,720 INFO [train.py:996] (3/4) Epoch 3, batch 16650, loss[loss=0.2635, simple_loss=0.3355, pruned_loss=0.09571, over 21791.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3399, pruned_loss=0.09846, over 4267183.51 frames. ], batch size: 247, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:35:26,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=465834.0, ans=0.0
2023-06-20 02:35:52,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=465894.0, ans=0.0
2023-06-20 02:36:03,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=465894.0, ans=10.0
2023-06-20 02:37:14,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 2.861e+02 3.349e+02 3.962e+02 7.489e+02, threshold=6.698e+02, percent-clipped=1.0
2023-06-20 02:37:23,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0
2023-06-20 02:37:33,541 INFO [train.py:996] (3/4) Epoch 3, batch 16700, loss[loss=0.2621, simple_loss=0.3465, pruned_loss=0.08883, over 21632.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.342, pruned_loss=0.09982, over 4270716.56 frames. ], batch size: 389, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:37:34,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=466134.0, ans=0.125
2023-06-20 02:38:16,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5
2023-06-20 02:38:53,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=466254.0, ans=0.125
2023-06-20 02:38:53,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=466254.0, ans=0.125
2023-06-20 02:38:54,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=466254.0, ans=0.05
2023-06-20 02:39:19,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=466374.0, ans=0.2
2023-06-20 02:39:20,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=466374.0, ans=0.0
2023-06-20 02:39:35,648 INFO [train.py:996] (3/4) Epoch 3, batch 16750, loss[loss=0.3309, simple_loss=0.4084, pruned_loss=0.1267, over 21694.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3444, pruned_loss=0.102, over 4275685.63 frames. ], batch size: 441, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:39:36,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=466434.0, ans=0.0
2023-06-20 02:40:03,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=466434.0, ans=0.125
2023-06-20 02:40:37,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466494.0, ans=0.1
2023-06-20 02:40:44,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=466494.0, ans=0.0
2023-06-20 02:40:49,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0
2023-06-20 02:41:13,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466554.0, ans=0.1
2023-06-20 02:41:44,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.879e+02 3.240e+02 3.697e+02 5.648e+02, threshold=6.479e+02, percent-clipped=0.0
2023-06-20 02:41:57,461 INFO [train.py:996] (3/4) Epoch 3, batch 16800, loss[loss=0.2572, simple_loss=0.321, pruned_loss=0.09675, over 21891.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3483, pruned_loss=0.1026, over 4272739.34 frames. ], batch size: 316, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:42:51,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.43 vs. limit=15.0
2023-06-20 02:43:29,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=466914.0, ans=0.1
2023-06-20 02:43:41,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=466974.0, ans=0.125
2023-06-20 02:43:50,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466974.0, ans=0.1
2023-06-20 02:43:50,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=466974.0, ans=0.07
2023-06-20 02:44:01,333 INFO [train.py:996] (3/4) Epoch 3, batch 16850, loss[loss=0.2668, simple_loss=0.3342, pruned_loss=0.09969, over 21466.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.345, pruned_loss=0.1026, over 4282602.15 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:44:09,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=467034.0, ans=0.125
2023-06-20 02:44:45,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=467154.0, ans=0.1
2023-06-20 02:45:07,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=467214.0, ans=0.125
2023-06-20 02:45:22,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.543e+02 2.978e+02 3.751e+02 8.288e+02, threshold=5.956e+02, percent-clipped=4.0
2023-06-20 02:45:37,909 INFO [train.py:996] (3/4) Epoch 3, batch 16900, loss[loss=0.2791, simple_loss=0.3297, pruned_loss=0.1142, over 20068.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3386, pruned_loss=0.1006, over 4287911.71 frames. ], batch size: 703, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:46:42,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5
2023-06-20 02:46:47,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0
2023-06-20 02:46:47,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0
2023-06-20 02:47:13,024 INFO [train.py:996] (3/4) Epoch 3, batch 16950, loss[loss=0.2328, simple_loss=0.3001, pruned_loss=0.08277, over 21814.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.331, pruned_loss=0.09926, over 4290827.53 frames. ], batch size: 298, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:47:24,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=467634.0, ans=0.0
2023-06-20 02:47:29,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=467634.0, ans=0.125
2023-06-20 02:47:42,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=467694.0, ans=0.0
2023-06-20 02:47:56,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=467694.0, ans=0.05
2023-06-20 02:48:01,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=467754.0, ans=0.125
2023-06-20 02:48:54,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.520e+02 2.812e+02 3.324e+02 6.057e+02, threshold=5.623e+02, percent-clipped=0.0
2023-06-20 02:49:01,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=467874.0, ans=0.125
2023-06-20 02:49:08,234 INFO [train.py:996] (3/4) Epoch 3, batch 17000, loss[loss=0.2523, simple_loss=0.3124, pruned_loss=0.09605, over 21373.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3267, pruned_loss=0.09913, over 4295053.95 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:49:55,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=467994.0, ans=0.125
2023-06-20 02:50:38,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=468114.0, ans=0.0
2023-06-20 02:50:57,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=468174.0, ans=0.0
2023-06-20 02:50:59,627 INFO [train.py:996] (3/4) Epoch 3, batch 17050, loss[loss=0.2854, simple_loss=0.3609, pruned_loss=0.105, over 21847.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3353, pruned_loss=0.1025, over 4299961.81 frames. ], batch size: 351, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:51:01,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=468234.0, ans=0.0
2023-06-20 02:51:22,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=468234.0, ans=0.0
2023-06-20 02:51:24,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=468294.0, ans=0.035
2023-06-20 02:51:57,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=468354.0, ans=0.0
2023-06-20 02:52:15,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=468414.0, ans=0.0
2023-06-20 02:52:22,717 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.663e+02 3.089e+02 3.979e+02 5.737e+02, threshold=6.177e+02, percent-clipped=2.0
2023-06-20 02:52:29,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=468474.0, ans=0.125
2023-06-20 02:52:32,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=468474.0, ans=0.0
2023-06-20 02:52:35,776 INFO [train.py:996] (3/4) Epoch 3, batch 17100, loss[loss=0.3347, simple_loss=0.3667, pruned_loss=0.1514, over 21720.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3341, pruned_loss=0.1026, over 4304517.47 frames. ], batch size: 508, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:53:29,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468654.0, ans=0.1
2023-06-20 02:53:37,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=468714.0, ans=0.0
2023-06-20 02:53:44,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=468714.0, ans=0.2
2023-06-20 02:53:48,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0
2023-06-20 02:53:50,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=468774.0, ans=0.0
2023-06-20 02:53:50,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=468774.0, ans=0.0
2023-06-20 02:54:11,172 INFO [train.py:996] (3/4) Epoch 3, batch 17150, loss[loss=0.2245, simple_loss=0.2996, pruned_loss=0.07466, over 21827.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3298, pruned_loss=0.1015, over 4303518.76 frames. ], batch size: 332, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:54:11,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=468834.0, ans=0.025
2023-06-20 02:54:41,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=468894.0, ans=0.125
2023-06-20 02:54:54,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468894.0, ans=0.1
2023-06-20 02:55:29,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469014.0, ans=0.1
2023-06-20 02:55:58,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0
2023-06-20 02:55:59,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.624e+02 2.921e+02 3.624e+02 5.783e+02, threshold=5.842e+02, percent-clipped=0.0
2023-06-20 02:56:28,762 INFO [train.py:996] (3/4) Epoch 3, batch 17200, loss[loss=0.2822, simple_loss=0.3553, pruned_loss=0.1045, over 21487.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3294, pruned_loss=0.1006, over 4297472.91 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:56:29,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0
2023-06-20 02:56:34,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0
2023-06-20 02:57:28,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=469314.0, ans=0.2
2023-06-20 02:57:52,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=469374.0, ans=0.125
2023-06-20 02:58:00,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=469374.0, ans=0.125
2023-06-20 02:58:07,494 INFO [train.py:996] (3/4) Epoch 3, batch 17250, loss[loss=0.2789, simple_loss=0.3487, pruned_loss=0.1045, over 21670.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3323, pruned_loss=0.1017, over 4289678.12 frames. ], batch size: 351, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 02:58:21,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=469434.0, ans=0.125
2023-06-20 02:58:31,667 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=15.0
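The tot_loss[...] figures are frame-weighted aggregates over many recent batches (the "over N frames" counts hover around 4.3M rather than matching any single batch). Whether the aggregation uses a fixed window or an exponential decay is not visible in the log; the sketch below shows a decayed, frame-weighted running average as one plausible form, with an assumed decay factor.

    # One plausible form of the tot_loss aggregation: an exponentially
    # decayed, frame-weighted average. The decay factor is an assumption.
    class RunningLoss:
        def __init__(self, decay=0.999):
            self.decay = decay
            self.loss_frames = 0.0   # decayed sum of loss * frames
            self.frames = 0.0        # decayed sum of frames

        def update(self, batch_loss, batch_frames):
            self.loss_frames = self.decay * self.loss_frames + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self):
            return self.loss_frames / max(self.frames, 1.0)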
2023-06-20 02:58:35,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=469494.0, ans=0.125
2023-06-20 02:58:57,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=469554.0, ans=0.0
2023-06-20 02:59:25,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=469614.0, ans=0.125
2023-06-20 02:59:32,663 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.790e+02 3.289e+02 4.103e+02 7.188e+02, threshold=6.577e+02, percent-clipped=6.0
2023-06-20 02:59:56,842 INFO [train.py:996] (3/4) Epoch 3, batch 17300, loss[loss=0.2997, simple_loss=0.3765, pruned_loss=0.1114, over 20716.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3416, pruned_loss=0.1051, over 4282550.98 frames. ], batch size: 607, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 03:00:16,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=469794.0, ans=0.0
2023-06-20 03:00:18,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469794.0, ans=0.1
2023-06-20 03:00:18,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=469794.0, ans=0.2
2023-06-20 03:00:25,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469794.0, ans=0.125
2023-06-20 03:00:40,648 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0
2023-06-20 03:00:57,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=469914.0, ans=0.1
2023-06-20 03:01:00,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=469914.0, ans=0.0
2023-06-20 03:01:37,011 INFO [train.py:996] (3/4) Epoch 3, batch 17350, loss[loss=0.215, simple_loss=0.2994, pruned_loss=0.06533, over 21642.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3436, pruned_loss=0.1047, over 4276937.71 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 03:01:39,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.15 vs. limit=22.5
2023-06-20 03:01:44,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=470034.0, ans=0.125
2023-06-20 03:01:50,647 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 03:01:51,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0
2023-06-20 03:02:38,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=470154.0, ans=0.0
2023-06-20 03:02:41,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=470214.0, ans=0.1
2023-06-20 03:03:09,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.914e+02 3.379e+02 4.234e+02 7.402e+02, threshold=6.758e+02, percent-clipped=2.0
2023-06-20 03:03:23,320 INFO [train.py:996] (3/4) Epoch 3, batch 17400, loss[loss=0.2153, simple_loss=0.2826, pruned_loss=0.07395, over 21340.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3385, pruned_loss=0.1007, over 4277012.07 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 03:03:27,329 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=15.0
2023-06-20 03:04:04,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=470394.0, ans=0.125
2023-06-20 03:05:06,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=470514.0, ans=0.125
2023-06-20 03:05:11,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0
2023-06-20 03:05:25,731 INFO [train.py:996] (3/4) Epoch 3, batch 17450, loss[loss=0.2065, simple_loss=0.2952, pruned_loss=0.05887, over 21618.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3322, pruned_loss=0.09637, over 4270756.79 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 03:05:35,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=15.0
2023-06-20 03:05:58,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=470694.0, ans=0.1
2023-06-20 03:06:16,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=470754.0, ans=0.1
2023-06-20 03:06:36,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=470814.0, ans=0.125
2023-06-20 03:06:47,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.738e+02 2.479e+02 3.095e+02 3.749e+02 7.284e+02, threshold=6.191e+02, percent-clipped=2.0
2023-06-20 03:07:06,290 INFO [train.py:996] (3/4) Epoch 3, batch 17500, loss[loss=0.2512, simple_loss=0.3136, pruned_loss=0.09437, over 21438.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3277, pruned_loss=0.09388, over 4270108.75 frames. ], batch size: 194, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 03:07:29,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=470994.0, ans=0.0
2023-06-20 03:07:34,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=470994.0, ans=0.1
2023-06-20 03:08:41,870 INFO [train.py:996] (3/4) Epoch 3, batch 17550, loss[loss=0.2287, simple_loss=0.308, pruned_loss=0.07468, over 21889.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3273, pruned_loss=0.09254, over 4261063.20 frames. ], batch size: 98, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:09:01,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=471294.0, ans=0.0
2023-06-20 03:09:03,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=471294.0, ans=0.07
2023-06-20 03:09:14,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471294.0, ans=0.1
2023-06-20 03:09:40,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.76 vs. limit=10.0
2023-06-20 03:09:54,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 2.408e+02 2.804e+02 3.227e+02 5.442e+02, threshold=5.608e+02, percent-clipped=0.0
2023-06-20 03:10:14,340 INFO [train.py:996] (3/4) Epoch 3, batch 17600, loss[loss=0.2519, simple_loss=0.3378, pruned_loss=0.08301, over 21839.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.329, pruned_loss=0.09221, over 4266562.45 frames. ], batch size: 124, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:10:29,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=471534.0, ans=0.125
2023-06-20 03:10:37,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0
2023-06-20 03:11:04,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=471654.0, ans=0.125
2023-06-20 03:11:04,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=471654.0, ans=0.125
2023-06-20 03:11:15,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=471654.0, ans=0.125
2023-06-20 03:12:01,249 INFO [train.py:996] (3/4) Epoch 3, batch 17650, loss[loss=0.1912, simple_loss=0.2516, pruned_loss=0.06538, over 21674.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3271, pruned_loss=0.09218, over 4273657.32 frames. ], batch size: 247, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:12:59,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=471894.0, ans=0.125
2023-06-20 03:13:19,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=472014.0, ans=0.125
2023-06-20 03:13:22,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472014.0, ans=0.1
2023-06-20 03:13:35,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.510e+02 2.877e+02 3.404e+02 6.188e+02, threshold=5.753e+02, percent-clipped=2.0
2023-06-20 03:13:47,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=472074.0, ans=0.0
2023-06-20 03:13:54,786 INFO [train.py:996] (3/4) Epoch 3, batch 17700, loss[loss=0.2676, simple_loss=0.3464, pruned_loss=0.09441, over 21744.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3238, pruned_loss=0.09019, over 4278433.75 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 32.0
], tot_loss[loss=0.2521, simple_loss=0.3238, pruned_loss=0.09019, over 4278433.75 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:14:21,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=472194.0, ans=0.125 2023-06-20 03:15:38,932 INFO [train.py:996] (3/4) Epoch 3, batch 17750, loss[loss=0.3097, simple_loss=0.3716, pruned_loss=0.1239, over 21246.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3312, pruned_loss=0.09387, over 4274227.95 frames. ], batch size: 143, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:16:35,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.93 vs. limit=10.0 2023-06-20 03:16:54,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=472614.0, ans=0.125 2023-06-20 03:17:03,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.594e+02 2.990e+02 3.478e+02 6.591e+02, threshold=5.980e+02, percent-clipped=3.0 2023-06-20 03:17:10,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=472674.0, ans=0.125 2023-06-20 03:17:27,837 INFO [train.py:996] (3/4) Epoch 3, batch 17800, loss[loss=0.2277, simple_loss=0.3089, pruned_loss=0.0732, over 21739.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3328, pruned_loss=0.09453, over 4276501.73 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:17:31,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472734.0, ans=0.1 2023-06-20 03:17:40,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=472734.0, ans=0.125 2023-06-20 03:17:45,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=472794.0, ans=0.125 2023-06-20 03:17:49,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=472794.0, ans=0.0 2023-06-20 03:17:53,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=472794.0, ans=0.125 2023-06-20 03:17:55,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=472794.0, ans=0.0 2023-06-20 03:18:59,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-20 03:19:05,729 INFO [train.py:996] (3/4) Epoch 3, batch 17850, loss[loss=0.2899, simple_loss=0.3511, pruned_loss=0.1143, over 21603.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3328, pruned_loss=0.09506, over 4274956.12 frames. ], batch size: 389, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:19:11,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=473034.0, ans=0.0 2023-06-20 03:19:18,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.50 vs. 
limit=15.0 2023-06-20 03:19:22,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=473094.0, ans=0.125 2023-06-20 03:19:42,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=473094.0, ans=0.125 2023-06-20 03:19:56,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-20 03:20:42,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.814e+02 3.255e+02 4.435e+02 8.070e+02, threshold=6.511e+02, percent-clipped=5.0 2023-06-20 03:20:48,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-20 03:20:52,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=473274.0, ans=0.125 2023-06-20 03:20:55,849 INFO [train.py:996] (3/4) Epoch 3, batch 17900, loss[loss=0.2629, simple_loss=0.313, pruned_loss=0.1064, over 20133.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3385, pruned_loss=0.09748, over 4277990.43 frames. ], batch size: 702, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:22:11,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-20 03:22:53,610 INFO [train.py:996] (3/4) Epoch 3, batch 17950, loss[loss=0.2272, simple_loss=0.3141, pruned_loss=0.07021, over 21745.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3357, pruned_loss=0.09258, over 4282441.10 frames. ], batch size: 332, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:22:54,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473634.0, ans=0.1 2023-06-20 03:24:07,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473754.0, ans=0.1 2023-06-20 03:24:11,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2023-06-20 03:24:11,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=473814.0, ans=0.0 2023-06-20 03:24:19,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=473814.0, ans=0.2 2023-06-20 03:24:29,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 2.407e+02 3.130e+02 3.720e+02 6.419e+02, threshold=6.259e+02, percent-clipped=0.0 2023-06-20 03:24:39,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-20 03:24:43,119 INFO [train.py:996] (3/4) Epoch 3, batch 18000, loss[loss=0.2218, simple_loss=0.273, pruned_loss=0.08528, over 21212.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3277, pruned_loss=0.09087, over 4278224.19 frames. 
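[Editor's note] The many "ScheduledFloat: name=..., batch_count=..., ans=..." entries log regularization hyperparameters (balancer probabilities, skip rates, dropout rates, bypass scale bounds) whose values are scheduled on the global batch count and sampled into the log as they are used. A minimal sketch of a piecewise-linear schedule of this kind, assuming it interpolates between (batch_count, value) breakpoints; the breakpoints below are illustrative only.

class PiecewiseLinearFloat:
    # A float hyperparameter interpolated on the global batch count;
    # float(obj) yields the kind of value logged as "ans=".
    def __init__(self, *points):
        self.points = sorted(points)  # ((batch_count, value), ...)
        self.batch_count = 0.0

    def __float__(self):
        x = self.batch_count
        if x <= self.points[0][0]:
            return float(self.points[0][1])
        for (x0, y0), (x1, y1) in zip(self.points, self.points[1:]):
            if x <= x1:
                return float(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
        return float(self.points[-1][1])

# e.g. a dropout rate annealed from 0.3 to 0.1 over the first 20k batches:
dropout_p = PiecewiseLinearFloat((0.0, 0.3), (20000.0, 0.1))
dropout_p.batch_count = 473094.0
float(dropout_p)  # -> 0.1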
], batch size: 548, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:24:43,119 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 03:25:43,145 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.5098, 2.3772, 4.3612, 4.3192], device='cuda:3') 2023-06-20 03:25:44,238 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2752, simple_loss=0.3767, pruned_loss=0.08679, over 1796401.00 frames. 2023-06-20 03:25:44,238 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-20 03:26:09,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=473994.0, ans=0.125 2023-06-20 03:26:51,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474114.0, ans=0.1 2023-06-20 03:26:53,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=474114.0, ans=0.0 2023-06-20 03:27:01,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474114.0, ans=0.1 2023-06-20 03:27:25,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=15.0 2023-06-20 03:27:26,968 INFO [train.py:996] (3/4) Epoch 3, batch 18050, loss[loss=0.227, simple_loss=0.2857, pruned_loss=0.08417, over 21780.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3213, pruned_loss=0.08982, over 4277936.32 frames. ], batch size: 371, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:28:17,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=474354.0, ans=0.0 2023-06-20 03:28:18,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=474354.0, ans=0.125 2023-06-20 03:28:23,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=474354.0, ans=0.2 2023-06-20 03:28:37,442 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:28:41,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=474414.0, ans=0.035 2023-06-20 03:28:45,785 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.808e+02 3.337e+02 3.938e+02 6.336e+02, threshold=6.674e+02, percent-clipped=1.0 2023-06-20 03:29:05,301 INFO [train.py:996] (3/4) Epoch 3, batch 18100, loss[loss=0.2352, simple_loss=0.3276, pruned_loss=0.07139, over 21234.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3262, pruned_loss=0.09312, over 4268686.23 frames. ], batch size: 176, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:29:22,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474534.0, ans=0.1 2023-06-20 03:30:43,528 INFO [train.py:996] (3/4) Epoch 3, batch 18150, loss[loss=0.2323, simple_loss=0.3029, pruned_loss=0.08082, over 21192.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3261, pruned_loss=0.09268, over 4267891.20 frames. 
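[Editor's note] The block above is one of the periodic validation passes: training pauses at batch 18000, a sample of attention-weight entropies is dumped (apparently one value per attention head of that layer), the dev loss is computed over the full dev set, and peak CUDA memory is reported. A generic sketch of such a hook follows, assuming a model that returns a summed loss and a frame count; the interfaces are placeholders, not the recipe's API.

import logging
import torch

def attention_entropy(attn_weights):
    # attn_weights: (num_heads, batch, tgt_len, src_len), rows summing to 1.
    # Returns one mean entropy per head, like the tensor dumped above.
    p = attn_weights.clamp(min=1e-20)
    return -(p * p.log()).sum(dim=-1).mean(dim=(1, 2))

@torch.no_grad()
def compute_validation_loss(model, dev_loader, device):
    model.eval()
    tot_loss = tot_frames = 0.0
    for batch in dev_loader:
        loss, num_frames = model(batch)  # placeholder interface
        tot_loss += float(loss)
        tot_frames += num_frames
    model.train()
    logging.info(f"validation: loss={tot_loss / tot_frames:.4f}")
    mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    logging.info(f"Maximum memory allocated so far is {mb}MB")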
], batch size: 549, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:32:04,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.533e+02 2.904e+02 3.319e+02 6.906e+02, threshold=5.807e+02, percent-clipped=1.0 2023-06-20 03:32:06,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=475074.0, ans=0.0 2023-06-20 03:32:25,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=475074.0, ans=0.125 2023-06-20 03:32:29,459 INFO [train.py:996] (3/4) Epoch 3, batch 18200, loss[loss=0.2177, simple_loss=0.2875, pruned_loss=0.07388, over 21354.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3195, pruned_loss=0.09271, over 4254640.49 frames. ], batch size: 144, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:32:29,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=475134.0, ans=0.125 2023-06-20 03:32:55,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=475194.0, ans=0.125 2023-06-20 03:33:31,558 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-20 03:33:43,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=475314.0, ans=0.0 2023-06-20 03:33:47,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=475374.0, ans=0.0 2023-06-20 03:33:58,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=475374.0, ans=0.125 2023-06-20 03:34:01,005 INFO [train.py:996] (3/4) Epoch 3, batch 18250, loss[loss=0.2715, simple_loss=0.3227, pruned_loss=0.1101, over 21720.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3118, pruned_loss=0.08937, over 4257534.30 frames. ], batch size: 389, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:34:42,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=475494.0, ans=0.0 2023-06-20 03:34:46,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=475554.0, ans=0.125 2023-06-20 03:34:49,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=475554.0, ans=0.125 2023-06-20 03:35:00,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-20 03:35:24,974 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 2.363e+02 2.913e+02 3.587e+02 8.946e+02, threshold=5.827e+02, percent-clipped=7.0 2023-06-20 03:35:38,297 INFO [train.py:996] (3/4) Epoch 3, batch 18300, loss[loss=0.2489, simple_loss=0.3204, pruned_loss=0.08869, over 21427.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3101, pruned_loss=0.08953, over 4253986.99 frames. 
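[Editor's note] Each loss[...] / tot_loss[...] record reports the smoothed ("simple") transducer loss and the pruned transducer loss next to their weighted total, and the numbers are consistent with a fixed 0.5 / 1.0 weighting after warm-up: for the batch-18200 entry above, 0.5 x 0.2875 + 0.07388 = 0.2177, exactly the logged loss. tot_loss is the same statistic aggregated over the last few million frames. A sketch of the usual pruned-RNN-T weighting, including a warm-up ramp, follows; the scale values and ramp shape are common defaults used for illustration, not read out of this log.

def combine_transducer_losses(simple_loss, pruned_loss, batch_idx,
                              warm_step=2000, simple_scale=0.5):
    # Weighted sum of the simple (smoothed) and pruned RNN-T losses.
    # Early on, the simple loss carries more weight so training is stable
    # before the pruning bounds are trustworthy; afterwards the weights
    # settle at (simple_scale, 1.0).
    if batch_idx >= warm_step:
        simple_w, pruned_w = simple_scale, 1.0
    else:
        t = batch_idx / warm_step
        simple_w = 1.0 - (1.0 - simple_scale) * t
        pruned_w = 0.1 + 0.9 * t
    return simple_w * simple_loss + pruned_w * pruned_loss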
], batch size: 131, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:35:45,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=475734.0, ans=0.125 2023-06-20 03:36:05,658 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=15.0 2023-06-20 03:36:46,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-20 03:36:50,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=475914.0, ans=0.125 2023-06-20 03:37:14,086 INFO [train.py:996] (3/4) Epoch 3, batch 18350, loss[loss=0.2506, simple_loss=0.3551, pruned_loss=0.07307, over 21389.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.316, pruned_loss=0.08837, over 4246837.22 frames. ], batch size: 194, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:37:43,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=476094.0, ans=0.125 2023-06-20 03:38:33,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=476214.0, ans=0.0 2023-06-20 03:38:44,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.546e+02 3.125e+02 3.968e+02 7.184e+02, threshold=6.249e+02, percent-clipped=4.0 2023-06-20 03:38:49,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=476274.0, ans=0.0 2023-06-20 03:38:54,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=476274.0, ans=10.0 2023-06-20 03:38:58,457 INFO [train.py:996] (3/4) Epoch 3, batch 18400, loss[loss=0.209, simple_loss=0.2808, pruned_loss=0.06866, over 21625.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3139, pruned_loss=0.08865, over 4253484.85 frames. ], batch size: 263, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:39:22,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=476394.0, ans=0.0 2023-06-20 03:40:26,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476574.0, ans=0.1 2023-06-20 03:40:28,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476574.0, ans=0.1 2023-06-20 03:40:33,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=17.95 vs. limit=15.0 2023-06-20 03:40:35,503 INFO [train.py:996] (3/4) Epoch 3, batch 18450, loss[loss=0.2141, simple_loss=0.2916, pruned_loss=0.06828, over 21676.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3103, pruned_loss=0.08423, over 4245961.84 frames. 
], batch size: 298, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:40:43,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=476634.0, ans=0.2 2023-06-20 03:41:32,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=476754.0, ans=0.125 2023-06-20 03:41:57,540 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 2.233e+02 2.788e+02 3.654e+02 6.686e+02, threshold=5.575e+02, percent-clipped=3.0 2023-06-20 03:41:59,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=476874.0, ans=0.0 2023-06-20 03:42:00,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476874.0, ans=0.1 2023-06-20 03:42:20,996 INFO [train.py:996] (3/4) Epoch 3, batch 18500, loss[loss=0.1967, simple_loss=0.2676, pruned_loss=0.06291, over 21506.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.305, pruned_loss=0.08284, over 4233788.57 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:42:54,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476994.0, ans=0.1 2023-06-20 03:43:31,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=477054.0, ans=0.09899494936611666 2023-06-20 03:43:39,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=477114.0, ans=0.125 2023-06-20 03:44:04,052 INFO [train.py:996] (3/4) Epoch 3, batch 18550, loss[loss=0.2244, simple_loss=0.2833, pruned_loss=0.08272, over 21208.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3039, pruned_loss=0.08149, over 4238042.74 frames. ], batch size: 176, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:44:11,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=477234.0, ans=0.125 2023-06-20 03:45:05,627 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:45:23,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=477414.0, ans=10.0 2023-06-20 03:45:30,056 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:45:38,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.443e+02 2.739e+02 3.357e+02 5.521e+02, threshold=5.479e+02, percent-clipped=0.0 2023-06-20 03:45:51,270 INFO [train.py:996] (3/4) Epoch 3, batch 18600, loss[loss=0.2395, simple_loss=0.3091, pruned_loss=0.08497, over 21656.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3045, pruned_loss=0.08318, over 4234519.21 frames. 
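[Editor's note] The "WithLoss: name=..., loss-sum=..." entries report auxiliary penalties attached to intermediate activations (here the self-attention weights); a loss-sum of 0.000e+00 means the penalty is currently inactive. A sketch of the autograd pattern that lets a penalty ride along with an activation, assuming the penalty is simply added into the training objective; the sampling rate and names are illustrative.

import logging
import random
import torch

class AttachLoss(torch.autograd.Function):
    # Returns x unchanged in forward; in backward, feeds a gradient of
    # ones into aux_loss, which is equivalent to adding aux_loss.sum()
    # to the total training loss. Occasionally logs the penalty in the
    # same format as the WithLoss entries above.
    @staticmethod
    def forward(ctx, x, aux_loss, name):
        ctx.shape = aux_loss.shape
        if name is not None and random.random() < 0.01:
            logging.info(
                f"WithLoss: name={name}, loss-sum={aux_loss.sum().item():.3e}")
        return x

    @staticmethod
    def backward(ctx, x_grad):
        ones = torch.ones(ctx.shape, dtype=x_grad.dtype, device=x_grad.device)
        return x_grad, ones, None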
], batch size: 298, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:46:16,723 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:46:40,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=477594.0, ans=0.125 2023-06-20 03:46:59,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=477714.0, ans=0.1 2023-06-20 03:47:02,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-20 03:47:25,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=477774.0, ans=0.05 2023-06-20 03:47:27,322 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:47:33,513 INFO [train.py:996] (3/4) Epoch 3, batch 18650, loss[loss=0.2359, simple_loss=0.2995, pruned_loss=0.08611, over 21804.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3048, pruned_loss=0.08356, over 4236084.20 frames. ], batch size: 317, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:48:25,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=477954.0, ans=0.125 2023-06-20 03:48:51,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.448e+02 2.806e+02 3.441e+02 5.689e+02, threshold=5.611e+02, percent-clipped=1.0 2023-06-20 03:49:03,414 INFO [train.py:996] (3/4) Epoch 3, batch 18700, loss[loss=0.2466, simple_loss=0.3023, pruned_loss=0.09541, over 21887.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3039, pruned_loss=0.08563, over 4242868.33 frames. ], batch size: 316, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:49:08,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=478134.0, ans=0.125 2023-06-20 03:50:17,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=478314.0, ans=0.2 2023-06-20 03:50:18,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5 2023-06-20 03:50:30,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=478374.0, ans=0.0 2023-06-20 03:50:36,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=478374.0, ans=0.09899494936611666 2023-06-20 03:50:39,449 INFO [train.py:996] (3/4) Epoch 3, batch 18750, loss[loss=0.2268, simple_loss=0.2872, pruned_loss=0.08317, over 21542.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.305, pruned_loss=0.0876, over 4243882.42 frames. ], batch size: 212, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:50:43,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. 
limit=15.0 2023-06-20 03:51:30,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=478494.0, ans=0.125 2023-06-20 03:52:11,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.724e+02 3.056e+02 3.518e+02 6.406e+02, threshold=6.113e+02, percent-clipped=0.0 2023-06-20 03:52:24,556 INFO [train.py:996] (3/4) Epoch 3, batch 18800, loss[loss=0.1913, simple_loss=0.2602, pruned_loss=0.06118, over 21767.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3105, pruned_loss=0.08824, over 4254867.81 frames. ], batch size: 118, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:52:41,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=478734.0, ans=0.125 2023-06-20 03:53:30,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=478914.0, ans=0.125 2023-06-20 03:53:48,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.35 vs. limit=10.0 2023-06-20 03:53:48,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=478974.0, ans=0.0 2023-06-20 03:54:08,162 INFO [train.py:996] (3/4) Epoch 3, batch 18850, loss[loss=0.2014, simple_loss=0.2709, pruned_loss=0.06592, over 21611.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3067, pruned_loss=0.08336, over 4253841.07 frames. ], batch size: 247, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:54:21,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=479034.0, ans=0.125 2023-06-20 03:54:29,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-20 03:55:41,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 2.197e+02 2.509e+02 3.227e+02 5.628e+02, threshold=5.018e+02, percent-clipped=1.0 2023-06-20 03:56:01,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=479274.0, ans=0.125 2023-06-20 03:56:10,928 INFO [train.py:996] (3/4) Epoch 3, batch 18900, loss[loss=0.2967, simple_loss=0.336, pruned_loss=0.1287, over 21617.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3041, pruned_loss=0.08423, over 4263483.76 frames. ], batch size: 473, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:56:28,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=479334.0, ans=0.125 2023-06-20 03:56:59,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=479454.0, ans=0.2 2023-06-20 03:57:10,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=479514.0, ans=0.0 2023-06-20 03:57:23,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=479514.0, ans=0.2 2023-06-20 03:57:57,226 INFO [train.py:996] (3/4) Epoch 3, batch 18950, loss[loss=0.277, simple_loss=0.3719, pruned_loss=0.09105, over 21872.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3033, pruned_loss=0.08568, over 4270832.24 frames. 
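[Editor's note] The "Whitening: ... metric=M vs. limit=L" entries track an anisotropy statistic of a module's output covariance against a scheduled limit; when the metric exceeds the limit, a corrective gradient pushes the features back toward being "white". One way to compute such a statistic, equal to 1.0 for perfectly white features and growing with the eigenvalue spread of the channel covariance, assuming this trace-based definition:

import torch

def whitening_metric(x):
    # x: (..., num_channels). Returns mean(eig^2) / mean(eig)^2 of the
    # channel covariance, computed via traces so no eigendecomposition
    # is needed: 1.0 when the covariance is a multiple of the identity,
    # up to num_channels when a single direction dominates.
    x = x.reshape(-1, x.shape[-1]).to(torch.float32)
    x = x - x.mean(dim=0)
    cov = x.t() @ x / x.shape[0]
    c = cov.shape[0]
    return float(c * (cov @ cov).diagonal().sum() / cov.diagonal().sum() ** 2)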
], batch size: 333, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:58:05,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=479634.0, ans=0.0 2023-06-20 03:58:40,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=479754.0, ans=0.125 2023-06-20 03:59:36,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.641e+02 3.073e+02 3.851e+02 7.117e+02, threshold=6.146e+02, percent-clipped=12.0 2023-06-20 03:59:48,124 INFO [train.py:996] (3/4) Epoch 3, batch 19000, loss[loss=0.2794, simple_loss=0.3474, pruned_loss=0.1057, over 21919.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3133, pruned_loss=0.08922, over 4267165.20 frames. ], batch size: 316, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 04:00:19,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=479994.0, ans=0.125 2023-06-20 04:00:40,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=480054.0, ans=0.0 2023-06-20 04:00:53,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=480114.0, ans=0.125 2023-06-20 04:00:53,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=480114.0, ans=15.0 2023-06-20 04:01:06,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=480174.0, ans=0.0 2023-06-20 04:01:35,657 INFO [train.py:996] (3/4) Epoch 3, batch 19050, loss[loss=0.2713, simple_loss=0.3312, pruned_loss=0.1057, over 21817.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3193, pruned_loss=0.09364, over 4273846.43 frames. ], batch size: 112, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:01:54,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480234.0, ans=0.1 2023-06-20 04:02:05,109 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-20 04:02:37,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=480414.0, ans=0.125 2023-06-20 04:03:08,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.699e+02 3.066e+02 3.577e+02 5.949e+02, threshold=6.132e+02, percent-clipped=0.0 2023-06-20 04:03:18,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-20 04:03:36,279 INFO [train.py:996] (3/4) Epoch 3, batch 19100, loss[loss=0.2521, simple_loss=0.3076, pruned_loss=0.09832, over 21248.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3179, pruned_loss=0.09443, over 4279587.78 frames. 
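[Editor's note] The "grad_scale: 32.0" field attached to every loss line appears to be the current dynamic loss scale of mixed-precision (fp16) training: it grows while gradients stay finite and is cut back when overflows occur. A generic torch.cuda.amp sketch of a step that produces such a value; the recipe's own scaler handling may differ, and the model/optimizer interfaces are placeholders.

import torch

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16

def fp16_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)              # placeholder interface
    scaler.scale(loss).backward()        # backprop the scaled loss
    scaler.step(optimizer)               # unscales; skips step on inf/nan
    scaler.update()                      # grow or shrink the scale
    return float(loss), scaler.get_scale()  # latter is logged as grad_scale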
], batch size: 548, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:04:14,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=480594.0, ans=0.125 2023-06-20 04:04:20,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480654.0, ans=0.1 2023-06-20 04:05:39,992 INFO [train.py:996] (3/4) Epoch 3, batch 19150, loss[loss=0.2381, simple_loss=0.31, pruned_loss=0.08311, over 21219.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3195, pruned_loss=0.09459, over 4278873.82 frames. ], batch size: 159, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:05:50,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=480834.0, ans=0.125 2023-06-20 04:06:00,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-20 04:06:13,100 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-20 04:06:15,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=480954.0, ans=0.0 2023-06-20 04:06:29,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=480954.0, ans=0.0 2023-06-20 04:06:41,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=481014.0, ans=0.2 2023-06-20 04:06:43,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481014.0, ans=0.1 2023-06-20 04:07:17,975 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.726e+02 3.068e+02 3.801e+02 8.063e+02, threshold=6.136e+02, percent-clipped=5.0 2023-06-20 04:07:27,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=481074.0, ans=0.125 2023-06-20 04:07:35,518 INFO [train.py:996] (3/4) Epoch 3, batch 19200, loss[loss=0.2639, simple_loss=0.369, pruned_loss=0.07936, over 21728.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3322, pruned_loss=0.09634, over 4271822.49 frames. ], batch size: 332, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:07:40,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=481134.0, ans=0.0 2023-06-20 04:07:45,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=481134.0, ans=0.0 2023-06-20 04:08:34,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=481314.0, ans=0.125 2023-06-20 04:09:14,902 INFO [train.py:996] (3/4) Epoch 3, batch 19250, loss[loss=0.2182, simple_loss=0.2967, pruned_loss=0.06989, over 21366.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3295, pruned_loss=0.08989, over 4279958.32 frames. 
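[Editor's note] The learning rate in these records decays very slowly, moving from 1.07e-02 to 1.06e-02 around batch 19050 rather than dropping in steps, which is consistent with a smooth power-law schedule in both the step and the epoch index. A sketch of one such Eden-style schedule follows; all parameter values here are illustrative placeholders, not taken from this run.

def power_law_lr(base_lr, step, epoch, lr_steps=5000.0, lr_epochs=4.0):
    # Roughly flat while step << lr_steps and epoch << lr_epochs, then
    # each factor decays like (step/lr_steps)**-0.5, (epoch/lr_epochs)**-0.5.
    return (base_lr
            * ((step ** 2 + lr_steps ** 2) / lr_steps ** 2) ** -0.25
            * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)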
], batch size: 194, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:09:15,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=481434.0, ans=0.2 2023-06-20 04:09:21,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=481434.0, ans=0.0 2023-06-20 04:09:23,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-20 04:09:51,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=481554.0, ans=0.0 2023-06-20 04:10:15,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=481614.0, ans=0.125 2023-06-20 04:10:18,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=481674.0, ans=0.2 2023-06-20 04:10:28,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 2.035e+02 2.594e+02 2.988e+02 5.303e+02, threshold=5.187e+02, percent-clipped=0.0 2023-06-20 04:10:50,954 INFO [train.py:996] (3/4) Epoch 3, batch 19300, loss[loss=0.2725, simple_loss=0.3316, pruned_loss=0.1067, over 21904.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3252, pruned_loss=0.08819, over 4283897.09 frames. ], batch size: 107, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:11:11,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=481794.0, ans=0.0 2023-06-20 04:12:12,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=481974.0, ans=0.125 2023-06-20 04:12:29,534 INFO [train.py:996] (3/4) Epoch 3, batch 19350, loss[loss=0.1941, simple_loss=0.2702, pruned_loss=0.05895, over 21473.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3197, pruned_loss=0.08458, over 4286071.32 frames. ], batch size: 195, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:12:40,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=482034.0, ans=10.0 2023-06-20 04:13:12,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=482154.0, ans=0.2 2023-06-20 04:13:46,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.454e+02 2.758e+02 3.131e+02 4.952e+02, threshold=5.516e+02, percent-clipped=0.0 2023-06-20 04:14:04,534 INFO [train.py:996] (3/4) Epoch 3, batch 19400, loss[loss=0.3362, simple_loss=0.3742, pruned_loss=0.1492, over 21733.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3179, pruned_loss=0.08415, over 4286735.11 frames. ], batch size: 508, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:14:11,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=482334.0, ans=0.125 2023-06-20 04:15:09,342 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:15:33,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. 
limit=10.0 2023-06-20 04:15:52,988 INFO [train.py:996] (3/4) Epoch 3, batch 19450, loss[loss=0.2181, simple_loss=0.2723, pruned_loss=0.082, over 21487.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3171, pruned_loss=0.08654, over 4283894.87 frames. ], batch size: 212, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:16:29,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=482754.0, ans=0.1 2023-06-20 04:17:12,629 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.711e+02 3.172e+02 3.955e+02 6.078e+02, threshold=6.345e+02, percent-clipped=2.0 2023-06-20 04:17:30,444 INFO [train.py:996] (3/4) Epoch 3, batch 19500, loss[loss=0.1979, simple_loss=0.2528, pruned_loss=0.07149, over 21159.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3121, pruned_loss=0.08765, over 4279357.82 frames. ], batch size: 159, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:17:49,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-20 04:18:16,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=483054.0, ans=0.0 2023-06-20 04:19:07,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483174.0, ans=0.1 2023-06-20 04:19:17,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483174.0, ans=0.1 2023-06-20 04:19:28,232 INFO [train.py:996] (3/4) Epoch 3, batch 19550, loss[loss=0.2341, simple_loss=0.3231, pruned_loss=0.07251, over 21625.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3079, pruned_loss=0.08614, over 4273008.35 frames. ], batch size: 263, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:19:46,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=483294.0, ans=10.0 2023-06-20 04:20:01,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-20 04:20:37,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=483414.0, ans=0.125 2023-06-20 04:21:03,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.587e+02 3.203e+02 3.916e+02 8.201e+02, threshold=6.407e+02, percent-clipped=3.0 2023-06-20 04:21:11,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=483474.0, ans=0.0 2023-06-20 04:21:15,866 INFO [train.py:996] (3/4) Epoch 3, batch 19600, loss[loss=0.2589, simple_loss=0.3227, pruned_loss=0.09754, over 21748.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3101, pruned_loss=0.08689, over 4274877.27 frames. ], batch size: 441, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:21:29,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.95 vs. 
limit=22.5 2023-06-20 04:21:33,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483594.0, ans=0.1 2023-06-20 04:22:08,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=483654.0, ans=0.2 2023-06-20 04:22:28,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=483714.0, ans=0.0 2023-06-20 04:22:35,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=483714.0, ans=0.1 2023-06-20 04:22:42,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=483774.0, ans=0.0 2023-06-20 04:22:47,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=483774.0, ans=0.125 2023-06-20 04:22:53,023 INFO [train.py:996] (3/4) Epoch 3, batch 19650, loss[loss=0.2808, simple_loss=0.3506, pruned_loss=0.1056, over 21417.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3178, pruned_loss=0.09272, over 4278711.99 frames. ], batch size: 131, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:23:15,961 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-20 04:23:32,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=483894.0, ans=0.125 2023-06-20 04:24:03,247 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-20 04:24:35,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.259e+02 2.819e+02 3.213e+02 4.152e+02 6.048e+02, threshold=6.427e+02, percent-clipped=0.0 2023-06-20 04:24:38,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=484074.0, ans=0.125 2023-06-20 04:25:06,165 INFO [train.py:996] (3/4) Epoch 3, batch 19700, loss[loss=0.2552, simple_loss=0.3247, pruned_loss=0.09283, over 21608.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3221, pruned_loss=0.0943, over 4279820.93 frames. ], batch size: 263, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:26:45,406 INFO [train.py:996] (3/4) Epoch 3, batch 19750, loss[loss=0.2605, simple_loss=0.3532, pruned_loss=0.08394, over 21570.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3313, pruned_loss=0.09494, over 4285074.32 frames. ], batch size: 441, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:26:56,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5 2023-06-20 04:27:15,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-20 04:28:13,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.33 vs. 
limit=15.0 2023-06-20 04:28:26,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.683e+02 3.196e+02 4.002e+02 6.809e+02, threshold=6.392e+02, percent-clipped=1.0 2023-06-20 04:28:38,980 INFO [train.py:996] (3/4) Epoch 3, batch 19800, loss[loss=0.2575, simple_loss=0.3325, pruned_loss=0.09119, over 21521.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3297, pruned_loss=0.09556, over 4285273.95 frames. ], batch size: 471, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:29:58,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=484914.0, ans=0.2 2023-06-20 04:30:21,940 INFO [train.py:996] (3/4) Epoch 3, batch 19850, loss[loss=0.1882, simple_loss=0.2742, pruned_loss=0.05109, over 21599.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3215, pruned_loss=0.08949, over 4281389.24 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:30:40,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.57 vs. limit=15.0 2023-06-20 04:31:12,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=485154.0, ans=0.2 2023-06-20 04:31:33,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=485214.0, ans=0.2 2023-06-20 04:31:34,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=485214.0, ans=0.125 2023-06-20 04:31:41,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=485274.0, ans=0.125 2023-06-20 04:31:41,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.292e+02 2.688e+02 3.533e+02 4.969e+02, threshold=5.376e+02, percent-clipped=0.0 2023-06-20 04:31:44,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-20 04:31:45,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=485274.0, ans=0.125 2023-06-20 04:31:56,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=485274.0, ans=0.125 2023-06-20 04:31:59,296 INFO [train.py:996] (3/4) Epoch 3, batch 19900, loss[loss=0.2192, simple_loss=0.2947, pruned_loss=0.07183, over 21655.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3195, pruned_loss=0.08598, over 4279559.49 frames. 
], batch size: 298, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:32:41,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=485454.0, ans=0.125 2023-06-20 04:32:44,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=485454.0, ans=0.0 2023-06-20 04:33:01,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=485514.0, ans=0.125 2023-06-20 04:33:10,293 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:33:19,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=485574.0, ans=0.2 2023-06-20 04:33:30,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=485574.0, ans=0.125 2023-06-20 04:33:37,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=485574.0, ans=0.0 2023-06-20 04:33:42,961 INFO [train.py:996] (3/4) Epoch 3, batch 19950, loss[loss=0.2181, simple_loss=0.2788, pruned_loss=0.07875, over 19953.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3139, pruned_loss=0.08617, over 4274199.16 frames. ], batch size: 702, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:33:43,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=485634.0, ans=0.125 2023-06-20 04:34:23,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485754.0, ans=0.1 2023-06-20 04:34:48,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=485814.0, ans=0.125 2023-06-20 04:35:02,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.533e+02 3.149e+02 3.836e+02 5.627e+02, threshold=6.299e+02, percent-clipped=1.0 2023-06-20 04:35:19,342 INFO [train.py:996] (3/4) Epoch 3, batch 20000, loss[loss=0.2826, simple_loss=0.3393, pruned_loss=0.113, over 21849.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3153, pruned_loss=0.08739, over 4281198.02 frames. ], batch size: 118, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:35:31,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=485934.0, ans=0.125 2023-06-20 04:35:55,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=486054.0, ans=0.125 2023-06-20 04:35:58,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=486054.0, ans=0.0 2023-06-20 04:36:29,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-20 04:36:34,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486174.0, ans=0.1 2023-06-20 04:36:54,198 INFO [train.py:996] (3/4) Epoch 3, batch 20050, loss[loss=0.2365, simple_loss=0.3048, pruned_loss=0.08416, over 21657.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.318, pruned_loss=0.09077, over 4286972.19 frames. 
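[Editor's note] The balancer entries scattered through this log ("balancer1.prob, ans=0.125", "balancer2.min_positive, ans=0.05", "balancer1.max_abs, ans=10.0") belong to modules that constrain per-channel activation statistics, with the scheduled prob controlling how often the constraint is applied. A minimal sketch of the statistics such a balancer watches and a penalty that grows when they leave their bounds; the soft-sign temperature and the bounds are illustrative, and a real implementation would typically act on gradients rather than add a penalty term.

import torch

def balance_penalty(x, min_positive=0.05, max_positive=0.95, max_abs=10.0):
    # x: (..., num_channels). Penalize channels whose fraction of positive
    # values leaves [min_positive, max_positive] or whose mean absolute
    # value exceeds max_abs.
    flat = x.reshape(-1, x.shape[-1])
    frac_pos = torch.sigmoid(flat * 10.0).mean(dim=0)  # soft count of positives
    mean_abs = flat.abs().mean(dim=0)
    pen = ((min_positive - frac_pos).clamp(min=0)
           + (frac_pos - max_positive).clamp(min=0)
           + (mean_abs - max_abs).clamp(min=0))
    return pen.sum()

In training such a penalty would be applied stochastically, with the probability logged above as balancer*.prob.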
], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:37:03,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=486234.0, ans=0.125 2023-06-20 04:37:06,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486234.0, ans=0.1 2023-06-20 04:37:06,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=486234.0, ans=0.125 2023-06-20 04:37:58,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=486354.0, ans=0.0 2023-06-20 04:38:02,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=486414.0, ans=15.0 2023-06-20 04:38:25,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=486474.0, ans=0.0 2023-06-20 04:38:39,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.657e+02 3.135e+02 3.679e+02 6.652e+02, threshold=6.270e+02, percent-clipped=1.0 2023-06-20 04:38:52,021 INFO [train.py:996] (3/4) Epoch 3, batch 20100, loss[loss=0.2815, simple_loss=0.375, pruned_loss=0.09405, over 20948.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.32, pruned_loss=0.0927, over 4290185.24 frames. ], batch size: 607, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:39:10,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=486534.0, ans=0.125 2023-06-20 04:39:21,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486594.0, ans=0.1 2023-06-20 04:39:23,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=486594.0, ans=0.2 2023-06-20 04:40:08,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=486714.0, ans=0.2 2023-06-20 04:40:16,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=486774.0, ans=0.0 2023-06-20 04:40:31,381 INFO [train.py:996] (3/4) Epoch 3, batch 20150, loss[loss=0.2566, simple_loss=0.3248, pruned_loss=0.0942, over 21574.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3301, pruned_loss=0.09577, over 4289817.19 frames. 
], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:41:00,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=486894.0, ans=0.125 2023-06-20 04:41:01,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=486894.0, ans=0.125 2023-06-20 04:41:02,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=486894.0, ans=0.125 2023-06-20 04:42:11,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=487014.0, ans=0.125 2023-06-20 04:42:14,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=487074.0, ans=0.125 2023-06-20 04:42:17,058 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.160e+02 3.625e+02 4.244e+02 7.181e+02, threshold=7.250e+02, percent-clipped=1.0 2023-06-20 04:42:40,025 INFO [train.py:996] (3/4) Epoch 3, batch 20200, loss[loss=0.2268, simple_loss=0.2874, pruned_loss=0.0831, over 21856.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.335, pruned_loss=0.09878, over 4282945.99 frames. ], batch size: 107, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:42:42,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-20 04:42:51,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=487134.0, ans=0.2 2023-06-20 04:43:38,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=487254.0, ans=0.2 2023-06-20 04:43:38,916 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-20 04:44:28,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=487434.0, ans=0.2 2023-06-20 04:44:29,121 INFO [train.py:996] (3/4) Epoch 3, batch 20250, loss[loss=0.2756, simple_loss=0.3475, pruned_loss=0.1018, over 21407.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.337, pruned_loss=0.09758, over 4281208.23 frames. ], batch size: 548, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:44:39,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-20 04:46:07,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.337e+02 2.707e+02 3.464e+02 5.836e+02, threshold=5.414e+02, percent-clipped=0.0 2023-06-20 04:46:16,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-20 04:46:19,808 INFO [train.py:996] (3/4) Epoch 3, batch 20300, loss[loss=0.2301, simple_loss=0.2763, pruned_loss=0.0919, over 20012.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3336, pruned_loss=0.09401, over 4278423.73 frames. ], batch size: 704, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:46:33,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.88 vs. 
limit=10.0 2023-06-20 04:47:19,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=487914.0, ans=0.0 2023-06-20 04:47:31,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487914.0, ans=0.1 2023-06-20 04:47:47,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487974.0, ans=0.1 2023-06-20 04:47:52,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=487974.0, ans=0.125 2023-06-20 04:47:56,604 INFO [train.py:996] (3/4) Epoch 3, batch 20350, loss[loss=0.2676, simple_loss=0.3386, pruned_loss=0.09828, over 21431.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.333, pruned_loss=0.09382, over 4269937.92 frames. ], batch size: 211, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:48:19,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=488094.0, ans=0.125 2023-06-20 04:48:22,621 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=12.0 2023-06-20 04:48:46,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=488154.0, ans=0.0 2023-06-20 04:49:08,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=488214.0, ans=0.2 2023-06-20 04:49:28,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.591e+02 2.979e+02 3.665e+02 6.122e+02, threshold=5.958e+02, percent-clipped=2.0 2023-06-20 04:49:41,478 INFO [train.py:996] (3/4) Epoch 3, batch 20400, loss[loss=0.1996, simple_loss=0.2755, pruned_loss=0.06186, over 16765.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3345, pruned_loss=0.09629, over 4246269.59 frames. ], batch size: 62, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:49:59,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=488334.0, ans=0.125 2023-06-20 04:50:49,633 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:51:03,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488454.0, ans=0.1 2023-06-20 04:51:21,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2023-06-20 04:51:40,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-06-20 04:51:46,603 INFO [train.py:996] (3/4) Epoch 3, batch 20450, loss[loss=0.2493, simple_loss=0.3157, pruned_loss=0.0915, over 21679.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3361, pruned_loss=0.09875, over 4224540.89 frames. 
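[Editor's note] The *_skip_rate entries (attention_skip_rate, conv_skip_rate, ff2/ff3_skip_rate, bypass.skip_rate) are stochastic-depth-style regularizers: with the scheduled probability, a sublayer's contribution is dropped for the step. By this point in training most of them have been annealed to 0.0 or small values, as the ans= fields show. A minimal sketch of the mechanism, assuming module-level skipping (a real implementation may mask at finer granularity); names are illustrative.

import random
import torch.nn as nn

class SkippableSublayer(nn.Module):
    # Residual sublayer skipped with probability skip_rate during
    # training; skip_rate can be a scheduled value like those logged.
    def __init__(self, sublayer, skip_rate=0.0):
        super().__init__()
        self.sublayer = sublayer
        self.skip_rate = skip_rate

    def forward(self, x):
        if self.training and random.random() < float(self.skip_rate):
            return x                   # drop this sublayer's contribution
        return x + self.sublayer(x)    # normal residual path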
], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:52:07,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=488694.0, ans=0.07 2023-06-20 04:53:02,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488814.0, ans=0.1 2023-06-20 04:53:14,713 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.891e+02 3.490e+02 4.118e+02 8.011e+02, threshold=6.980e+02, percent-clipped=6.0 2023-06-20 04:53:25,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=488934.0, ans=0.0 2023-06-20 04:53:26,656 INFO [train.py:996] (3/4) Epoch 3, batch 20500, loss[loss=0.2293, simple_loss=0.2969, pruned_loss=0.0808, over 21623.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3321, pruned_loss=0.0988, over 4233116.31 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:53:32,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=488934.0, ans=0.2 2023-06-20 04:53:47,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-20 04:55:15,471 INFO [train.py:996] (3/4) Epoch 3, batch 20550, loss[loss=0.3041, simple_loss=0.3758, pruned_loss=0.1163, over 21478.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3263, pruned_loss=0.09707, over 4237635.90 frames. ], batch size: 473, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 04:56:44,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=489474.0, ans=0.0 2023-06-20 04:56:48,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.383e+02 2.780e+02 3.165e+02 5.319e+02, threshold=5.560e+02, percent-clipped=0.0 2023-06-20 04:56:55,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=489474.0, ans=0.125 2023-06-20 04:56:59,052 INFO [train.py:996] (3/4) Epoch 3, batch 20600, loss[loss=0.3446, simple_loss=0.3953, pruned_loss=0.1469, over 21503.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3283, pruned_loss=0.0951, over 4226462.52 frames. ], batch size: 507, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 04:58:06,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=489714.0, ans=0.125 2023-06-20 04:58:44,299 INFO [train.py:996] (3/4) Epoch 3, batch 20650, loss[loss=0.2245, simple_loss=0.2847, pruned_loss=0.08216, over 21873.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3231, pruned_loss=0.09486, over 4231551.73 frames. ], batch size: 98, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 04:58:46,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.81 vs. 
2023-06-20 04:58:53,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489834.0, ans=0.1
2023-06-20 04:59:42,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=490014.0, ans=0.125
2023-06-20 05:00:00,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=490014.0, ans=0.125
2023-06-20 05:00:03,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=490014.0, ans=0.1
2023-06-20 05:00:10,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.534e+02 3.079e+02 3.590e+02 6.191e+02, threshold=6.158e+02, percent-clipped=1.0
2023-06-20 05:00:12,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=490074.0, ans=0.125
2023-06-20 05:00:21,351 INFO [train.py:996] (3/4) Epoch 3, batch 20700, loss[loss=0.2144, simple_loss=0.2914, pruned_loss=0.06865, over 21640.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.316, pruned_loss=0.09156, over 4243319.78 frames. ], batch size: 263, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:00:26,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=490134.0, ans=0.1
2023-06-20 05:00:46,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=490194.0, ans=0.125
2023-06-20 05:01:32,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=490314.0, ans=10.0
2023-06-20 05:01:35,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=490314.0, ans=0.125
2023-06-20 05:01:43,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=490314.0, ans=0.2
2023-06-20 05:02:15,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=490374.0, ans=0.125
2023-06-20 05:02:17,393 INFO [train.py:996] (3/4) Epoch 3, batch 20750, loss[loss=0.2729, simple_loss=0.3577, pruned_loss=0.09411, over 21656.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3169, pruned_loss=0.09044, over 4246289.95 frames. ], batch size: 230, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:02:17,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=490434.0, ans=0.1
2023-06-20 05:03:56,440 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.880e+02 3.382e+02 4.040e+02 6.281e+02, threshold=6.763e+02, percent-clipped=1.0
2023-06-20 05:04:02,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=490674.0, ans=0.0
2023-06-20 05:04:04,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=490674.0, ans=0.0
2023-06-20 05:04:04,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=490674.0, ans=0.0
2023-06-20 05:04:10,639 INFO [train.py:996] (3/4) Epoch 3, batch 20800, loss[loss=0.2567, simple_loss=0.3116, pruned_loss=0.1009, over 21560.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3225, pruned_loss=0.09211, over 4256761.44 frames. ], batch size: 414, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:04:26,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=490734.0, ans=0.125
2023-06-20 05:04:27,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=490734.0, ans=0.1
2023-06-20 05:04:49,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=490794.0, ans=0.125
2023-06-20 05:05:33,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=22.5
2023-06-20 05:05:43,070 INFO [train.py:996] (3/4) Epoch 3, batch 20850, loss[loss=0.1869, simple_loss=0.259, pruned_loss=0.05736, over 21680.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.317, pruned_loss=0.09113, over 4258049.62 frames. ], batch size: 298, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:07:06,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=491274.0, ans=22.5
2023-06-20 05:07:09,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.577e+02 3.141e+02 3.837e+02 5.556e+02, threshold=6.283e+02, percent-clipped=0.0
2023-06-20 05:07:12,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=491274.0, ans=0.0
2023-06-20 05:07:25,946 INFO [train.py:996] (3/4) Epoch 3, batch 20900, loss[loss=0.2684, simple_loss=0.3421, pruned_loss=0.09734, over 21351.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.317, pruned_loss=0.09237, over 4268713.18 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:07:30,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=491334.0, ans=0.125
2023-06-20 05:07:54,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=491394.0, ans=0.2
2023-06-20 05:08:02,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=491394.0, ans=0.0
2023-06-20 05:08:03,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=491394.0, ans=0.0
2023-06-20 05:08:38,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=491514.0, ans=0.0
2023-06-20 05:08:39,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=491574.0, ans=0.0
2023-06-20 05:08:44,774 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0
2023-06-20 05:08:55,355 INFO [train.py:996] (3/4) Epoch 3, batch 20950, loss[loss=0.2018, simple_loss=0.2761, pruned_loss=0.06376, over 21690.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3134, pruned_loss=0.08893, over 4268738.90 frames. ], batch size: 298, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:09:30,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0
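Note: the [scaling.py:182] records print schedule-controlled hyperparameters ("ans") as a function of batch_count. A minimal sketch of one plausible form, a piecewise-linear schedule over batch count, is below; the breakpoints are invented for illustration and the real schedules in scaling.py may differ.

    def scheduled_float(batch_count, points):
        # points: sorted (batch_count, value) breakpoints; clamp at the
        # ends, interpolate linearly in between.
        if batch_count <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        return points[-1][1]

    # e.g. a skip-rate that decayed to 0.0 long before batch ~490k, matching
    # the ff3_skip_rate records above that print ans=0.0:
    assert scheduled_float(491394.0, [(0.0, 0.1), (20000.0, 0.0)]) == 0.0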
2023-06-20 05:10:05,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=491814.0, ans=0.125
2023-06-20 05:10:11,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0
2023-06-20 05:10:18,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=491874.0, ans=0.125
2023-06-20 05:10:19,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.460e+02 2.871e+02 3.245e+02 5.204e+02, threshold=5.741e+02, percent-clipped=0.0
2023-06-20 05:10:29,538 INFO [train.py:996] (3/4) Epoch 3, batch 21000, loss[loss=0.2246, simple_loss=0.2942, pruned_loss=0.07751, over 21263.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3127, pruned_loss=0.08933, over 4274032.03 frames. ], batch size: 143, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:10:29,539 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-20 05:11:23,168 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2766, simple_loss=0.3765, pruned_loss=0.08831, over 1796401.00 frames.
2023-06-20 05:11:23,170 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-20 05:12:26,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=492114.0, ans=0.2
2023-06-20 05:12:31,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=492114.0, ans=0.125
2023-06-20 05:12:59,900 INFO [train.py:996] (3/4) Epoch 3, batch 21050, loss[loss=0.2554, simple_loss=0.3097, pruned_loss=0.1006, over 21242.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.31, pruned_loss=0.0896, over 4281408.46 frames. ], batch size: 176, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:13:33,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=492294.0, ans=0.125
2023-06-20 05:13:45,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5
2023-06-20 05:14:21,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.359e+02 2.713e+02 3.302e+02 4.914e+02, threshold=5.427e+02, percent-clipped=0.0
2023-06-20 05:14:30,220 INFO [train.py:996] (3/4) Epoch 3, batch 21100, loss[loss=0.2315, simple_loss=0.2824, pruned_loss=0.09033, over 21179.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3067, pruned_loss=0.08927, over 4281278.59 frames. ], batch size: 159, lr: 1.05e-02, grad_scale: 16.0
2023-06-20 05:14:37,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=492534.0, ans=0.07
2023-06-20 05:15:07,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=492594.0, ans=0.125
2023-06-20 05:16:26,483 INFO [train.py:996] (3/4) Epoch 3, batch 21150, loss[loss=0.2285, simple_loss=0.2812, pruned_loss=0.08794, over 21686.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3037, pruned_loss=0.08919, over 4271889.46 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 16.0
2023-06-20 05:16:48,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5
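Note: the [scaling.py:962] Whitening records compare a per-module statistic ("metric") against a limit. One plausible reading, offered here only as an assumption about what is measured, is an eigenvalue-dispersion statistic of the channel covariance: it is 1.0 when the activations are perfectly white and grows as a few directions dominate, so "metric=8.10 vs. limit=22.5" above reads as comfortably within bounds. A toy version, not a verified copy of scaling.py:

    import torch

    def whitening_metric(feats: torch.Tensor) -> float:
        # feats: (num_frames, num_channels) activations from one module.
        feats = feats - feats.mean(dim=0)
        cov = feats.T @ feats / feats.shape[0]   # channel covariance
        eigs = torch.linalg.eigvalsh(cov)        # eigenvalues, ascending
        return float((eigs ** 2).mean() / eigs.mean() ** 2)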
2023-06-20 05:16:57,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492894.0, ans=0.1
2023-06-20 05:17:48,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.584e+02 3.052e+02 3.730e+02 6.126e+02, threshold=6.104e+02, percent-clipped=6.0
2023-06-20 05:17:50,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=493074.0, ans=0.125
2023-06-20 05:18:00,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=493074.0, ans=0.125
2023-06-20 05:18:03,240 INFO [train.py:996] (3/4) Epoch 3, batch 21200, loss[loss=0.1951, simple_loss=0.2561, pruned_loss=0.0671, over 21198.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.2993, pruned_loss=0.08788, over 4273119.41 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:18:55,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5
2023-06-20 05:19:08,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=493314.0, ans=0.125
2023-06-20 05:19:39,364 INFO [train.py:996] (3/4) Epoch 3, batch 21250, loss[loss=0.3206, simple_loss=0.388, pruned_loss=0.1266, over 21730.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.2989, pruned_loss=0.08874, over 4257801.20 frames. ], batch size: 415, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:19:39,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=493434.0, ans=0.125
2023-06-20 05:20:28,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=22.5
2023-06-20 05:20:57,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=493614.0, ans=0.0
2023-06-20 05:21:07,058 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.549e+02 3.045e+02 3.638e+02 6.248e+02, threshold=6.090e+02, percent-clipped=1.0
2023-06-20 05:21:13,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493674.0, ans=0.1
2023-06-20 05:21:21,845 INFO [train.py:996] (3/4) Epoch 3, batch 21300, loss[loss=0.2852, simple_loss=0.365, pruned_loss=0.1027, over 19772.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3063, pruned_loss=0.09117, over 4262017.59 frames. ], batch size: 704, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:21:33,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493734.0, ans=0.1
2023-06-20 05:21:37,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=15.0
2023-06-20 05:22:34,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=493914.0, ans=0.125
2023-06-20 05:22:53,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0
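Note: in the records that follow, the grad_scale field halves from 32.0 (batch 21300) to 16.0 (batch 21350) and is back at 32.0 by batch 21600, the signature of dynamic loss scaling in fp16 training: back off when a scaled gradient overflows, grow again after a run of clean steps. A sketch using PyTorch's stock GradScaler; the parameter values here are illustrative, and the recipe may use its own scaler:

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=32.0,      # matches the grad_scale logged here
        backoff_factor=0.5,   # 32.0 -> 16.0 after an inf/nan gradient
        growth_factor=2.0,    # 16.0 -> 32.0 after growth_interval clean steps
        growth_interval=100,  # illustrative
    )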
2023-06-20 05:23:05,084 INFO [train.py:996] (3/4) Epoch 3, batch 21350, loss[loss=0.2326, simple_loss=0.3214, pruned_loss=0.07189, over 21264.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3108, pruned_loss=0.0919, over 4254987.79 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 16.0
2023-06-20 05:23:41,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=494094.0, ans=0.2
2023-06-20 05:24:57,874 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.675e+02 2.570e+02 2.839e+02 3.402e+02 4.557e+02, threshold=5.677e+02, percent-clipped=0.0
2023-06-20 05:25:16,171 INFO [train.py:996] (3/4) Epoch 3, batch 21400, loss[loss=0.2904, simple_loss=0.3527, pruned_loss=0.114, over 21380.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3123, pruned_loss=0.08997, over 4251610.38 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 16.0
2023-06-20 05:25:26,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=494334.0, ans=0.2
2023-06-20 05:26:32,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=494514.0, ans=0.125
2023-06-20 05:26:38,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=494514.0, ans=0.125
2023-06-20 05:26:52,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=494574.0, ans=0.2
2023-06-20 05:27:20,262 INFO [train.py:996] (3/4) Epoch 3, batch 21450, loss[loss=0.254, simple_loss=0.3136, pruned_loss=0.09721, over 21299.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3151, pruned_loss=0.09094, over 4258738.99 frames. ], batch size: 159, lr: 1.05e-02, grad_scale: 16.0
2023-06-20 05:27:25,974 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5
2023-06-20 05:27:32,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=494634.0, ans=0.2
2023-06-20 05:28:03,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=494754.0, ans=0.0
2023-06-20 05:28:05,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=494754.0, ans=0.1
2023-06-20 05:28:09,849 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 05:28:24,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=494874.0, ans=0.05
2023-06-20 05:28:25,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=12.0
2023-06-20 05:28:26,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.09 vs. limit=22.5
2023-06-20 05:28:38,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.516e+02 2.861e+02 3.352e+02 5.412e+02, threshold=5.722e+02, percent-clipped=0.0
2023-06-20 05:28:52,155 INFO [train.py:996] (3/4) Epoch 3, batch 21500, loss[loss=0.2454, simple_loss=0.3027, pruned_loss=0.094, over 21681.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3133, pruned_loss=0.09238, over 4265972.88 frames. ], batch size: 393, lr: 1.05e-02, grad_scale: 16.0
2023-06-20 05:30:29,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=495174.0, ans=0.0
2023-06-20 05:30:42,579 INFO [train.py:996] (3/4) Epoch 3, batch 21550, loss[loss=0.1962, simple_loss=0.2641, pruned_loss=0.06413, over 21150.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3051, pruned_loss=0.08927, over 4268032.54 frames. ], batch size: 176, lr: 1.05e-02, grad_scale: 16.0
2023-06-20 05:30:49,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=15.0
2023-06-20 05:31:11,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=495294.0, ans=0.2
2023-06-20 05:31:27,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=495354.0, ans=0.2
2023-06-20 05:32:24,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.424e+02 3.051e+02 3.633e+02 8.477e+02, threshold=6.101e+02, percent-clipped=5.0
2023-06-20 05:32:25,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=495474.0, ans=0.2
2023-06-20 05:32:38,621 INFO [train.py:996] (3/4) Epoch 3, batch 21600, loss[loss=0.2593, simple_loss=0.3626, pruned_loss=0.07807, over 19660.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3009, pruned_loss=0.08714, over 4269657.07 frames. ], batch size: 703, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:33:08,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=495594.0, ans=0.125
2023-06-20 05:33:11,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0
2023-06-20 05:33:16,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=495654.0, ans=0.0
2023-06-20 05:34:21,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=495774.0, ans=0.05
2023-06-20 05:34:27,995 INFO [train.py:996] (3/4) Epoch 3, batch 21650, loss[loss=0.2159, simple_loss=0.2991, pruned_loss=0.06633, over 21119.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3051, pruned_loss=0.08405, over 4274785.68 frames. ], batch size: 176, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:34:54,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=495894.0, ans=0.0
2023-06-20 05:35:23,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=496014.0, ans=0.2
2023-06-20 05:35:25,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0
2023-06-20 05:35:50,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.412e+02 2.832e+02 3.352e+02 7.209e+02, threshold=5.664e+02, percent-clipped=3.0
2023-06-20 05:35:58,261 INFO [train.py:996] (3/4) Epoch 3, batch 21700, loss[loss=0.2144, simple_loss=0.2904, pruned_loss=0.06914, over 21286.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3066, pruned_loss=0.08336, over 4274123.24 frames. ], batch size: 176, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:36:07,606 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 05:36:57,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=496314.0, ans=0.125
2023-06-20 05:37:12,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=496374.0, ans=0.125
2023-06-20 05:37:14,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=496374.0, ans=0.5
2023-06-20 05:37:33,330 INFO [train.py:996] (3/4) Epoch 3, batch 21750, loss[loss=0.231, simple_loss=0.2839, pruned_loss=0.08904, over 21953.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3026, pruned_loss=0.08359, over 4275825.60 frames. ], batch size: 103, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:37:39,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=496434.0, ans=0.125
2023-06-20 05:37:46,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=496494.0, ans=0.125
2023-06-20 05:38:01,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=496494.0, ans=0.125
2023-06-20 05:38:17,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=496554.0, ans=0.125
2023-06-20 05:39:02,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.434e+02 2.778e+02 3.312e+02 5.034e+02, threshold=5.556e+02, percent-clipped=0.0
2023-06-20 05:39:05,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0
2023-06-20 05:39:10,168 INFO [train.py:996] (3/4) Epoch 3, batch 21800, loss[loss=0.2604, simple_loss=0.3033, pruned_loss=0.1088, over 21232.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3011, pruned_loss=0.08511, over 4279418.20 frames. ], batch size: 471, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:39:21,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=496734.0, ans=0.0
2023-06-20 05:39:36,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0
2023-06-20 05:39:45,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=496854.0, ans=0.125
2023-06-20 05:39:46,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=496854.0, ans=10.0
2023-06-20 05:39:49,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=496854.0, ans=0.0
2023-06-20 05:40:00,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=496854.0, ans=0.125
2023-06-20 05:40:46,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=496974.0, ans=0.125
2023-06-20 05:40:56,362 INFO [train.py:996] (3/4) Epoch 3, batch 21850, loss[loss=0.2306, simple_loss=0.3048, pruned_loss=0.07824, over 21811.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.306, pruned_loss=0.08616, over 4270293.51 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:40:56,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=497034.0, ans=0.5
2023-06-20 05:41:43,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=497094.0, ans=0.0
2023-06-20 05:41:57,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0
2023-06-20 05:42:02,623 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 05:42:40,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.614e+02 3.041e+02 3.805e+02 7.327e+02, threshold=6.083e+02, percent-clipped=2.0
2023-06-20 05:42:48,469 INFO [train.py:996] (3/4) Epoch 3, batch 21900, loss[loss=0.2365, simple_loss=0.2919, pruned_loss=0.09054, over 21756.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3079, pruned_loss=0.08767, over 4260851.99 frames. ], batch size: 316, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:43:25,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.36 vs. limit=6.0
2023-06-20 05:43:44,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=497454.0, ans=0.125
2023-06-20 05:44:33,683 INFO [train.py:996] (3/4) Epoch 3, batch 21950, loss[loss=0.1753, simple_loss=0.2634, pruned_loss=0.0436, over 20841.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3024, pruned_loss=0.08583, over 4263172.52 frames. ], batch size: 608, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:44:34,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=497634.0, ans=0.07
2023-06-20 05:44:54,557 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0
2023-06-20 05:44:58,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=497694.0, ans=0.0
2023-06-20 05:45:19,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=497754.0, ans=0.125
2023-06-20 05:46:13,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 2.252e+02 2.673e+02 3.306e+02 5.194e+02, threshold=5.347e+02, percent-clipped=0.0
2023-06-20 05:46:13,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=497874.0, ans=0.0
2023-06-20 05:46:21,090 INFO [train.py:996] (3/4) Epoch 3, batch 22000, loss[loss=0.2303, simple_loss=0.2883, pruned_loss=0.0862, over 21389.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.2975, pruned_loss=0.08384, over 4256048.45 frames. ], batch size: 131, lr: 1.05e-02, grad_scale: 32.0
2023-06-20 05:46:31,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497934.0, ans=0.1
2023-06-20 05:46:37,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5
2023-06-20 05:46:47,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0
2023-06-20 05:48:03,703 INFO [train.py:996] (3/4) Epoch 3, batch 22050, loss[loss=0.3517, simple_loss=0.4621, pruned_loss=0.1207, over 19848.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3063, pruned_loss=0.08699, over 4253712.49 frames. ], batch size: 702, lr: 1.05e-02, grad_scale: 16.0
2023-06-20 05:50:01,052 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 3.110e+02 3.918e+02 4.887e+02 9.595e+02, threshold=7.836e+02, percent-clipped=17.0
2023-06-20 05:50:07,382 INFO [train.py:996] (3/4) Epoch 3, batch 22100, loss[loss=0.2576, simple_loss=0.3208, pruned_loss=0.09717, over 21950.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3162, pruned_loss=0.0927, over 4257437.92 frames. ], batch size: 316, lr: 1.05e-02, grad_scale: 16.0
2023-06-20 05:50:12,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=498534.0, ans=0.0
2023-06-20 05:50:12,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=498534.0, ans=0.125
2023-06-20 05:50:21,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=498594.0, ans=0.0
2023-06-20 05:50:27,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=12.0
2023-06-20 05:51:50,232 INFO [train.py:996] (3/4) Epoch 3, batch 22150, loss[loss=0.2469, simple_loss=0.3242, pruned_loss=0.08485, over 21901.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3181, pruned_loss=0.09411, over 4268478.21 frames. ], batch size: 124, lr: 1.04e-02, grad_scale: 16.0
2023-06-20 05:51:59,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=498834.0, ans=10.0
2023-06-20 05:52:29,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=498954.0, ans=0.125
2023-06-20 05:52:38,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=498954.0, ans=0.04949747468305833
2023-06-20 05:52:52,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=499014.0, ans=0.125
2023-06-20 05:53:13,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=499014.0, ans=0.125
2023-06-20 05:53:30,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.905e+02 3.355e+02 4.260e+02 6.840e+02, threshold=6.709e+02, percent-clipped=0.0
2023-06-20 05:53:41,751 INFO [train.py:996] (3/4) Epoch 3, batch 22200, loss[loss=0.2349, simple_loss=0.3009, pruned_loss=0.08449, over 21666.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.319, pruned_loss=0.09435, over 4277048.54 frames. ], batch size: 263, lr: 1.04e-02, grad_scale: 16.0
2023-06-20 05:53:54,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=499134.0, ans=0.125
2023-06-20 05:53:57,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=499194.0, ans=0.0
2023-06-20 05:53:59,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=499194.0, ans=0.125
2023-06-20 05:54:03,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499194.0, ans=0.1
2023-06-20 05:54:23,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=499254.0, ans=0.125
2023-06-20 05:54:33,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=499254.0, ans=0.125
2023-06-20 05:54:42,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=499314.0, ans=0.2
2023-06-20 05:54:54,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=499314.0, ans=0.0
2023-06-20 05:55:37,062 INFO [train.py:996] (3/4) Epoch 3, batch 22250, loss[loss=0.287, simple_loss=0.3558, pruned_loss=0.1091, over 21909.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3271, pruned_loss=0.0963, over 4278055.55 frames. ], batch size: 316, lr: 1.04e-02, grad_scale: 16.0
2023-06-20 05:55:38,870 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 05:56:46,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=499674.0, ans=0.0
2023-06-20 05:56:56,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.809e+02 3.391e+02 3.899e+02 6.756e+02, threshold=6.782e+02, percent-clipped=1.0
2023-06-20 05:57:08,135 INFO [train.py:996] (3/4) Epoch 3, batch 22300, loss[loss=0.2539, simple_loss=0.3156, pruned_loss=0.09613, over 21901.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3286, pruned_loss=0.09859, over 4282260.27 frames. ], batch size: 351, lr: 1.04e-02, grad_scale: 16.0
2023-06-20 05:57:29,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=499794.0, ans=0.1
2023-06-20 05:57:31,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=499794.0, ans=0.0
2023-06-20 05:57:38,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5
2023-06-20 05:58:08,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=499914.0, ans=0.0
2023-06-20 05:58:14,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=499914.0, ans=0.025
2023-06-20 05:58:16,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=499914.0, ans=0.2
2023-06-20 05:58:42,885 INFO [train.py:996] (3/4) Epoch 3, batch 22350, loss[loss=0.2289, simple_loss=0.2955, pruned_loss=0.08119, over 21471.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3263, pruned_loss=0.09885, over 4287962.09 frames. ], batch size: 194, lr: 1.04e-02, grad_scale: 16.0
2023-06-20 05:58:43,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=500034.0, ans=0.0
2023-06-20 05:58:49,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=500034.0, ans=0.2
2023-06-20 05:59:03,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500094.0, ans=0.1
2023-06-20 06:00:13,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.473e+02 2.755e+02 3.372e+02 7.896e+02, threshold=5.510e+02, percent-clipped=3.0
2023-06-20 06:00:19,817 INFO [train.py:996] (3/4) Epoch 3, batch 22400, loss[loss=0.2129, simple_loss=0.2886, pruned_loss=0.06856, over 21511.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3209, pruned_loss=0.0936, over 4275080.38 frames. ], batch size: 230, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:00:39,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=500394.0, ans=0.2
2023-06-20 06:00:47,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=500394.0, ans=6.0
2023-06-20 06:01:15,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=500454.0, ans=0.125
2023-06-20 06:02:04,299 INFO [train.py:996] (3/4) Epoch 3, batch 22450, loss[loss=0.2467, simple_loss=0.2976, pruned_loss=0.09797, over 21519.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3158, pruned_loss=0.09315, over 4261606.22 frames. ], batch size: 391, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:02:24,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=500634.0, ans=0.025
2023-06-20 06:02:29,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=500634.0, ans=0.125
2023-06-20 06:03:26,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=500814.0, ans=0.125
2023-06-20 06:03:44,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=500874.0, ans=0.0
2023-06-20 06:03:50,257 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.596e+02 2.879e+02 3.313e+02 5.071e+02, threshold=5.757e+02, percent-clipped=0.0
2023-06-20 06:03:53,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=500874.0, ans=0.125
2023-06-20 06:03:56,320 INFO [train.py:996] (3/4) Epoch 3, batch 22500, loss[loss=0.2298, simple_loss=0.2841, pruned_loss=0.0877, over 21231.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3101, pruned_loss=0.09228, over 4261708.33 frames. ], batch size: 176, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:04:21,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=500994.0, ans=0.125
2023-06-20 06:04:37,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=500994.0, ans=0.05
2023-06-20 06:05:18,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=501174.0, ans=0.125
2023-06-20 06:05:18,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=501174.0, ans=0.0
2023-06-20 06:05:29,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0
2023-06-20 06:05:30,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=501174.0, ans=0.125
2023-06-20 06:05:39,779 INFO [train.py:996] (3/4) Epoch 3, batch 22550, loss[loss=0.2716, simple_loss=0.3294, pruned_loss=0.1069, over 21865.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3165, pruned_loss=0.09315, over 4274100.90 frames. ], batch size: 371, lr: 1.04e-02, grad_scale: 16.0
2023-06-20 06:06:07,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=501234.0, ans=0.125
2023-06-20 06:06:33,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=501294.0, ans=0.0
2023-06-20 06:06:36,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501294.0, ans=0.1
2023-06-20 06:06:56,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=501354.0, ans=0.125
2023-06-20 06:07:02,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=501414.0, ans=0.025
2023-06-20 06:07:09,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0
2023-06-20 06:07:10,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=501414.0, ans=0.0
2023-06-20 06:07:33,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.773e+02 3.379e+02 4.240e+02 8.103e+02, threshold=6.757e+02, percent-clipped=8.0
2023-06-20 06:07:33,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=501474.0, ans=0.125
2023-06-20 06:07:43,148 INFO [train.py:996] (3/4) Epoch 3, batch 22600, loss[loss=0.3327, simple_loss=0.3988, pruned_loss=0.1333, over 21514.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3204, pruned_loss=0.09487, over 4284193.55 frames. ], batch size: 471, lr: 1.04e-02, grad_scale: 16.0
2023-06-20 06:07:59,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=12.0
2023-06-20 06:08:12,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=501594.0, ans=0.0
2023-06-20 06:08:30,972 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 06:08:41,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=501714.0, ans=0.125
2023-06-20 06:08:44,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=501714.0, ans=0.125
2023-06-20 06:08:48,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=501714.0, ans=0.125
2023-06-20 06:09:25,233 INFO [train.py:996] (3/4) Epoch 3, batch 22650, loss[loss=0.2301, simple_loss=0.285, pruned_loss=0.08758, over 21826.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3178, pruned_loss=0.09388, over 4267897.46 frames. ], batch size: 107, lr: 1.04e-02, grad_scale: 16.0
2023-06-20 06:09:25,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=501834.0, ans=0.5
2023-06-20 06:09:34,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=501834.0, ans=0.125
2023-06-20 06:10:08,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=501954.0, ans=0.0
2023-06-20 06:10:24,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=502014.0, ans=0.125
2023-06-20 06:11:08,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.541e+02 2.941e+02 3.391e+02 5.583e+02, threshold=5.883e+02, percent-clipped=0.0
2023-06-20 06:11:18,222 INFO [train.py:996] (3/4) Epoch 3, batch 22700, loss[loss=0.2397, simple_loss=0.2919, pruned_loss=0.09377, over 21653.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3093, pruned_loss=0.09243, over 4271279.17 frames. ], batch size: 282, lr: 1.04e-02, grad_scale: 16.0
2023-06-20 06:11:43,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=502194.0, ans=0.125
2023-06-20 06:11:48,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=502194.0, ans=0.125
2023-06-20 06:12:00,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=502254.0, ans=0.0
2023-06-20 06:12:14,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.34 vs. limit=22.5
2023-06-20 06:12:42,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=502374.0, ans=0.0
2023-06-20 06:12:55,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=502374.0, ans=0.125
2023-06-20 06:13:09,043 INFO [train.py:996] (3/4) Epoch 3, batch 22750, loss[loss=0.2925, simple_loss=0.3461, pruned_loss=0.1195, over 21480.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3114, pruned_loss=0.09527, over 4269440.98 frames. ], batch size: 194, lr: 1.04e-02, grad_scale: 16.0
2023-06-20 06:13:11,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=502434.0, ans=0.0
2023-06-20 06:13:27,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502434.0, ans=0.1
2023-06-20 06:14:03,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502554.0, ans=0.1
2023-06-20 06:14:07,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=502554.0, ans=6.0
2023-06-20 06:14:09,527 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 06:14:12,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=502554.0, ans=0.0
2023-06-20 06:14:21,654 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 06:14:51,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.900e+02 3.287e+02 3.902e+02 7.614e+02, threshold=6.575e+02, percent-clipped=5.0
2023-06-20 06:14:54,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=502674.0, ans=10.0
2023-06-20 06:15:01,374 INFO [train.py:996] (3/4) Epoch 3, batch 22800, loss[loss=0.2348, simple_loss=0.3019, pruned_loss=0.08383, over 21831.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3159, pruned_loss=0.09745, over 4275933.28 frames. ], batch size: 282, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:15:03,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5
2023-06-20 06:15:08,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0
2023-06-20 06:15:38,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=502854.0, ans=0.125
2023-06-20 06:15:43,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=502854.0, ans=0.125
2023-06-20 06:15:51,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=502914.0, ans=0.0
2023-06-20 06:16:04,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=502914.0, ans=0.125
2023-06-20 06:16:32,949 INFO [train.py:996] (3/4) Epoch 3, batch 22850, loss[loss=0.2484, simple_loss=0.306, pruned_loss=0.09538, over 21759.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3118, pruned_loss=0.09586, over 4276096.91 frames. ], batch size: 371, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:16:34,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=503034.0, ans=0.125
2023-06-20 06:16:39,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=503034.0, ans=0.0
2023-06-20 06:16:50,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=503034.0, ans=0.0
2023-06-20 06:17:22,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.33 vs. limit=10.0
2023-06-20 06:17:38,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=503214.0, ans=0.0
2023-06-20 06:17:44,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=503274.0, ans=0.2
2023-06-20 06:17:59,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.770e+02 3.456e+02 4.106e+02 7.202e+02, threshold=6.912e+02, percent-clipped=4.0
2023-06-20 06:18:10,139 INFO [train.py:996] (3/4) Epoch 3, batch 22900, loss[loss=0.2513, simple_loss=0.3611, pruned_loss=0.07075, over 21663.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3154, pruned_loss=0.09481, over 4273232.80 frames. ], batch size: 389, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:18:45,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=503394.0, ans=0.125
2023-06-20 06:18:46,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503394.0, ans=0.1
2023-06-20 06:18:56,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0
2023-06-20 06:19:20,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=503514.0, ans=0.0
2023-06-20 06:19:53,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0
2023-06-20 06:20:01,148 INFO [train.py:996] (3/4) Epoch 3, batch 22950, loss[loss=0.2994, simple_loss=0.4203, pruned_loss=0.08929, over 21630.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3304, pruned_loss=0.09343, over 4272999.94 frames. ], batch size: 414, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:21:16,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=503814.0, ans=0.5
2023-06-20 06:21:55,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=503874.0, ans=0.0
2023-06-20 06:21:57,163 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.454e+02 2.874e+02 3.736e+02 7.174e+02, threshold=5.748e+02, percent-clipped=1.0
2023-06-20 06:22:01,686 INFO [train.py:996] (3/4) Epoch 3, batch 23000, loss[loss=0.2456, simple_loss=0.3148, pruned_loss=0.08823, over 21832.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3293, pruned_loss=0.09121, over 4280219.71 frames. ], batch size: 298, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:22:18,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=22.5
2023-06-20 06:22:36,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=503994.0, ans=0.0
2023-06-20 06:23:44,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.84 vs. limit=10.0
2023-06-20 06:23:48,137 INFO [train.py:996] (3/4) Epoch 3, batch 23050, loss[loss=0.2948, simple_loss=0.3559, pruned_loss=0.1169, over 21943.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3302, pruned_loss=0.09326, over 4282818.08 frames. ], batch size: 372, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:24:29,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=504354.0, ans=0.125
2023-06-20 06:25:04,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0
2023-06-20 06:25:23,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.688e+02 2.976e+02 3.405e+02 5.912e+02, threshold=5.952e+02, percent-clipped=1.0
2023-06-20 06:25:27,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=504534.0, ans=0.0
2023-06-20 06:25:28,345 INFO [train.py:996] (3/4) Epoch 3, batch 23100, loss[loss=0.215, simple_loss=0.2742, pruned_loss=0.07797, over 21412.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3251, pruned_loss=0.09416, over 4286229.96 frames. ], batch size: 211, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:25:33,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0
2023-06-20 06:27:26,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=504774.0, ans=0.0
2023-06-20 06:27:30,099 INFO [train.py:996] (3/4) Epoch 3, batch 23150, loss[loss=0.2779, simple_loss=0.3223, pruned_loss=0.1168, over 21539.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3172, pruned_loss=0.09274, over 4286808.58 frames. ], batch size: 508, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:28:00,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=504894.0, ans=0.125
2023-06-20 06:29:20,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.598e+02 2.935e+02 3.793e+02 5.216e+02, threshold=5.870e+02, percent-clipped=0.0
2023-06-20 06:29:31,069 INFO [train.py:996] (3/4) Epoch 3, batch 23200, loss[loss=0.2102, simple_loss=0.2674, pruned_loss=0.07645, over 19914.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3167, pruned_loss=0.09403, over 4293915.09 frames. ], batch size: 703, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:29:44,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=505194.0, ans=0.125
2023-06-20 06:29:53,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=505194.0, ans=0.125
2023-06-20 06:30:09,254 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 06:31:16,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=505374.0, ans=0.2
2023-06-20 06:31:31,763 INFO [train.py:996] (3/4) Epoch 3, batch 23250, loss[loss=0.244, simple_loss=0.2981, pruned_loss=0.095, over 21045.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3174, pruned_loss=0.09547, over 4290893.71 frames. ], batch size: 607, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:31:38,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0
2023-06-20 06:31:44,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=505434.0, ans=0.125
2023-06-20 06:32:22,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=505554.0, ans=0.04949747468305833
2023-06-20 06:32:43,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0
2023-06-20 06:32:57,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=505614.0, ans=0.0
2023-06-20 06:33:25,527 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.860e+02 3.314e+02 4.094e+02 6.959e+02, threshold=6.628e+02, percent-clipped=4.0
2023-06-20 06:33:30,037 INFO [train.py:996] (3/4) Epoch 3, batch 23300, loss[loss=0.2835, simple_loss=0.3429, pruned_loss=0.112, over 21782.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3255, pruned_loss=0.09714, over 4286715.54 frames. ], batch size: 414, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:34:11,478 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.30 vs. limit=5.0
2023-06-20 06:35:06,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505914.0, ans=0.1
2023-06-20 06:35:30,204 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.40 vs. limit=15.0
2023-06-20 06:35:36,334 INFO [train.py:996] (3/4) Epoch 3, batch 23350, loss[loss=0.1985, simple_loss=0.2788, pruned_loss=0.05914, over 21635.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3304, pruned_loss=0.09573, over 4290252.71 frames. ], batch size: 230, lr: 1.04e-02, grad_scale: 32.0
], batch size: 230, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:35:38,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=506034.0, ans=0.0 2023-06-20 06:35:39,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=506034.0, ans=0.2 2023-06-20 06:36:24,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=506094.0, ans=0.0 2023-06-20 06:36:30,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=506094.0, ans=0.125 2023-06-20 06:37:25,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=506274.0, ans=0.025 2023-06-20 06:37:27,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.76 vs. limit=10.0 2023-06-20 06:37:31,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506274.0, ans=0.1 2023-06-20 06:37:32,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.468e+02 2.769e+02 3.183e+02 5.356e+02, threshold=5.538e+02, percent-clipped=0.0 2023-06-20 06:37:36,525 INFO [train.py:996] (3/4) Epoch 3, batch 23400, loss[loss=0.2721, simple_loss=0.3353, pruned_loss=0.1045, over 21864.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3224, pruned_loss=0.09172, over 4286091.15 frames. ], batch size: 124, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:38:24,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=506394.0, ans=0.125 2023-06-20 06:38:30,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.13 vs. limit=6.0 2023-06-20 06:38:48,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=506454.0, ans=0.2 2023-06-20 06:38:59,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-20 06:39:04,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=506514.0, ans=0.2 2023-06-20 06:39:07,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=12.0 2023-06-20 06:39:22,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506574.0, ans=0.1 2023-06-20 06:39:24,451 INFO [train.py:996] (3/4) Epoch 3, batch 23450, loss[loss=0.2819, simple_loss=0.341, pruned_loss=0.1114, over 21701.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3241, pruned_loss=0.09489, over 4285454.88 frames. 
], batch size: 351, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:40:37,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=506754.0, ans=0.125
2023-06-20 06:41:04,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=506814.0, ans=0.2
2023-06-20 06:41:16,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.915e+02 3.484e+02 4.358e+02 6.885e+02, threshold=6.969e+02, percent-clipped=11.0
2023-06-20 06:41:20,559 INFO [train.py:996] (3/4) Epoch 3, batch 23500, loss[loss=0.2433, simple_loss=0.3116, pruned_loss=0.0875, over 21876.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3242, pruned_loss=0.09662, over 4291747.85 frames. ], batch size: 124, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:41:57,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=506994.0, ans=0.125
2023-06-20 06:42:22,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=507054.0, ans=0.0
2023-06-20 06:43:16,474 INFO [train.py:996] (3/4) Epoch 3, batch 23550, loss[loss=0.2243, simple_loss=0.2792, pruned_loss=0.08468, over 21657.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3195, pruned_loss=0.096, over 4290634.58 frames. ], batch size: 264, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:44:29,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=507414.0, ans=0.0
2023-06-20 06:44:44,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=507474.0, ans=0.125
2023-06-20 06:44:52,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.559e+02 3.076e+02 4.076e+02 7.256e+02, threshold=6.152e+02, percent-clipped=1.0
2023-06-20 06:45:02,174 INFO [train.py:996] (3/4) Epoch 3, batch 23600, loss[loss=0.2434, simple_loss=0.3107, pruned_loss=0.08809, over 21799.00 frames. ], tot_loss[loss=0.255, simple_loss=0.319, pruned_loss=0.09555, over 4282181.08 frames. ], batch size: 351, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:46:05,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=507654.0, ans=0.0
2023-06-20 06:46:37,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=507714.0, ans=0.125
2023-06-20 06:47:23,358 INFO [train.py:996] (3/4) Epoch 3, batch 23650, loss[loss=0.2517, simple_loss=0.3286, pruned_loss=0.08733, over 21302.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3193, pruned_loss=0.09397, over 4278681.69 frames. ], batch size: 548, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:47:53,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=507894.0, ans=0.125
2023-06-20 06:48:00,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=507894.0, ans=0.95
2023-06-20 06:48:07,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0
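The recurring "Whitening" entries from scaling.py report a per-module statistic of the form metric=X vs. limit=Y: the metric summarizes how far the module's output covariance is from isotropic, and the limit is the scheduled bound against which the Zipformer's Whiten module pushes back during training. The sketch below shows one plausible way to compute such a metric, as the mean squared covariance eigenvalue divided by the squared mean eigenvalue (1.0 for perfectly decorrelated, equal-variance channels); this is an illustration of the idea, not a copy of the formula in icefall's scaling.py.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (num_frames, num_channels); channels are split into num_groups
        # groups, matching the num_groups/num_channels fields in the log.
        num_frames, num_channels = x.shape
        assert num_channels % num_groups == 0
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        x = x - x.mean(dim=0, keepdim=True)  # zero-mean each channel
        # per-group covariance matrices: (num_groups, d, d)
        cov = torch.einsum("ngi,ngj->gij", x, x) / num_frames
        eigs = torch.linalg.eigvalsh(cov)    # (num_groups, d), real-valued
        # mean(eig^2) / mean(eig)^2 equals 1.0 iff all eigenvalues are equal
        # (perfectly "white" features) and grows with anisotropy
        metric = (eigs ** 2).mean(dim=1) / eigs.mean(dim=1).clamp(min=1e-20) ** 2
        return metric.mean()

Read this way, the whiten_keys entry just above (metric=2.06 vs. limit=6.0) reports attention keys comfortably inside their limit, while entries whose metric exceeds the limit mark modules the whitening constraint is actively correcting.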
2023-06-20 06:49:20,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.816e+02 3.255e+02 3.884e+02 5.582e+02, threshold=6.510e+02, percent-clipped=0.0
2023-06-20 06:49:25,381 INFO [train.py:996] (3/4) Epoch 3, batch 23700, loss[loss=0.2303, simple_loss=0.2971, pruned_loss=0.08173, over 21388.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3216, pruned_loss=0.09344, over 4280425.50 frames. ], batch size: 159, lr: 1.04e-02, grad_scale: 32.0
2023-06-20 06:49:25,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=508134.0, ans=0.2
2023-06-20 06:50:06,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=508194.0, ans=0.04949747468305833
2023-06-20 06:50:44,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0
2023-06-20 06:50:44,312 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=22.5
2023-06-20 06:50:49,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=508314.0, ans=0.0
2023-06-20 06:50:51,841 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0
2023-06-20 06:51:16,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=508374.0, ans=0.125
2023-06-20 06:51:19,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=508374.0, ans=0.0
2023-06-20 06:51:23,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0
2023-06-20 06:51:38,352 INFO [train.py:996] (3/4) Epoch 3, batch 23750, loss[loss=0.2494, simple_loss=0.337, pruned_loss=0.08093, over 21658.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3255, pruned_loss=0.09433, over 4279263.08 frames. ], batch size: 389, lr: 1.03e-02, grad_scale: 32.0
2023-06-20 06:52:04,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=508494.0, ans=0.125
2023-06-20 06:53:32,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=508674.0, ans=0.125
2023-06-20 06:53:41,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.562e+02 2.929e+02 3.446e+02 6.270e+02, threshold=5.857e+02, percent-clipped=0.0
2023-06-20 06:53:42,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=508674.0, ans=0.125
2023-06-20 06:53:46,593 INFO [train.py:996] (3/4) Epoch 3, batch 23800, loss[loss=0.2609, simple_loss=0.3462, pruned_loss=0.08784, over 21769.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3219, pruned_loss=0.09111, over 4276063.54 frames. ], batch size: 282, lr: 1.03e-02, grad_scale: 32.0
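Each train.py:996 record summarizes one training step of the pruned-transducer objective: loss[...] reports the current batch (with its frame count), tot_loss[...] a running frame-weighted aggregate over recent batches, followed by batch size, learning rate, and the fp16 grad scale. Consistent with the simple_loss_scale of 0.5 in the startup config, the reported loss is the simple (linear-lattice) term at half weight plus the pruned-lattice term, which the records themselves confirm:

    def combined_rnnt_loss(simple_loss: float, pruned_loss: float,
                           simple_loss_scale: float = 0.5) -> float:
        """Combine the two pruned-transducer terms the way these records
        imply; e.g. batch 23500 above: 0.5 * 0.3116 + 0.0875 = 0.2433 = loss.
        (icefall also ramps the weighting during warm-up, omitted here.)"""
        return simple_loss_scale * simple_loss + pruned_loss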
2023-06-20 06:53:59,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=508734.0, ans=0.95
2023-06-20 06:54:12,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508794.0, ans=0.1
2023-06-20 06:54:17,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508794.0, ans=0.1
2023-06-20 06:54:54,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=508854.0, ans=0.125
2023-06-20 06:55:18,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0
2023-06-20 06:55:32,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=508974.0, ans=0.125
2023-06-20 06:55:47,554 INFO [train.py:996] (3/4) Epoch 3, batch 23850, loss[loss=0.2688, simple_loss=0.342, pruned_loss=0.09778, over 21617.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3321, pruned_loss=0.09394, over 4274715.68 frames. ], batch size: 230, lr: 1.03e-02, grad_scale: 32.0
2023-06-20 06:56:23,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=509094.0, ans=0.0
2023-06-20 06:57:17,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.35 vs. limit=10.0
2023-06-20 06:57:30,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0
2023-06-20 06:57:31,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.767e+02 3.308e+02 4.321e+02 7.651e+02, threshold=6.615e+02, percent-clipped=11.0
2023-06-20 06:57:35,366 INFO [train.py:996] (3/4) Epoch 3, batch 23900, loss[loss=0.2753, simple_loss=0.3401, pruned_loss=0.1052, over 21475.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3396, pruned_loss=0.09718, over 4277750.18 frames. ], batch size: 389, lr: 1.03e-02, grad_scale: 32.0
2023-06-20 06:58:10,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=509394.0, ans=0.035
2023-06-20 06:58:33,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=509454.0, ans=0.09899494936611666
2023-06-20 06:58:43,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=509514.0, ans=0.2
2023-06-20 06:58:44,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509514.0, ans=0.1
2023-06-20 06:59:19,417 INFO [train.py:996] (3/4) Epoch 3, batch 23950, loss[loss=0.2878, simple_loss=0.3415, pruned_loss=0.1171, over 21556.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3331, pruned_loss=0.09688, over 4271663.22 frames. ], batch size: 414, lr: 1.03e-02, grad_scale: 32.0
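The scaling.py:182 "ScheduledFloat" entries record hyperparameters that are functions of batch_count rather than constants: skip rates, bypass scale floors, and dropout probabilities are annealed as training progresses, which is why the same name reappears with different ans values. A sketch of a piecewise-linear schedule of this kind follows; the breakpoint values in the example are illustrative, not read from the recipe.

    from typing import List, Tuple

    def scheduled_float(batch_count: float,
                        schedule: List[Tuple[float, float]]) -> float:
        """Piecewise-linear interpolation over sorted (batch_count, value)
        breakpoints, clamped at both ends."""
        if batch_count <= schedule[0][0]:
            return schedule[0][1]
        if batch_count >= schedule[-1][0]:
            return schedule[-1][1]
        for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
            if x0 <= batch_count <= x1:
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
        raise ValueError("schedule breakpoints must be sorted")

    # e.g. a skip rate that starts at 0.5 and decays to 0.0 by batch 4000
    # has long since flattened out at the batch counts logged here:
    assert scheduled_float(509514.0, [(0.0, 0.5), (4000.0, 0.0)]) == 0.0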
2023-06-20 06:59:21,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=509634.0, ans=0.0
2023-06-20 07:00:18,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=509754.0, ans=10.0
2023-06-20 07:00:26,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=509814.0, ans=0.95
2023-06-20 07:00:36,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=509814.0, ans=0.035
2023-06-20 07:00:51,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.718e+02 3.082e+02 3.773e+02 7.054e+02, threshold=6.164e+02, percent-clipped=1.0
2023-06-20 07:00:55,597 INFO [train.py:996] (3/4) Epoch 3, batch 24000, loss[loss=0.2678, simple_loss=0.3327, pruned_loss=0.1014, over 21653.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3346, pruned_loss=0.09975, over 4266440.05 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0
2023-06-20 07:00:55,598 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-20 07:01:58,978 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.4633, 5.6055, 5.3254, 5.0435], device='cuda:3')
2023-06-20 07:02:00,873 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2795, simple_loss=0.3782, pruned_loss=0.09043, over 1796401.00 frames.
2023-06-20 07:02:00,874 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-20 07:03:04,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=510114.0, ans=0.1
2023-06-20 07:03:06,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=510114.0, ans=0.125
2023-06-20 07:03:33,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0
2023-06-20 07:04:06,984 INFO [train.py:996] (3/4) Epoch 3, batch 24050, loss[loss=0.2196, simple_loss=0.3067, pruned_loss=0.06619, over 21642.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3357, pruned_loss=0.09961, over 4274911.42 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0
2023-06-20 07:04:09,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=510234.0, ans=0.04949747468305833
2023-06-20 07:04:09,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0
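The optim.py:471 entries document the adaptive gradient clipping used here: the five numbers after "grad-norm quartiles" are evidently the min/25%/median/75%/max of recent per-batch gradient norms, and the printed threshold is consistently Clipping_scale times the median (e.g. 2.0 * 3.082e+02 = 6.164e+02 just above); percent-clipped is the share of recent batches whose gradient norm exceeded that threshold. A minimal sketch of the rule as it can be inferred from these numbers; the window length is an assumption, not taken from optim.py.

    import statistics
    from collections import deque

    class MedianClipper:
        """Scale gradients down when their norm exceeds
        clipping_scale * median(recent gradient norms)."""

        def __init__(self, clipping_scale: float = 2.0, window: int = 100):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)  # assumed history length

        def factor(self, grad_norm: float) -> float:
            # record the new norm, derive the threshold from the median
            self.norms.append(grad_norm)
            threshold = self.clipping_scale * statistics.median(self.norms)
            # a factor below 1.0 counts toward the percent-clipped statistic
            return min(1.0, threshold / max(grad_norm, 1e-20))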
2023-06-20 07:05:11,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=510354.0, ans=0.0
2023-06-20 07:06:02,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=510474.0, ans=0.0
2023-06-20 07:06:05,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=510474.0, ans=0.0
2023-06-20 07:06:07,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.566e+02 3.148e+02 3.843e+02 5.773e+02, threshold=6.296e+02, percent-clipped=0.0
2023-06-20 07:06:11,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5
2023-06-20 07:06:17,655 INFO [train.py:996] (3/4) Epoch 3, batch 24100, loss[loss=0.3629, simple_loss=0.4089, pruned_loss=0.1585, over 21384.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3366, pruned_loss=0.09847, over 4265424.69 frames. ], batch size: 507, lr: 1.03e-02, grad_scale: 32.0
2023-06-20 07:06:51,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=510654.0, ans=0.0
2023-06-20 07:08:05,146 INFO [train.py:996] (3/4) Epoch 3, batch 24150, loss[loss=0.3137, simple_loss=0.3629, pruned_loss=0.1323, over 21695.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3357, pruned_loss=0.1001, over 4271659.76 frames. ], batch size: 473, lr: 1.03e-02, grad_scale: 32.0
2023-06-20 07:09:51,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=511074.0, ans=0.04949747468305833
2023-06-20 07:10:00,380 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.945e+02 3.371e+02 4.147e+02 7.109e+02, threshold=6.741e+02, percent-clipped=1.0
2023-06-20 07:10:05,089 INFO [train.py:996] (3/4) Epoch 3, batch 24200, loss[loss=0.3043, simple_loss=0.3831, pruned_loss=0.1127, over 21587.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3381, pruned_loss=0.1012, over 4276590.10 frames. ], batch size: 441, lr: 1.03e-02, grad_scale: 32.0
2023-06-20 07:10:18,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=511134.0, ans=0.0
2023-06-20 07:10:21,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=511134.0, ans=0.125
2023-06-20 07:12:05,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=511374.0, ans=0.025
2023-06-20 07:12:08,268 INFO [train.py:996] (3/4) Epoch 3, batch 24250, loss[loss=0.1754, simple_loss=0.2732, pruned_loss=0.03881, over 21665.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3337, pruned_loss=0.09364, over 4280736.39 frames. ], batch size: 247, lr: 1.03e-02, grad_scale: 32.0
2023-06-20 07:12:15,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=511434.0, ans=0.125
2023-06-20 07:12:29,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs.
limit=15.0 2023-06-20 07:12:59,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=511554.0, ans=0.2 2023-06-20 07:13:07,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-20 07:13:18,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=511554.0, ans=0.0 2023-06-20 07:13:51,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 2.243e+02 2.741e+02 3.193e+02 5.760e+02, threshold=5.481e+02, percent-clipped=0.0 2023-06-20 07:14:00,788 INFO [train.py:996] (3/4) Epoch 3, batch 24300, loss[loss=0.238, simple_loss=0.3122, pruned_loss=0.08192, over 21363.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3255, pruned_loss=0.08707, over 4282555.61 frames. ], batch size: 548, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:15:39,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=511914.0, ans=0.125 2023-06-20 07:16:04,843 INFO [train.py:996] (3/4) Epoch 3, batch 24350, loss[loss=0.2833, simple_loss=0.3513, pruned_loss=0.1076, over 21857.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3221, pruned_loss=0.08768, over 4292198.78 frames. ], batch size: 371, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:17:23,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=512154.0, ans=0.04949747468305833 2023-06-20 07:17:30,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=512214.0, ans=0.125 2023-06-20 07:17:37,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-20 07:17:51,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=512274.0, ans=0.125 2023-06-20 07:18:15,539 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.742e+02 3.210e+02 4.189e+02 6.509e+02, threshold=6.419e+02, percent-clipped=5.0 2023-06-20 07:18:20,251 INFO [train.py:996] (3/4) Epoch 3, batch 24400, loss[loss=0.1915, simple_loss=0.2586, pruned_loss=0.06223, over 21789.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3275, pruned_loss=0.09264, over 4292894.29 frames. ], batch size: 102, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:18:36,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512334.0, ans=0.1 2023-06-20 07:18:55,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512394.0, ans=0.1 2023-06-20 07:20:11,822 INFO [train.py:996] (3/4) Epoch 3, batch 24450, loss[loss=0.3541, simple_loss=0.4181, pruned_loss=0.1451, over 21471.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3303, pruned_loss=0.09461, over 4295716.34 frames. 
], batch size: 508, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:22:18,357 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.557e+02 2.863e+02 3.394e+02 4.374e+02, threshold=5.725e+02, percent-clipped=0.0 2023-06-20 07:22:28,442 INFO [train.py:996] (3/4) Epoch 3, batch 24500, loss[loss=0.2539, simple_loss=0.3398, pruned_loss=0.08399, over 21322.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3302, pruned_loss=0.09464, over 4297206.79 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:23:10,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=513054.0, ans=0.1 2023-06-20 07:23:13,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=513054.0, ans=0.2 2023-06-20 07:23:36,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=513114.0, ans=0.2 2023-06-20 07:24:10,062 INFO [train.py:996] (3/4) Epoch 3, batch 24550, loss[loss=0.3184, simple_loss=0.3743, pruned_loss=0.1312, over 21721.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3336, pruned_loss=0.09742, over 4297584.53 frames. ], batch size: 441, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:24:53,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=513354.0, ans=0.125 2023-06-20 07:25:13,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=513354.0, ans=0.0 2023-06-20 07:25:23,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=513414.0, ans=0.0 2023-06-20 07:25:27,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=513414.0, ans=0.0 2023-06-20 07:25:27,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=513414.0, ans=0.0 2023-06-20 07:25:57,166 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.731e+02 3.290e+02 3.893e+02 7.449e+02, threshold=6.579e+02, percent-clipped=2.0 2023-06-20 07:26:00,007 INFO [train.py:996] (3/4) Epoch 3, batch 24600, loss[loss=0.2019, simple_loss=0.2638, pruned_loss=0.07, over 21186.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.329, pruned_loss=0.09736, over 4297715.89 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:26:11,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.75 vs. limit=10.0 2023-06-20 07:27:24,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=513654.0, ans=0.2 2023-06-20 07:28:06,818 INFO [train.py:996] (3/4) Epoch 3, batch 24650, loss[loss=0.2024, simple_loss=0.258, pruned_loss=0.07337, over 21656.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3192, pruned_loss=0.09545, over 4285818.08 frames. ], batch size: 248, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:28:22,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.35 vs. 
limit=22.5 2023-06-20 07:28:23,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=513834.0, ans=0.015 2023-06-20 07:29:02,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=513954.0, ans=0.125 2023-06-20 07:29:06,452 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2023-06-20 07:29:51,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=514074.0, ans=0.125 2023-06-20 07:29:55,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.706e+02 3.131e+02 3.721e+02 9.290e+02, threshold=6.262e+02, percent-clipped=1.0 2023-06-20 07:29:58,418 INFO [train.py:996] (3/4) Epoch 3, batch 24700, loss[loss=0.26, simple_loss=0.3224, pruned_loss=0.09876, over 21539.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3171, pruned_loss=0.09288, over 4280712.66 frames. ], batch size: 441, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:30:39,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=514194.0, ans=0.0 2023-06-20 07:31:01,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-20 07:31:02,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=514254.0, ans=0.0 2023-06-20 07:31:43,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514374.0, ans=0.1 2023-06-20 07:31:53,313 INFO [train.py:996] (3/4) Epoch 3, batch 24750, loss[loss=0.2245, simple_loss=0.2794, pruned_loss=0.08484, over 21595.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3116, pruned_loss=0.09016, over 4267429.68 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:31:55,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=514434.0, ans=0.0 2023-06-20 07:32:22,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=514434.0, ans=0.0 2023-06-20 07:32:37,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514494.0, ans=0.1 2023-06-20 07:33:47,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=514674.0, ans=0.025 2023-06-20 07:33:49,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=514674.0, ans=0.0 2023-06-20 07:33:52,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.345e+02 2.611e+02 2.913e+02 4.887e+02, threshold=5.223e+02, percent-clipped=0.0 2023-06-20 07:34:06,609 INFO [train.py:996] (3/4) Epoch 3, batch 24800, loss[loss=0.2605, simple_loss=0.3083, pruned_loss=0.1064, over 21560.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3066, pruned_loss=0.08932, over 4267916.48 frames. 
], batch size: 548, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:34:18,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=514734.0, ans=0.125 2023-06-20 07:34:47,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=12.0 2023-06-20 07:35:52,707 INFO [train.py:996] (3/4) Epoch 3, batch 24850, loss[loss=0.2411, simple_loss=0.3109, pruned_loss=0.0856, over 21068.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3071, pruned_loss=0.0911, over 4277544.62 frames. ], batch size: 608, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:36:36,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=515094.0, ans=0.125 2023-06-20 07:37:24,089 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:37:30,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.78 vs. limit=22.5 2023-06-20 07:37:46,544 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.912e+02 3.452e+02 3.888e+02 6.528e+02, threshold=6.903e+02, percent-clipped=3.0 2023-06-20 07:37:49,498 INFO [train.py:996] (3/4) Epoch 3, batch 24900, loss[loss=0.2776, simple_loss=0.3379, pruned_loss=0.1087, over 21197.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3078, pruned_loss=0.09081, over 4271424.57 frames. ], batch size: 143, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:38:00,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=515334.0, ans=0.125 2023-06-20 07:38:13,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=515334.0, ans=0.0 2023-06-20 07:38:28,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=515394.0, ans=0.125 2023-06-20 07:38:46,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=515454.0, ans=0.0 2023-06-20 07:39:12,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2023-06-20 07:39:39,334 INFO [train.py:996] (3/4) Epoch 3, batch 24950, loss[loss=0.274, simple_loss=0.3358, pruned_loss=0.1061, over 21820.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3169, pruned_loss=0.09608, over 4274904.29 frames. ], batch size: 282, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:40:04,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=515694.0, ans=0.125 2023-06-20 07:40:18,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=515754.0, ans=0.0 2023-06-20 07:40:54,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.51 vs. 
limit=8.0 2023-06-20 07:41:18,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=515874.0, ans=0.125 2023-06-20 07:41:33,067 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.916e+02 3.619e+02 4.634e+02 7.027e+02, threshold=7.237e+02, percent-clipped=1.0 2023-06-20 07:41:36,087 INFO [train.py:996] (3/4) Epoch 3, batch 25000, loss[loss=0.232, simple_loss=0.3006, pruned_loss=0.08164, over 21232.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3247, pruned_loss=0.09825, over 4268877.11 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:41:36,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=515934.0, ans=0.125 2023-06-20 07:41:59,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=515994.0, ans=0.1 2023-06-20 07:42:07,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=515994.0, ans=0.125 2023-06-20 07:42:27,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=516054.0, ans=0.125 2023-06-20 07:42:27,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=516054.0, ans=0.1 2023-06-20 07:42:54,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=516114.0, ans=0.125 2023-06-20 07:43:28,176 INFO [train.py:996] (3/4) Epoch 3, batch 25050, loss[loss=0.2328, simple_loss=0.2901, pruned_loss=0.08777, over 21588.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3164, pruned_loss=0.09529, over 4266201.37 frames. ], batch size: 298, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:43:56,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=516294.0, ans=0.05 2023-06-20 07:44:30,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=516354.0, ans=0.125 2023-06-20 07:44:30,953 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. limit=10.0 2023-06-20 07:44:44,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=516414.0, ans=0.125 2023-06-20 07:44:59,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.00 vs. limit=15.0 2023-06-20 07:45:14,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=516474.0, ans=0.0 2023-06-20 07:45:30,414 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.526e+02 2.799e+02 3.395e+02 4.701e+02, threshold=5.598e+02, percent-clipped=0.0 2023-06-20 07:45:33,105 INFO [train.py:996] (3/4) Epoch 3, batch 25100, loss[loss=0.2262, simple_loss=0.278, pruned_loss=0.08715, over 21469.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3092, pruned_loss=0.09451, over 4268515.45 frames. 
], batch size: 195, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:45:59,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=516594.0, ans=0.125 2023-06-20 07:46:19,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=516654.0, ans=0.0 2023-06-20 07:47:01,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=516774.0, ans=0.0 2023-06-20 07:47:26,340 INFO [train.py:996] (3/4) Epoch 3, batch 25150, loss[loss=0.2387, simple_loss=0.3202, pruned_loss=0.07861, over 21660.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3127, pruned_loss=0.09169, over 4243496.55 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:47:51,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=516894.0, ans=0.09899494936611666 2023-06-20 07:48:47,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=517074.0, ans=0.125 2023-06-20 07:49:12,220 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.389e+02 2.624e+02 3.346e+02 4.774e+02, threshold=5.249e+02, percent-clipped=0.0 2023-06-20 07:49:15,105 INFO [train.py:996] (3/4) Epoch 3, batch 25200, loss[loss=0.2262, simple_loss=0.3201, pruned_loss=0.06615, over 21770.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3122, pruned_loss=0.08959, over 4249978.18 frames. ], batch size: 332, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:49:24,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=517134.0, ans=0.2 2023-06-20 07:49:56,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.28 vs. limit=15.0 2023-06-20 07:49:57,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=517254.0, ans=0.125 2023-06-20 07:50:03,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.82 vs. limit=6.0 2023-06-20 07:50:04,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=517254.0, ans=0.125 2023-06-20 07:50:54,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=517374.0, ans=0.0 2023-06-20 07:51:12,211 INFO [train.py:996] (3/4) Epoch 3, batch 25250, loss[loss=0.2026, simple_loss=0.2715, pruned_loss=0.06681, over 21603.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3111, pruned_loss=0.08766, over 4254372.39 frames. 
], batch size: 247, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:51:45,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=517554.0, ans=0.125 2023-06-20 07:51:51,548 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:52:14,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=517614.0, ans=0.0 2023-06-20 07:52:44,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517614.0, ans=0.1 2023-06-20 07:52:56,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=517674.0, ans=0.125 2023-06-20 07:53:08,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517674.0, ans=0.1 2023-06-20 07:53:09,341 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.525e+02 2.860e+02 3.493e+02 8.717e+02, threshold=5.720e+02, percent-clipped=4.0 2023-06-20 07:53:09,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=517674.0, ans=0.0 2023-06-20 07:53:12,414 INFO [train.py:996] (3/4) Epoch 3, batch 25300, loss[loss=0.2664, simple_loss=0.3408, pruned_loss=0.09604, over 21737.00 frames. ], tot_loss[loss=0.243, simple_loss=0.31, pruned_loss=0.08796, over 4254480.28 frames. ], batch size: 332, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:53:42,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=517794.0, ans=0.2 2023-06-20 07:53:46,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.28 vs. limit=10.0 2023-06-20 07:54:30,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517974.0, ans=0.1 2023-06-20 07:54:59,684 INFO [train.py:996] (3/4) Epoch 3, batch 25350, loss[loss=0.262, simple_loss=0.3533, pruned_loss=0.08534, over 21219.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3137, pruned_loss=0.08774, over 4260150.47 frames. ], batch size: 548, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:55:03,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0 2023-06-20 07:55:59,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=518154.0, ans=0.125 2023-06-20 07:56:00,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. 
limit=15.0 2023-06-20 07:56:10,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518154.0, ans=0.1 2023-06-20 07:56:17,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=518214.0, ans=0.125 2023-06-20 07:56:51,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.620e+02 3.050e+02 3.855e+02 6.289e+02, threshold=6.099e+02, percent-clipped=1.0 2023-06-20 07:56:53,753 INFO [train.py:996] (3/4) Epoch 3, batch 25400, loss[loss=0.2206, simple_loss=0.2768, pruned_loss=0.08219, over 21323.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3078, pruned_loss=0.08653, over 4258258.22 frames. ], batch size: 144, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:57:19,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518394.0, ans=0.1 2023-06-20 07:57:43,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=518454.0, ans=0.125 2023-06-20 07:58:23,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=518574.0, ans=0.2 2023-06-20 07:58:31,276 INFO [train.py:996] (3/4) Epoch 3, batch 25450, loss[loss=0.2551, simple_loss=0.3313, pruned_loss=0.08939, over 21673.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3088, pruned_loss=0.08831, over 4239848.37 frames. ], batch size: 230, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 07:58:54,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-20 07:58:57,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=518694.0, ans=0.2 2023-06-20 07:59:19,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518754.0, ans=0.1 2023-06-20 08:00:13,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.261e+02 2.543e+02 3.250e+02 4.751e+02, threshold=5.087e+02, percent-clipped=0.0 2023-06-20 08:00:14,953 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.49 vs. limit=15.0 2023-06-20 08:00:15,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518934.0, ans=0.1 2023-06-20 08:00:16,652 INFO [train.py:996] (3/4) Epoch 3, batch 25500, loss[loss=0.3144, simple_loss=0.3782, pruned_loss=0.1253, over 21669.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3091, pruned_loss=0.08429, over 4240410.15 frames. 
], batch size: 441, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:01:32,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=519054.0, ans=0.125 2023-06-20 08:01:44,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=519114.0, ans=0.0 2023-06-20 08:02:08,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=519174.0, ans=0.035 2023-06-20 08:02:39,012 INFO [train.py:996] (3/4) Epoch 3, batch 25550, loss[loss=0.2397, simple_loss=0.3443, pruned_loss=0.06752, over 21864.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3167, pruned_loss=0.08494, over 4246485.20 frames. ], batch size: 371, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:03:00,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=519294.0, ans=0.125 2023-06-20 08:03:22,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.61 vs. limit=22.5 2023-06-20 08:03:29,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519354.0, ans=0.1 2023-06-20 08:04:12,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=519414.0, ans=0.2 2023-06-20 08:04:35,370 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:04:40,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.576e+02 2.895e+02 3.439e+02 5.948e+02, threshold=5.790e+02, percent-clipped=2.0 2023-06-20 08:04:43,681 INFO [train.py:996] (3/4) Epoch 3, batch 25600, loss[loss=0.3156, simple_loss=0.3807, pruned_loss=0.1252, over 21845.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3209, pruned_loss=0.08592, over 4257199.64 frames. ], batch size: 124, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:04:50,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519534.0, ans=0.1 2023-06-20 08:06:02,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=519714.0, ans=0.05 2023-06-20 08:06:25,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=519774.0, ans=0.09899494936611666 2023-06-20 08:06:33,950 INFO [train.py:996] (3/4) Epoch 3, batch 25650, loss[loss=0.2391, simple_loss=0.3081, pruned_loss=0.08502, over 21390.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3223, pruned_loss=0.08962, over 4260920.74 frames. ], batch size: 131, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:06:41,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=519834.0, ans=0.0 2023-06-20 08:06:56,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=519894.0, ans=0.0 2023-06-20 08:08:02,506 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. 
limit=6.0 2023-06-20 08:08:07,370 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.861e+02 3.374e+02 3.830e+02 5.312e+02, threshold=6.747e+02, percent-clipped=0.0 2023-06-20 08:08:10,528 INFO [train.py:996] (3/4) Epoch 3, batch 25700, loss[loss=0.267, simple_loss=0.3415, pruned_loss=0.09621, over 21289.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3187, pruned_loss=0.09077, over 4248606.11 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:08:14,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=520134.0, ans=0.0 2023-06-20 08:10:17,147 INFO [train.py:996] (3/4) Epoch 3, batch 25750, loss[loss=0.2884, simple_loss=0.3425, pruned_loss=0.1171, over 21247.00 frames. ], tot_loss[loss=0.258, simple_loss=0.326, pruned_loss=0.09499, over 4255478.92 frames. ], batch size: 143, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:12:04,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=520674.0, ans=0.125 2023-06-20 08:12:19,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.970e+02 3.422e+02 4.154e+02 6.514e+02, threshold=6.844e+02, percent-clipped=0.0 2023-06-20 08:12:22,613 INFO [train.py:996] (3/4) Epoch 3, batch 25800, loss[loss=0.2872, simple_loss=0.3586, pruned_loss=0.1079, over 21517.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3384, pruned_loss=0.09973, over 4257156.80 frames. ], batch size: 194, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:12:25,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=520734.0, ans=0.0 2023-06-20 08:12:53,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=520794.0, ans=0.1 2023-06-20 08:14:04,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.35 vs. limit=22.5 2023-06-20 08:14:42,723 INFO [train.py:996] (3/4) Epoch 3, batch 25850, loss[loss=0.2573, simple_loss=0.317, pruned_loss=0.0988, over 21849.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3381, pruned_loss=0.09933, over 4259538.24 frames. ], batch size: 282, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:14:43,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=521034.0, ans=0.95 2023-06-20 08:14:59,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.88 vs. limit=10.0 2023-06-20 08:15:35,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=521154.0, ans=0.125 2023-06-20 08:15:57,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-20 08:16:20,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. 
limit=6.0 2023-06-20 08:16:42,375 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.635e+02 3.168e+02 4.552e+02 6.616e+02, threshold=6.336e+02, percent-clipped=0.0 2023-06-20 08:16:45,274 INFO [train.py:996] (3/4) Epoch 3, batch 25900, loss[loss=0.2941, simple_loss=0.373, pruned_loss=0.1076, over 21747.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3405, pruned_loss=0.09986, over 4267438.54 frames. ], batch size: 298, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:17:42,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=521394.0, ans=0.125 2023-06-20 08:17:58,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=521454.0, ans=0.125 2023-06-20 08:18:06,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=521514.0, ans=0.2 2023-06-20 08:18:54,004 INFO [train.py:996] (3/4) Epoch 3, batch 25950, loss[loss=0.2912, simple_loss=0.3557, pruned_loss=0.1133, over 21751.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3469, pruned_loss=0.1034, over 4273333.30 frames. ], batch size: 332, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:19:13,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521694.0, ans=0.1 2023-06-20 08:19:30,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=521694.0, ans=0.0 2023-06-20 08:19:33,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=521754.0, ans=0.05 2023-06-20 08:19:44,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=521754.0, ans=0.2 2023-06-20 08:20:10,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=22.5 2023-06-20 08:20:13,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=521874.0, ans=15.0 2023-06-20 08:20:46,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.642e+02 3.151e+02 3.673e+02 6.319e+02, threshold=6.303e+02, percent-clipped=0.0 2023-06-20 08:20:49,272 INFO [train.py:996] (3/4) Epoch 3, batch 26000, loss[loss=0.2686, simple_loss=0.35, pruned_loss=0.09363, over 21934.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3476, pruned_loss=0.1019, over 4276261.69 frames. ], batch size: 317, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:21:38,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=522054.0, ans=0.2 2023-06-20 08:22:05,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=522114.0, ans=0.125 2023-06-20 08:22:35,233 INFO [train.py:996] (3/4) Epoch 3, batch 26050, loss[loss=0.2596, simple_loss=0.3276, pruned_loss=0.09573, over 20659.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3465, pruned_loss=0.1027, over 4283671.09 frames. 
], batch size: 609, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:23:10,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=522294.0, ans=0.2 2023-06-20 08:23:26,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=15.0 2023-06-20 08:23:38,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=522354.0, ans=0.2 2023-06-20 08:23:51,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-20 08:24:39,852 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.804e+02 3.203e+02 3.918e+02 6.790e+02, threshold=6.407e+02, percent-clipped=4.0 2023-06-20 08:24:42,771 INFO [train.py:996] (3/4) Epoch 3, batch 26100, loss[loss=0.2788, simple_loss=0.3318, pruned_loss=0.1129, over 21915.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3403, pruned_loss=0.1021, over 4285471.53 frames. ], batch size: 414, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:24:52,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=522534.0, ans=0.0 2023-06-20 08:25:24,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=522654.0, ans=0.0 2023-06-20 08:25:42,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2023-06-20 08:26:18,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-20 08:26:44,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522774.0, ans=0.1 2023-06-20 08:26:46,828 INFO [train.py:996] (3/4) Epoch 3, batch 26150, loss[loss=0.2517, simple_loss=0.3119, pruned_loss=0.09571, over 20947.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3355, pruned_loss=0.1017, over 4290251.69 frames. ], batch size: 608, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:27:04,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=522834.0, ans=0.0 2023-06-20 08:27:12,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.47 vs. 
limit=10.0 2023-06-20 08:27:13,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=522894.0, ans=0.0 2023-06-20 08:27:21,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=522894.0, ans=0.0 2023-06-20 08:27:43,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=522954.0, ans=0.0 2023-06-20 08:28:11,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=523014.0, ans=0.0 2023-06-20 08:28:18,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=523014.0, ans=0.125 2023-06-20 08:28:25,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=523014.0, ans=0.125 2023-06-20 08:28:37,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=523074.0, ans=0.0 2023-06-20 08:28:39,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.726e+02 3.009e+02 3.723e+02 5.538e+02, threshold=6.017e+02, percent-clipped=0.0 2023-06-20 08:28:49,612 INFO [train.py:996] (3/4) Epoch 3, batch 26200, loss[loss=0.2552, simple_loss=0.353, pruned_loss=0.07875, over 21807.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3356, pruned_loss=0.09895, over 4292768.61 frames. ], batch size: 282, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:30:04,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-20 08:30:13,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523254.0, ans=0.1 2023-06-20 08:30:15,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=523254.0, ans=0.125 2023-06-20 08:30:59,894 INFO [train.py:996] (3/4) Epoch 3, batch 26250, loss[loss=0.2613, simple_loss=0.3394, pruned_loss=0.0916, over 21698.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3398, pruned_loss=0.09731, over 4292992.12 frames. ], batch size: 389, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:31:12,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=523434.0, ans=0.2 2023-06-20 08:31:49,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=523494.0, ans=0.125 2023-06-20 08:33:04,809 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.312e+02 2.693e+02 3.337e+02 4.034e+02 6.745e+02, threshold=6.673e+02, percent-clipped=1.0 2023-06-20 08:33:07,764 INFO [train.py:996] (3/4) Epoch 3, batch 26300, loss[loss=0.2239, simple_loss=0.295, pruned_loss=0.07638, over 21960.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.336, pruned_loss=0.09747, over 4300072.41 frames. 
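The frequent [scaling.py:182] lines report the current value ("ans") of named ScheduledFloat hyperparameters: dropout rates, skip rates and balancer probabilities that are annealed as a function of batch_count. Here is a sketch under the assumption that ScheduledFloat interpolates piecewise-linearly between (batch_count, value) breakpoints and clamps outside them; the breakpoint values below are made up for illustration:

```python
class ScheduledFloat:
    """Piecewise-linear schedule over batch_count, e.g. (0.0, 0.5),
    (16000.0, 0.0): starts at 0.5, decays linearly to 0.0 by batch
    16000, and stays constant outside the breakpoints."""

    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) breakpoints

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# e.g. a skip-rate that anneals to zero and then stays there:
skip_rate = ScheduledFloat((0.0, 0.5), (4000.0, 0.05), (16000.0, 0.0))
print(skip_rate.value(521694.0))  # -> 0.0
```

By batch_count around 5e5, schedules of this shape have long since reached their final value, which is consistent with the many *_skip_rate records above printing ans=0.0.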
], batch size: 316, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:33:38,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=523794.0, ans=0.0 2023-06-20 08:34:28,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=523914.0, ans=0.125 2023-06-20 08:34:59,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=523974.0, ans=0.125 2023-06-20 08:35:13,488 INFO [train.py:996] (3/4) Epoch 3, batch 26350, loss[loss=0.2792, simple_loss=0.3434, pruned_loss=0.1075, over 21784.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3345, pruned_loss=0.09868, over 4298777.32 frames. ], batch size: 332, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:36:12,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-20 08:36:13,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=524154.0, ans=0.2 2023-06-20 08:36:19,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=524154.0, ans=0.125 2023-06-20 08:37:02,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.700e+02 3.038e+02 3.604e+02 6.055e+02, threshold=6.077e+02, percent-clipped=0.0 2023-06-20 08:37:05,307 INFO [train.py:996] (3/4) Epoch 3, batch 26400, loss[loss=0.2357, simple_loss=0.2898, pruned_loss=0.09087, over 21839.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3294, pruned_loss=0.09833, over 4294735.09 frames. ], batch size: 118, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:37:07,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=524334.0, ans=0.1 2023-06-20 08:37:37,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=524394.0, ans=0.125 2023-06-20 08:38:36,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524574.0, ans=0.1 2023-06-20 08:38:48,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524574.0, ans=0.1 2023-06-20 08:39:02,890 INFO [train.py:996] (3/4) Epoch 3, batch 26450, loss[loss=0.2551, simple_loss=0.336, pruned_loss=0.08716, over 21377.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3286, pruned_loss=0.0975, over 4289334.08 frames. ], batch size: 211, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:41:08,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.874e+02 3.456e+02 4.321e+02 8.810e+02, threshold=6.911e+02, percent-clipped=7.0 2023-06-20 08:41:11,690 INFO [train.py:996] (3/4) Epoch 3, batch 26500, loss[loss=0.213, simple_loss=0.2776, pruned_loss=0.07425, over 21361.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3289, pruned_loss=0.09539, over 4279790.73 frames. 
], batch size: 194, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:41:16,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=524934.0, ans=0.0 2023-06-20 08:41:51,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=525054.0, ans=0.125 2023-06-20 08:42:31,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=525114.0, ans=0.1 2023-06-20 08:43:28,416 INFO [train.py:996] (3/4) Epoch 3, batch 26550, loss[loss=0.1934, simple_loss=0.2561, pruned_loss=0.06537, over 21188.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3239, pruned_loss=0.09145, over 4270658.00 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 64.0 2023-06-20 08:44:09,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=525294.0, ans=0.125 2023-06-20 08:44:45,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-06-20 08:45:36,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=525474.0, ans=0.125 2023-06-20 08:45:38,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=525474.0, ans=0.0 2023-06-20 08:45:39,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.519e+02 3.120e+02 3.991e+02 8.354e+02, threshold=6.239e+02, percent-clipped=2.0 2023-06-20 08:45:40,628 INFO [train.py:996] (3/4) Epoch 3, batch 26600, loss[loss=0.284, simple_loss=0.333, pruned_loss=0.1176, over 21339.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3237, pruned_loss=0.08901, over 4264274.48 frames. ], batch size: 471, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:45:41,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=525534.0, ans=0.125 2023-06-20 08:46:01,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=525594.0, ans=0.2 2023-06-20 08:47:41,372 INFO [train.py:996] (3/4) Epoch 3, batch 26650, loss[loss=0.183, simple_loss=0.2676, pruned_loss=0.04918, over 21676.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3162, pruned_loss=0.0877, over 4258866.81 frames. ], batch size: 415, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:47:41,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=525834.0, ans=0.125 2023-06-20 08:48:27,266 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:49:16,479 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.669e+02 2.277e+02 2.544e+02 2.972e+02 4.413e+02, threshold=5.088e+02, percent-clipped=0.0 2023-06-20 08:49:23,175 INFO [train.py:996] (3/4) Epoch 3, batch 26700, loss[loss=0.2638, simple_loss=0.3229, pruned_loss=0.1023, over 21911.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3094, pruned_loss=0.08435, over 4262864.34 frames. 
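Every train.py:996 record carries three losses: loss, simple_loss and pruned_loss, reflecting the two-pass pruned RNN-T objective, in which a cheap simple-joiner pass yields pruning bounds and the full joiner is evaluated only inside them. The printed numbers are consistent with a fixed 0.5 weight on simple_loss: 0.5 x 0.3239 + 0.09145 = 0.2534, matching the batch 26550 totals above. A hedged sketch of the combination (the warm-up-dependent re-weighting that icefall applies early in training is omitted here):

```python
def combine_rnnt_losses(simple_loss: float, pruned_loss: float,
                        simple_loss_scale: float = 0.5) -> float:
    """Combine the two pruned-transducer losses into the printed 'loss'.

    simple_loss comes from the cheap linear-joiner pass that also produces
    the pruning bounds; pruned_loss is the full-joiner loss restricted to
    those bounds. Sketch only; scales during warm-up may differ."""
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combine_rnnt_losses(0.3239, 0.09145) - 0.2534) < 1e-4
```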
], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:49:24,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-06-20 08:50:26,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-06-20 08:51:36,254 INFO [train.py:996] (3/4) Epoch 3, batch 26750, loss[loss=0.2382, simple_loss=0.3147, pruned_loss=0.08083, over 21427.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3106, pruned_loss=0.0842, over 4268178.15 frames. ], batch size: 211, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:53:55,878 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.827e+02 3.440e+02 3.863e+02 5.872e+02, threshold=6.879e+02, percent-clipped=7.0 2023-06-20 08:54:02,661 INFO [train.py:996] (3/4) Epoch 3, batch 26800, loss[loss=0.2763, simple_loss=0.3441, pruned_loss=0.1042, over 21390.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3183, pruned_loss=0.0889, over 4270398.53 frames. ], batch size: 548, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:54:39,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=526854.0, ans=0.0 2023-06-20 08:55:15,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=526914.0, ans=0.1 2023-06-20 08:55:17,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=526914.0, ans=0.125 2023-06-20 08:55:24,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=526914.0, ans=0.125 2023-06-20 08:55:29,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=526974.0, ans=0.125 2023-06-20 08:56:05,475 INFO [train.py:996] (3/4) Epoch 3, batch 26850, loss[loss=0.2468, simple_loss=0.3012, pruned_loss=0.09621, over 21693.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3217, pruned_loss=0.09252, over 4275488.75 frames. ], batch size: 112, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:56:19,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=527094.0, ans=0.0 2023-06-20 08:56:51,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=527154.0, ans=0.125 2023-06-20 08:57:01,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=527214.0, ans=0.05 2023-06-20 08:57:18,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=527274.0, ans=0.09899494936611666 2023-06-20 08:57:38,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=527274.0, ans=0.125 2023-06-20 08:57:39,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.442e+02 3.000e+02 3.641e+02 8.761e+02, threshold=6.000e+02, percent-clipped=1.0 2023-06-20 08:57:40,684 INFO [train.py:996] (3/4) Epoch 3, batch 26900, loss[loss=0.214, simple_loss=0.273, pruned_loss=0.0775, over 21600.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3126, pruned_loss=0.0908, over 4272752.73 frames. 
], batch size: 298, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:58:29,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=527394.0, ans=10.0 2023-06-20 08:58:32,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=527394.0, ans=0.2 2023-06-20 08:58:39,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=527454.0, ans=0.125 2023-06-20 08:58:57,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-20 08:59:41,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.36 vs. limit=15.0 2023-06-20 08:59:43,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=527574.0, ans=0.0 2023-06-20 08:59:47,428 INFO [train.py:996] (3/4) Epoch 3, batch 26950, loss[loss=0.2898, simple_loss=0.3726, pruned_loss=0.1035, over 21751.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3127, pruned_loss=0.0911, over 4272785.62 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 09:00:13,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=527694.0, ans=0.125 2023-06-20 09:00:16,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=527694.0, ans=0.025 2023-06-20 09:01:18,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-20 09:01:34,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-20 09:01:39,290 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.583e+02 3.256e+02 3.814e+02 7.772e+02, threshold=6.512e+02, percent-clipped=3.0 2023-06-20 09:01:52,867 INFO [train.py:996] (3/4) Epoch 3, batch 27000, loss[loss=0.2959, simple_loss=0.3732, pruned_loss=0.1093, over 21452.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3129, pruned_loss=0.08887, over 4278019.17 frames. ], batch size: 471, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 09:01:52,868 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 09:02:49,144 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2585, simple_loss=0.355, pruned_loss=0.081, over 1796401.00 frames. 2023-06-20 09:02:49,145 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-20 09:02:57,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.71 vs. limit=6.0 2023-06-20 09:03:09,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=527994.0, ans=0.0 2023-06-20 09:04:29,068 INFO [train.py:996] (3/4) Epoch 3, batch 27050, loss[loss=0.2616, simple_loss=0.3275, pruned_loss=0.09787, over 21330.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3158, pruned_loss=0.08677, over 4276223.09 frames. 
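The validation records just above report a single frame-weighted figure, "loss=0.2585 ... over 1796401.00 frames": each dev batch contributes loss times its frame count, and the printed value is the frame-normalized sum, which is why the frame count is identical on every validation pass over the same dev set. A small accumulator in that spirit (icefall's MetricsTracker does equivalent bookkeeping per loss component):

```python
class FrameWeightedLoss:
    """Accumulate frame-weighted losses, mirroring the 'over N frames'
    reporting in the log. Illustrative names, not icefall's classes."""

    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum += loss * num_frames
        self.frames += num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

val = FrameWeightedLoss()
# for each dev batch: val.update(batch_loss, batch_frames)
# -> printed as "validation: loss=0.2585 ... over 1796401.00 frames."
```

The same decomposition holds for the validation pass: 0.5 x 0.355 + 0.081 = 0.2585.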
], batch size: 144, lr: 1.02e-02, grad_scale: 16.0 2023-06-20 09:04:43,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=528294.0, ans=0.125 2023-06-20 09:06:17,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=528474.0, ans=0.0 2023-06-20 09:06:31,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.379e+02 2.798e+02 3.211e+02 4.498e+02, threshold=5.597e+02, percent-clipped=0.0 2023-06-20 09:06:31,087 INFO [train.py:996] (3/4) Epoch 3, batch 27100, loss[loss=0.2434, simple_loss=0.3326, pruned_loss=0.07711, over 21891.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3178, pruned_loss=0.08659, over 4280149.14 frames. ], batch size: 316, lr: 1.02e-02, grad_scale: 16.0 2023-06-20 09:06:58,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=528594.0, ans=0.0 2023-06-20 09:07:12,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-20 09:07:28,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=528654.0, ans=0.0 2023-06-20 09:08:24,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=528774.0, ans=0.125 2023-06-20 09:08:32,556 INFO [train.py:996] (3/4) Epoch 3, batch 27150, loss[loss=0.259, simple_loss=0.338, pruned_loss=0.08999, over 21279.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3293, pruned_loss=0.09006, over 4280742.60 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 09:08:33,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=528834.0, ans=0.0 2023-06-20 09:10:50,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.751e+02 3.147e+02 3.770e+02 6.870e+02, threshold=6.294e+02, percent-clipped=5.0 2023-06-20 09:10:50,543 INFO [train.py:996] (3/4) Epoch 3, batch 27200, loss[loss=0.3252, simple_loss=0.3927, pruned_loss=0.1288, over 21711.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3374, pruned_loss=0.09294, over 4281200.78 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:11:24,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.92 vs. limit=22.5 2023-06-20 09:12:12,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529314.0, ans=0.1 2023-06-20 09:12:22,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=529314.0, ans=0.125 2023-06-20 09:12:50,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=529374.0, ans=0.0 2023-06-20 09:12:55,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=529374.0, ans=0.0 2023-06-20 09:12:57,434 INFO [train.py:996] (3/4) Epoch 3, batch 27250, loss[loss=0.3062, simple_loss=0.3588, pruned_loss=0.1268, over 21832.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3413, pruned_loss=0.09846, over 4279441.36 frames. 
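The grad_scale field in each training record is the dynamic fp16 loss scale: it grows after a long run of finite gradients and is halved when an overflow is detected, which is how the value moves between 16.0, 32.0 and 64.0 in this stretch of the log (64.0 at batch 26550 is back to 32.0 by batch 26600). A minimal sketch using PyTorch's stock GradScaler; model, optimizer and batch are placeholders, and the real training loop has more steps:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=32.0)  # grad_scale in the log ~ scaler.get_scale()

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with autocast():            # forward in fp16 where safe
        loss = model(batch)     # placeholder: compute the transducer loss here
    scaler.scale(loss).backward()   # scale up to avoid fp16 underflow
    scaler.step(optimizer)      # unscales grads; skips the step on overflow
    scaler.update()             # grows the scale on success, halves on overflow
    return loss.detach(), scaler.get_scale()
```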
], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:14:15,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529614.0, ans=0.1 2023-06-20 09:14:53,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-20 09:14:53,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=529674.0, ans=15.0 2023-06-20 09:15:03,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 2.889e+02 3.267e+02 4.171e+02 5.993e+02, threshold=6.535e+02, percent-clipped=0.0 2023-06-20 09:15:03,719 INFO [train.py:996] (3/4) Epoch 3, batch 27300, loss[loss=0.2805, simple_loss=0.3611, pruned_loss=0.09993, over 21336.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3443, pruned_loss=0.09987, over 4281990.76 frames. ], batch size: 549, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:15:53,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. limit=6.0 2023-06-20 09:16:11,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=529854.0, ans=0.1 2023-06-20 09:16:23,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=529914.0, ans=0.05 2023-06-20 09:16:23,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=529914.0, ans=0.0 2023-06-20 09:17:23,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=529974.0, ans=0.2 2023-06-20 09:17:26,158 INFO [train.py:996] (3/4) Epoch 3, batch 27350, loss[loss=0.2637, simple_loss=0.334, pruned_loss=0.09668, over 21265.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3458, pruned_loss=0.1008, over 4275688.53 frames. ], batch size: 159, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:17:48,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=530094.0, ans=0.125 2023-06-20 09:18:03,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-20 09:18:14,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=530154.0, ans=0.125 2023-06-20 09:18:42,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=530154.0, ans=0.0 2023-06-20 09:19:31,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.578e+02 2.892e+02 3.255e+02 4.289e+02, threshold=5.785e+02, percent-clipped=0.0 2023-06-20 09:19:31,156 INFO [train.py:996] (3/4) Epoch 3, batch 27400, loss[loss=0.2387, simple_loss=0.2909, pruned_loss=0.09328, over 21407.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.341, pruned_loss=0.09994, over 4281320.40 frames. 
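The lr field decays very slowly here (1.02e-02 to 1.01e-02 near batch 27150) because both factors of an Eden-style schedule are already far past their knees, where the decay is roughly batch^-0.5 times epoch^-0.5. A sketch of that rule with illustrative constants; the exact scheduler state of this run is not recoverable from the log:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    """Eden-style learning-rate rule: smooth decay in both the batch and
    epoch counts after the lr_batches / lr_epochs knees. Sketch only."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
```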
], batch size: 177, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:20:04,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=530394.0, ans=0.125 2023-06-20 09:20:07,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=530394.0, ans=0.2 2023-06-20 09:20:38,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.45 vs. limit=15.0 2023-06-20 09:21:32,888 INFO [train.py:996] (3/4) Epoch 3, batch 27450, loss[loss=0.2482, simple_loss=0.3323, pruned_loss=0.08206, over 21707.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3347, pruned_loss=0.09814, over 4277754.46 frames. ], batch size: 332, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:21:53,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=22.5 2023-06-20 09:21:58,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=530694.0, ans=0.0 2023-06-20 09:22:19,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-06-20 09:22:23,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=530754.0, ans=0.2 2023-06-20 09:23:33,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.511e+02 2.901e+02 3.384e+02 5.453e+02, threshold=5.802e+02, percent-clipped=0.0 2023-06-20 09:23:33,278 INFO [train.py:996] (3/4) Epoch 3, batch 27500, loss[loss=0.2327, simple_loss=0.2905, pruned_loss=0.08747, over 21125.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3333, pruned_loss=0.09834, over 4280174.28 frames. ], batch size: 608, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:23:50,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530934.0, ans=0.1 2023-06-20 09:24:10,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.52 vs. limit=15.0 2023-06-20 09:25:21,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.51 vs. limit=10.0 2023-06-20 09:25:24,437 INFO [train.py:996] (3/4) Epoch 3, batch 27550, loss[loss=0.2466, simple_loss=0.3007, pruned_loss=0.09627, over 21517.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3268, pruned_loss=0.09496, over 4272702.00 frames. 
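Many of the scheduled names above belong to Balancer modules (balancer1.prob, min_positive, max_abs and so on). These act only in the backward pass: the forward output is unchanged, and gradients are nudged so that per-channel statistics, such as the fraction of positive activations, stay inside a target range. A toy straight-through version to make the idea concrete; icefall's Balancer is considerably more elaborate:

```python
import torch

class SimpleBalancer(torch.autograd.Function):
    """Identity in forward; in backward, add a small gradient term that
    pushes the per-channel fraction of positive activations toward
    [min_positive, max_positive]. Toy version of the balancer.* records."""

    @staticmethod
    def forward(ctx, x, min_positive=0.05, max_positive=0.95, scale=1e-4):
        ctx.save_for_backward(x)
        ctx.cfg = (min_positive, max_positive, scale)
        return x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        min_pos, max_pos, scale = ctx.cfg
        # x assumed (frames, channels); fraction positive per channel
        frac_pos = (x > 0).float().mean(dim=0, keepdim=True)
        # +1 where too few positives (push activations up), -1 where too many
        push = (frac_pos < min_pos).float() - (frac_pos > max_pos).float()
        grad_x = grad_out - scale * push * grad_out.abs().mean()
        return grad_x, None, None, None

# usage: y = SimpleBalancer.apply(x)   # forward output identical to x
```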
], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:25:24,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=531234.0, ans=0.07 2023-06-20 09:26:08,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=531294.0, ans=0.1 2023-06-20 09:26:10,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=531354.0, ans=0.125 2023-06-20 09:26:25,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=531354.0, ans=0.0 2023-06-20 09:26:28,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=531354.0, ans=0.0 2023-06-20 09:27:04,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=531474.0, ans=0.125 2023-06-20 09:27:16,438 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 2.453e+02 3.210e+02 4.120e+02 6.200e+02, threshold=6.421e+02, percent-clipped=4.0 2023-06-20 09:27:16,472 INFO [train.py:996] (3/4) Epoch 3, batch 27600, loss[loss=0.2357, simple_loss=0.2888, pruned_loss=0.09131, over 21209.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3192, pruned_loss=0.09333, over 4264218.32 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:27:33,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=531534.0, ans=0.125 2023-06-20 09:28:05,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=531594.0, ans=0.1 2023-06-20 09:28:10,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=12.0 2023-06-20 09:28:30,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=531714.0, ans=0.125 2023-06-20 09:28:54,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.38 vs. limit=22.5 2023-06-20 09:29:12,395 INFO [train.py:996] (3/4) Epoch 3, batch 27650, loss[loss=0.2504, simple_loss=0.309, pruned_loss=0.09588, over 21407.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3142, pruned_loss=0.09328, over 4258344.02 frames. ], batch size: 548, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:29:19,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531834.0, ans=0.1 2023-06-20 09:29:23,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.68 vs. 
limit=15.0 2023-06-20 09:29:41,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=531894.0, ans=0.0 2023-06-20 09:31:06,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=532074.0, ans=0.05 2023-06-20 09:31:10,931 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.456e+02 3.066e+02 3.957e+02 5.583e+02, threshold=6.132e+02, percent-clipped=0.0 2023-06-20 09:31:10,954 INFO [train.py:996] (3/4) Epoch 3, batch 27700, loss[loss=0.2746, simple_loss=0.36, pruned_loss=0.09461, over 19854.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3136, pruned_loss=0.09046, over 4260677.36 frames. ], batch size: 703, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:33:20,142 INFO [train.py:996] (3/4) Epoch 3, batch 27750, loss[loss=0.2242, simple_loss=0.291, pruned_loss=0.07867, over 21820.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3171, pruned_loss=0.0902, over 4257444.25 frames. ], batch size: 118, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:33:58,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=532494.0, ans=0.125 2023-06-20 09:33:58,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=532494.0, ans=0.125 2023-06-20 09:34:14,504 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.43 vs. limit=22.5 2023-06-20 09:35:16,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.578e+02 3.101e+02 3.636e+02 6.452e+02, threshold=6.201e+02, percent-clipped=3.0 2023-06-20 09:35:17,008 INFO [train.py:996] (3/4) Epoch 3, batch 27800, loss[loss=0.2593, simple_loss=0.323, pruned_loss=0.09776, over 21758.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3163, pruned_loss=0.0909, over 4271308.11 frames. ], batch size: 389, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:35:25,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=532734.0, ans=0.125 2023-06-20 09:35:53,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=532794.0, ans=0.125 2023-06-20 09:35:55,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=532794.0, ans=0.125 2023-06-20 09:36:06,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=532794.0, ans=0.125 2023-06-20 09:36:09,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=532854.0, ans=0.0 2023-06-20 09:36:20,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=532854.0, ans=0.125 2023-06-20 09:36:22,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=532854.0, ans=0.125 2023-06-20 09:36:33,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532914.0, ans=0.1 2023-06-20 09:37:27,379 INFO [train.py:996] (3/4) Epoch 3, batch 27850, loss[loss=0.2699, simple_loss=0.3391, pruned_loss=0.1004, over 21880.00 frames. 
], tot_loss[loss=0.2502, simple_loss=0.3154, pruned_loss=0.09246, over 4274676.93 frames. ], batch size: 118, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:37:35,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=533034.0, ans=0.0 2023-06-20 09:39:41,746 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.579e+02 2.899e+02 3.562e+02 7.537e+02, threshold=5.798e+02, percent-clipped=1.0 2023-06-20 09:39:41,769 INFO [train.py:996] (3/4) Epoch 3, batch 27900, loss[loss=0.3009, simple_loss=0.3855, pruned_loss=0.1082, over 21513.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3264, pruned_loss=0.09372, over 4271990.56 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:40:06,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=533334.0, ans=0.125 2023-06-20 09:41:47,950 INFO [train.py:996] (3/4) Epoch 3, batch 27950, loss[loss=0.24, simple_loss=0.3166, pruned_loss=0.08169, over 21428.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3256, pruned_loss=0.08992, over 4273047.44 frames. ], batch size: 211, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:43:40,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=533874.0, ans=0.0 2023-06-20 09:43:44,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=22.5 2023-06-20 09:43:45,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533874.0, ans=0.1 2023-06-20 09:43:54,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.301e+02 2.628e+02 3.233e+02 4.917e+02, threshold=5.255e+02, percent-clipped=0.0 2023-06-20 09:43:54,358 INFO [train.py:996] (3/4) Epoch 3, batch 28000, loss[loss=0.2188, simple_loss=0.2967, pruned_loss=0.07044, over 21682.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3225, pruned_loss=0.08748, over 4280660.62 frames. ], batch size: 263, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:44:03,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533934.0, ans=0.1 2023-06-20 09:44:03,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533934.0, ans=0.1 2023-06-20 09:44:16,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=533994.0, ans=0.125 2023-06-20 09:44:43,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=534054.0, ans=0.0 2023-06-20 09:44:57,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. 
limit=22.5 2023-06-20 09:45:15,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=534174.0, ans=0.0 2023-06-20 09:45:32,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=534174.0, ans=15.0 2023-06-20 09:45:32,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-06-20 09:46:01,518 INFO [train.py:996] (3/4) Epoch 3, batch 28050, loss[loss=0.2142, simple_loss=0.2628, pruned_loss=0.08281, over 21304.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3208, pruned_loss=0.08898, over 4290171.13 frames. ], batch size: 159, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:46:08,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=534234.0, ans=0.125 2023-06-20 09:46:39,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=534294.0, ans=0.04949747468305833 2023-06-20 09:46:48,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=534354.0, ans=10.0 2023-06-20 09:47:22,403 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:48:07,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.725e+02 3.043e+02 3.696e+02 6.736e+02, threshold=6.086e+02, percent-clipped=4.0 2023-06-20 09:48:07,131 INFO [train.py:996] (3/4) Epoch 3, batch 28100, loss[loss=0.2182, simple_loss=0.2732, pruned_loss=0.08155, over 21210.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3186, pruned_loss=0.08923, over 4280995.36 frames. ], batch size: 548, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:48:40,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=534594.0, ans=0.125 2023-06-20 09:49:51,903 INFO [train.py:996] (3/4) Epoch 3, batch 28150, loss[loss=0.241, simple_loss=0.2979, pruned_loss=0.09203, over 21511.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3121, pruned_loss=0.089, over 4279089.43 frames. ], batch size: 391, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:50:08,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.86 vs. limit=15.0 2023-06-20 09:50:26,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=534894.0, ans=0.2 2023-06-20 09:50:59,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=535014.0, ans=0.125 2023-06-20 09:51:01,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. 
limit=15.0 2023-06-20 09:51:23,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=535014.0, ans=0.125 2023-06-20 09:51:50,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535074.0, ans=0.1 2023-06-20 09:51:52,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.680e+02 3.048e+02 3.566e+02 6.007e+02, threshold=6.096e+02, percent-clipped=0.0 2023-06-20 09:51:52,837 INFO [train.py:996] (3/4) Epoch 3, batch 28200, loss[loss=0.2942, simple_loss=0.4234, pruned_loss=0.08246, over 19776.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3132, pruned_loss=0.09077, over 4270374.08 frames. ], batch size: 702, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:52:22,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=535194.0, ans=0.0 2023-06-20 09:52:49,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=15.0 2023-06-20 09:54:03,160 INFO [train.py:996] (3/4) Epoch 3, batch 28250, loss[loss=0.2021, simple_loss=0.2669, pruned_loss=0.06859, over 21617.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3165, pruned_loss=0.09345, over 4267734.82 frames. ], batch size: 247, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:54:41,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=535554.0, ans=0.0 2023-06-20 09:54:55,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=535614.0, ans=0.125 2023-06-20 09:55:52,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.642e+02 2.920e+02 3.400e+02 5.282e+02, threshold=5.841e+02, percent-clipped=0.0 2023-06-20 09:55:52,621 INFO [train.py:996] (3/4) Epoch 3, batch 28300, loss[loss=0.1757, simple_loss=0.2559, pruned_loss=0.04778, over 21410.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3142, pruned_loss=0.09107, over 4268896.17 frames. ], batch size: 211, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:55:54,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535734.0, ans=0.1 2023-06-20 09:56:44,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-20 09:57:08,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=535914.0, ans=0.125 2023-06-20 09:57:55,377 INFO [train.py:996] (3/4) Epoch 3, batch 28350, loss[loss=0.2271, simple_loss=0.2966, pruned_loss=0.07885, over 21848.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3108, pruned_loss=0.08563, over 4258436.21 frames. ], batch size: 372, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:57:56,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. 
limit=6.0 2023-06-20 09:58:29,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=536094.0, ans=0.125 2023-06-20 09:58:47,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=536154.0, ans=0.125 2023-06-20 09:58:51,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.51 vs. limit=6.0 2023-06-20 09:59:57,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.442e+02 2.906e+02 3.756e+02 6.474e+02, threshold=5.811e+02, percent-clipped=1.0 2023-06-20 09:59:57,240 INFO [train.py:996] (3/4) Epoch 3, batch 28400, loss[loss=0.2897, simple_loss=0.3328, pruned_loss=0.1233, over 21305.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3075, pruned_loss=0.08593, over 4257193.23 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:00:05,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=536334.0, ans=0.125 2023-06-20 10:01:05,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-20 10:01:29,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=536514.0, ans=0.025 2023-06-20 10:01:54,091 INFO [train.py:996] (3/4) Epoch 3, batch 28450, loss[loss=0.3366, simple_loss=0.3663, pruned_loss=0.1534, over 21621.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3114, pruned_loss=0.08851, over 4254558.31 frames. ], batch size: 507, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:01:59,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-20 10:02:23,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-20 10:02:51,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=536754.0, ans=0.0 2023-06-20 10:02:52,343 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.55 vs. limit=15.0 2023-06-20 10:03:45,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.03 vs. limit=10.0 2023-06-20 10:04:20,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.862e+02 3.602e+02 4.332e+02 6.960e+02, threshold=7.204e+02, percent-clipped=5.0 2023-06-20 10:04:21,005 INFO [train.py:996] (3/4) Epoch 3, batch 28500, loss[loss=0.2395, simple_loss=0.2968, pruned_loss=0.0911, over 21142.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3139, pruned_loss=0.09154, over 4265048.95 frames. 
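The [scaling.py:962] Whitening records above fire when a module's activation covariance drifts too far from isotropic: the printed metric equals 1.0 for perfectly white features, and the module applies a gradient penalty once it exceeds the limit. One way such a metric can be written (a simplification; the grouped variant implied by num_groups is omitted):

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """How far activations are from a white (isotropic) covariance:
    1.0 exactly for C = sigma^2 * I, larger as channels become correlated
    or unbalanced. Simplified relative to the [scaling.py:962] metric."""
    n, d = x.shape                  # (frames, channels)
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / n
    # d * sum(C^2) / trace(C)^2 == mean squared eigenvalue over squared
    # mean eigenvalue; scale-invariant, minimized at 1.0 when white
    return d * (cov ** 2).sum() / (cov.diag().sum() ** 2)

x = torch.randn(1000, 256)          # nearly white -> metric close to 1.0
print(whitening_metric(x))
x = x @ torch.randn(256, 256)       # mix channels -> metric rises
print(whitening_metric(x))
```

Reading the records this way, "metric=7.03 vs. limit=10.0" means the module is within its budget and no penalty is applied, while values over the limit trigger the corrective gradient.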
], batch size: 608, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:04:53,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=536994.0, ans=0.0 2023-06-20 10:05:42,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=537174.0, ans=0.0 2023-06-20 10:05:54,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=537174.0, ans=0.5 2023-06-20 10:05:54,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=537174.0, ans=0.0 2023-06-20 10:06:02,420 INFO [train.py:996] (3/4) Epoch 3, batch 28550, loss[loss=0.4024, simple_loss=0.4588, pruned_loss=0.173, over 21428.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.323, pruned_loss=0.0946, over 4272965.46 frames. ], batch size: 507, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:06:02,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=537234.0, ans=0.1 2023-06-20 10:06:36,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=537294.0, ans=0.0 2023-06-20 10:07:14,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=537354.0, ans=0.0 2023-06-20 10:07:37,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=537414.0, ans=0.0 2023-06-20 10:07:49,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=537474.0, ans=0.0 2023-06-20 10:08:09,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=537474.0, ans=0.0 2023-06-20 10:08:13,360 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.757e+02 3.378e+02 4.291e+02 7.271e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-20 10:08:13,384 INFO [train.py:996] (3/4) Epoch 3, batch 28600, loss[loss=0.2552, simple_loss=0.3284, pruned_loss=0.09096, over 21471.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3311, pruned_loss=0.09742, over 4279200.98 frames. ], batch size: 194, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:08:15,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.65 vs. limit=15.0 2023-06-20 10:09:22,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=537654.0, ans=0.1 2023-06-20 10:09:55,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=537774.0, ans=0.1 2023-06-20 10:10:05,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.47 vs. limit=15.0 2023-06-20 10:10:12,630 INFO [train.py:996] (3/4) Epoch 3, batch 28650, loss[loss=0.2833, simple_loss=0.3202, pruned_loss=0.1232, over 21220.00 frames. ], tot_loss[loss=0.258, simple_loss=0.324, pruned_loss=0.09603, over 4279389.70 frames. 
], batch size: 471, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 10:10:20,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=537834.0, ans=0.04949747468305833 2023-06-20 10:10:27,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=537894.0, ans=0.2 2023-06-20 10:11:02,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-20 10:11:28,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=538014.0, ans=0.0 2023-06-20 10:12:15,820 INFO [train.py:996] (3/4) Epoch 3, batch 28700, loss[loss=0.2615, simple_loss=0.3325, pruned_loss=0.09528, over 21765.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3226, pruned_loss=0.09743, over 4271235.82 frames. ], batch size: 351, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 10:12:17,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.709e+02 3.341e+02 4.150e+02 7.060e+02, threshold=6.681e+02, percent-clipped=1.0 2023-06-20 10:13:11,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.95 vs. limit=6.0 2023-06-20 10:14:01,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=538374.0, ans=0.2 2023-06-20 10:14:15,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=538374.0, ans=0.125 2023-06-20 10:14:19,921 INFO [train.py:996] (3/4) Epoch 3, batch 28750, loss[loss=0.2315, simple_loss=0.313, pruned_loss=0.075, over 21827.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3244, pruned_loss=0.09835, over 4278723.06 frames. ], batch size: 332, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 10:15:05,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=538554.0, ans=0.125 2023-06-20 10:16:17,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=538734.0, ans=0.125 2023-06-20 10:16:18,020 INFO [train.py:996] (3/4) Epoch 3, batch 28800, loss[loss=0.258, simple_loss=0.3252, pruned_loss=0.09544, over 21608.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3264, pruned_loss=0.09833, over 4276427.28 frames. 
], batch size: 263, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:16:25,197 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.544e+02 3.077e+02 3.520e+02 7.771e+02, threshold=6.153e+02, percent-clipped=2.0 2023-06-20 10:17:26,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=538854.0, ans=0.04949747468305833 2023-06-20 10:17:28,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=538854.0, ans=0.125 2023-06-20 10:17:30,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=538854.0, ans=0.125 2023-06-20 10:17:48,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=538914.0, ans=0.0 2023-06-20 10:18:14,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=538974.0, ans=0.125 2023-06-20 10:18:25,602 INFO [train.py:996] (3/4) Epoch 3, batch 28850, loss[loss=0.2333, simple_loss=0.3032, pruned_loss=0.08164, over 21648.00 frames. ], tot_loss[loss=0.264, simple_loss=0.328, pruned_loss=0.1, over 4285027.66 frames. ], batch size: 263, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:18:29,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=539034.0, ans=0.125 2023-06-20 10:18:53,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=539094.0, ans=0.125 2023-06-20 10:19:16,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=539154.0, ans=0.5 2023-06-20 10:19:23,642 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-20 10:19:24,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=539154.0, ans=0.125 2023-06-20 10:19:53,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=539214.0, ans=0.2 2023-06-20 10:19:55,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=539214.0, ans=0.2 2023-06-20 10:20:23,104 INFO [train.py:996] (3/4) Epoch 3, batch 28900, loss[loss=0.2632, simple_loss=0.3291, pruned_loss=0.09861, over 21731.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3314, pruned_loss=0.1016, over 4283700.40 frames. ], batch size: 298, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:20:24,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.799e+02 3.210e+02 3.937e+02 8.118e+02, threshold=6.420e+02, percent-clipped=2.0 2023-06-20 10:21:14,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-20 10:21:42,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=539514.0, ans=0.125 2023-06-20 10:22:08,519 INFO [train.py:996] (3/4) Epoch 3, batch 28950, loss[loss=0.2464, simple_loss=0.3076, pruned_loss=0.09262, over 20962.00 frames. 
], tot_loss[loss=0.2651, simple_loss=0.33, pruned_loss=0.1001, over 4281008.74 frames. ], batch size: 608, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:23:43,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=539814.0, ans=0.1 2023-06-20 10:24:00,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=539814.0, ans=0.125 2023-06-20 10:24:30,736 INFO [train.py:996] (3/4) Epoch 3, batch 29000, loss[loss=0.2874, simple_loss=0.3627, pruned_loss=0.106, over 21362.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3342, pruned_loss=0.0999, over 4284529.81 frames. ], batch size: 131, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:24:32,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.768e+02 3.396e+02 4.275e+02 6.208e+02, threshold=6.793e+02, percent-clipped=0.0 2023-06-20 10:24:34,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=539934.0, ans=0.2 2023-06-20 10:25:27,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=22.5 2023-06-20 10:25:28,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=540054.0, ans=0.125 2023-06-20 10:26:06,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540174.0, ans=0.1 2023-06-20 10:26:38,119 INFO [train.py:996] (3/4) Epoch 3, batch 29050, loss[loss=0.273, simple_loss=0.3327, pruned_loss=0.1067, over 21826.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3338, pruned_loss=0.0995, over 4287010.22 frames. ], batch size: 441, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:27:13,293 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:27:19,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=540354.0, ans=0.125 2023-06-20 10:27:30,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=540354.0, ans=0.125 2023-06-20 10:27:31,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=540354.0, ans=0.125 2023-06-20 10:28:27,053 INFO [train.py:996] (3/4) Epoch 3, batch 29100, loss[loss=0.2298, simple_loss=0.2892, pruned_loss=0.08518, over 21252.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3241, pruned_loss=0.09641, over 4290458.81 frames. 
2023-06-20 10:28:34,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.763e+02 3.093e+02 3.779e+02 6.198e+02, threshold=6.186e+02, percent-clipped=0.0
2023-06-20 10:28:41,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=540534.0, ans=0.125
2023-06-20 10:28:53,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=540594.0, ans=0.125
2023-06-20 10:29:35,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=22.5
2023-06-20 10:30:09,105 INFO [train.py:996] (3/4) Epoch 3, batch 29150, loss[loss=0.2538, simple_loss=0.3375, pruned_loss=0.08507, over 21780.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.321, pruned_loss=0.09481, over 4283219.86 frames. ], batch size: 316, lr: 1.00e-02, grad_scale: 32.0
2023-06-20 10:30:15,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=540834.0, ans=0.0
2023-06-20 10:31:13,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=541014.0, ans=0.2
2023-06-20 10:31:19,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=541014.0, ans=0.2
2023-06-20 10:31:51,710 INFO [train.py:996] (3/4) Epoch 3, batch 29200, loss[loss=0.2402, simple_loss=0.2959, pruned_loss=0.09231, over 21786.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3167, pruned_loss=0.09372, over 4281978.93 frames. ], batch size: 371, lr: 1.00e-02, grad_scale: 32.0
2023-06-20 10:31:53,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.609e+02 3.171e+02 4.055e+02 6.216e+02, threshold=6.341e+02, percent-clipped=1.0
2023-06-20 10:32:14,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.32 vs. limit=10.0
2023-06-20 10:32:19,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=541194.0, ans=0.05
2023-06-20 10:34:01,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=541374.0, ans=0.125
2023-06-20 10:34:03,921 INFO [train.py:996] (3/4) Epoch 3, batch 29250, loss[loss=0.2642, simple_loss=0.3442, pruned_loss=0.09203, over 21706.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3149, pruned_loss=0.09095, over 4284712.78 frames. ], batch size: 391, lr: 1.00e-02, grad_scale: 32.0
2023-06-20 10:34:08,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=541434.0, ans=0.125
2023-06-20 10:35:27,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=541674.0, ans=0.0
2023-06-20 10:35:47,776 INFO [train.py:996] (3/4) Epoch 3, batch 29300, loss[loss=0.241, simple_loss=0.3016, pruned_loss=0.09024, over 21316.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3159, pruned_loss=0.08956, over 4284931.49 frames. ], batch size: 549, lr: 1.00e-02, grad_scale: 32.0
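The scaling.py:182 lines each report a ScheduledFloat: a hyper-parameter (skip rates, dropout_p, balancer probs, bypass scale bounds) whose value ("ans") is a function of batch_count. A hedged sketch of the piecewise-linear interpolation such a schedule performs; the breakpoints in the example are illustrative, not the recipe's actual schedule:

```python
def scheduled_float(batch_count, points):
    """points: sorted list of (batch_count, value) breakpoints."""
    b0, v0 = points[0]
    if batch_count <= b0:
        return v0
    for b1, v1 in points[1:]:
        if batch_count <= b1:
            # linearly interpolate between the surrounding breakpoints
            t = (batch_count - b0) / (b1 - b0)
            return v0 + t * (v1 - v0)
        b0, v0 = b1, v1
    return v0   # past the last breakpoint the value stays constant

# e.g. a rate annealed from 0.1 to 0.025 over the first 20k batches has long
# since reached its final value at batch_count ~ 540k:
assert scheduled_float(540534.0, [(0.0, 0.1), (20000.0, 0.025)]) == 0.025
```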
2023-06-20 10:36:05,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.416e+02 2.645e+02 3.207e+02 5.648e+02, threshold=5.289e+02, percent-clipped=0.0
2023-06-20 10:36:53,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0
2023-06-20 10:37:50,505 INFO [train.py:996] (3/4) Epoch 3, batch 29350, loss[loss=0.1926, simple_loss=0.2521, pruned_loss=0.06654, over 21470.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3126, pruned_loss=0.08887, over 4284373.67 frames. ], batch size: 195, lr: 1.00e-02, grad_scale: 32.0
2023-06-20 10:37:50,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=542034.0, ans=0.0
2023-06-20 10:38:05,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=542034.0, ans=0.125
2023-06-20 10:39:20,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542214.0, ans=0.1
2023-06-20 10:39:29,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=542214.0, ans=0.09899494936611666
2023-06-20 10:40:03,520 INFO [train.py:996] (3/4) Epoch 3, batch 29400, loss[loss=0.1925, simple_loss=0.2608, pruned_loss=0.06205, over 21768.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3117, pruned_loss=0.08681, over 4267442.10 frames. ], batch size: 282, lr: 1.00e-02, grad_scale: 32.0
2023-06-20 10:40:04,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.619e+02 2.949e+02 3.526e+02 5.601e+02, threshold=5.897e+02, percent-clipped=1.0
2023-06-20 10:41:06,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=542514.0, ans=0.2
2023-06-20 10:41:29,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542574.0, ans=0.1
2023-06-20 10:41:31,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. limit=10.0
2023-06-20 10:41:46,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=542574.0, ans=0.0
2023-06-20 10:42:05,237 INFO [train.py:996] (3/4) Epoch 3, batch 29450, loss[loss=0.272, simple_loss=0.3358, pruned_loss=0.104, over 21744.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3121, pruned_loss=0.08674, over 4267743.02 frames. ], batch size: 247, lr: 1.00e-02, grad_scale: 32.0
2023-06-20 10:42:09,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0
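The fractional frame counts in tot_loss[... over 4284373.67 frames. ] suggest tot_loss is an exponentially decayed running sum rather than a plain total. A hedged sketch of that bookkeeping, assuming (this is an assumption, not confirmed by the log) that reset_interval=200 from the config sets the decay:

```python
def update_tot(tot, cur, reset_interval=200):
    """tot, cur: dicts like {'loss_sum': float, 'frames': float}."""
    decay = 1.0 - 1.0 / reset_interval
    return {k: tot[k] * decay + cur[k] for k in tot}

# the reported tot_loss would then be tot['loss_sum'] / tot['frames'],
# a frame-weighted average over the decayed window
```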
2023-06-20 10:42:17,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=542634.0, ans=0.125
2023-06-20 10:42:25,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=542634.0, ans=0.125
2023-06-20 10:42:57,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=542694.0, ans=0.0
2023-06-20 10:43:55,595 INFO [train.py:996] (3/4) Epoch 3, batch 29500, loss[loss=0.2358, simple_loss=0.3114, pruned_loss=0.08005, over 21797.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3153, pruned_loss=0.08954, over 4274526.98 frames. ], batch size: 351, lr: 1.00e-02, grad_scale: 32.0
2023-06-20 10:43:57,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.681e+02 3.088e+02 3.658e+02 6.266e+02, threshold=6.176e+02, percent-clipped=1.0
2023-06-20 10:43:58,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542934.0, ans=0.1
2023-06-20 10:44:11,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542934.0, ans=0.1
2023-06-20 10:44:46,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.68 vs. limit=22.5
2023-06-20 10:45:16,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=543114.0, ans=0.1
2023-06-20 10:46:06,820 INFO [train.py:996] (3/4) Epoch 3, batch 29550, loss[loss=0.2658, simple_loss=0.325, pruned_loss=0.1033, over 21305.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3156, pruned_loss=0.09139, over 4287352.56 frames. ], batch size: 159, lr: 1.00e-02, grad_scale: 32.0
2023-06-20 10:46:33,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=543294.0, ans=0.0
2023-06-20 10:46:58,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0
2023-06-20 10:48:26,411 INFO [train.py:996] (3/4) Epoch 3, batch 29600, loss[loss=0.367, simple_loss=0.4319, pruned_loss=0.151, over 21517.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.327, pruned_loss=0.0961, over 4282115.55 frames. ], batch size: 471, lr: 1.00e-02, grad_scale: 32.0
2023-06-20 10:48:27,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 3.045e+02 3.811e+02 4.571e+02 9.006e+02, threshold=7.623e+02, percent-clipped=4.0
2023-06-20 10:48:55,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=543594.0, ans=0.0
2023-06-20 10:49:17,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=543594.0, ans=0.125
2023-06-20 10:49:47,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=543714.0, ans=0.125
2023-06-20 10:50:34,288 INFO [train.py:996] (3/4) Epoch 3, batch 29650, loss[loss=0.1625, simple_loss=0.2461, pruned_loss=0.03945, over 21625.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3239, pruned_loss=0.09257, over 4276514.12 frames. ], batch size: 263, lr: 1.00e-02, grad_scale: 32.0
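The scaling.py:962 "Whitening: ... metric=M vs. limit=L" lines compare a whiteness statistic of a module's activations against a scheduled limit (a whitening penalty is only applied when M exceeds L). One plausible formulation of that metric, not necessarily the exact one in icefall's scaling.py: a ratio that is 1.0 when a group's feature covariance is a multiple of the identity and grows as correlations or variance imbalance appear.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    # x: (num_frames, num_channels); channels split into num_groups groups
    n, c = x.shape
    d = c // num_groups
    x = x.reshape(n, num_groups, d).transpose(0, 1)   # (groups, frames, d)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / n      # per-group covariance
    mean_diag = cov.diagonal(dim1=1, dim2=2).mean()   # average variance
    mean_sq = (cov ** 2).mean() * d                   # ~ ||C||_F^2 / d
    # = d * ||C||_F^2 / tr(C)^2 per group: >= 1, with equality iff C = v*I
    return float(mean_sq / (mean_diag ** 2))
```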
2023-06-20 10:51:19,369 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 10:52:17,999 INFO [train.py:996] (3/4) Epoch 3, batch 29700, loss[loss=0.2653, simple_loss=0.3585, pruned_loss=0.08598, over 21628.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3253, pruned_loss=0.09243, over 4277248.31 frames. ], batch size: 230, lr: 1.00e-02, grad_scale: 32.0
2023-06-20 10:52:18,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=544134.0, ans=0.2
2023-06-20 10:52:19,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.285e+02 2.545e+02 2.906e+02 5.391e+02, threshold=5.090e+02, percent-clipped=0.0
2023-06-20 10:52:19,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=544134.0, ans=0.125
2023-06-20 10:52:24,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=544134.0, ans=0.125
2023-06-20 10:53:24,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0
2023-06-20 10:53:52,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=544374.0, ans=0.0
2023-06-20 10:54:13,239 INFO [train.py:996] (3/4) Epoch 3, batch 29750, loss[loss=0.2479, simple_loss=0.3271, pruned_loss=0.08438, over 21890.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3287, pruned_loss=0.09247, over 4271481.15 frames. ], batch size: 316, lr: 1.00e-02, grad_scale: 16.0
2023-06-20 10:54:55,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0
2023-06-20 10:54:55,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=544494.0, ans=15.0
2023-06-20 10:55:13,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=544554.0, ans=0.125
2023-06-20 10:55:25,010 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 10:56:03,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=544674.0, ans=0.125
2023-06-20 10:56:14,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=544674.0, ans=0.125
2023-06-20 10:56:17,042 INFO [train.py:996] (3/4) Epoch 3, batch 29800, loss[loss=0.2507, simple_loss=0.3307, pruned_loss=0.08537, over 21861.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3311, pruned_loss=0.09301, over 4270281.49 frames. ], batch size: 371, lr: 1.00e-02, grad_scale: 16.0
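Note the grad_scale field dropping from 32.0 to 16.0 at batch 29750. With use_fp16=True this is standard dynamic loss scaling in the style of torch.cuda.amp.GradScaler: halve the scale when a step produces inf/nan gradients, grow it back after a run of clean steps (it is back at 32.0 by batch 30000 below). A minimal sketch of that update rule, with the growth interval as an illustrative value:

```python
def update_grad_scale(scale, found_inf, good_steps, growth_interval=2000):
    if found_inf:
        return scale * 0.5, 0      # overflow: back off immediately, skip step
    good_steps += 1
    if good_steps >= growth_interval:
        return scale * 2.0, 0      # long clean run: cautiously grow again
    return scale, good_steps
```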
2023-06-20 10:56:28,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.744e+02 3.274e+02 3.878e+02 6.407e+02, threshold=6.548e+02, percent-clipped=7.0
2023-06-20 10:56:50,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=544794.0, ans=0.0
2023-06-20 10:57:48,964 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0
2023-06-20 10:58:16,257 INFO [train.py:996] (3/4) Epoch 3, batch 29850, loss[loss=0.2338, simple_loss=0.2994, pruned_loss=0.08407, over 21820.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3254, pruned_loss=0.09002, over 4271909.14 frames. ], batch size: 282, lr: 1.00e-02, grad_scale: 16.0
2023-06-20 10:58:41,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=545034.0, ans=0.1
2023-06-20 10:59:16,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=545154.0, ans=0.125
2023-06-20 11:00:11,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=545274.0, ans=0.0
2023-06-20 11:00:15,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545274.0, ans=0.1
2023-06-20 11:00:18,349 INFO [train.py:996] (3/4) Epoch 3, batch 29900, loss[loss=0.3302, simple_loss=0.371, pruned_loss=0.1447, over 21530.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3233, pruned_loss=0.09157, over 4276561.15 frames. ], batch size: 471, lr: 1.00e-02, grad_scale: 16.0
2023-06-20 11:00:21,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.554e+02 2.921e+02 3.179e+02 4.891e+02, threshold=5.842e+02, percent-clipped=0.0
2023-06-20 11:00:28,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=545334.0, ans=0.0
2023-06-20 11:01:01,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=545394.0, ans=0.07
2023-06-20 11:01:14,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=545454.0, ans=0.2
2023-06-20 11:02:07,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=545574.0, ans=0.015
2023-06-20 11:02:27,195 INFO [train.py:996] (3/4) Epoch 3, batch 29950, loss[loss=0.2939, simple_loss=0.3549, pruned_loss=0.1165, over 21614.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3272, pruned_loss=0.09557, over 4281238.31 frames. ], batch size: 415, lr: 9.99e-03, grad_scale: 16.0
2023-06-20 11:02:27,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545634.0, ans=0.1
2023-06-20 11:02:53,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5
2023-06-20 11:02:57,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=545694.0, ans=0.125
2023-06-20 11:04:30,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.12 vs. limit=12.0
2023-06-20 11:04:43,796 INFO [train.py:996] (3/4) Epoch 3, batch 30000, loss[loss=0.2191, simple_loss=0.3068, pruned_loss=0.06568, over 21781.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3287, pruned_loss=0.09501, over 4279488.95 frames. ], batch size: 332, lr: 9.99e-03, grad_scale: 32.0
2023-06-20 11:04:43,797 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-20 11:05:44,594 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2515, simple_loss=0.3537, pruned_loss=0.07464, over 1796401.00 frames.
2023-06-20 11:05:44,596 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-20 11:05:47,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.773e+02 3.132e+02 3.473e+02 5.556e+02, threshold=6.264e+02, percent-clipped=0.0
2023-06-20 11:06:29,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=546054.0, ans=0.125
2023-06-20 11:07:09,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=546174.0, ans=0.2
2023-06-20 11:07:51,366 INFO [train.py:996] (3/4) Epoch 3, batch 30050, loss[loss=0.2784, simple_loss=0.3594, pruned_loss=0.09869, over 21672.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3305, pruned_loss=0.0907, over 4264043.98 frames. ], batch size: 247, lr: 9.99e-03, grad_scale: 32.0
2023-06-20 11:08:42,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=546354.0, ans=0.1
2023-06-20 11:09:03,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=546414.0, ans=0.125
2023-06-20 11:09:04,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=546414.0, ans=0.125
2023-06-20 11:09:33,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=546474.0, ans=0.0
2023-06-20 11:09:36,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.21 vs. limit=10.0
2023-06-20 11:09:38,686 INFO [train.py:996] (3/4) Epoch 3, batch 30100, loss[loss=0.2439, simple_loss=0.3017, pruned_loss=0.09306, over 21496.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3289, pruned_loss=0.09038, over 4261938.80 frames. ], batch size: 132, lr: 9.99e-03, grad_scale: 32.0
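The "Computing validation loss" record fires at batch 30000, a multiple of the configured valid_interval=3000, and the result is a frame-weighted average over the fixed dev set (the same 1796401.00 frames each time). A hedged sketch of that periodic step; compute_loss here is a stand-in for the recipe's actual loss function, not its real signature:

```python
import torch

def validate(model, dev_loader, compute_loss):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += float(loss) * num_frames   # frame-weighted sum
            tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames   # e.g. the "validation: loss=0.2515" line
```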
2023-06-20 11:09:41,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.525e+02 3.109e+02 3.772e+02 7.845e+02, threshold=6.218e+02, percent-clipped=1.0
2023-06-20 11:09:55,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=546534.0, ans=0.2
2023-06-20 11:10:21,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=546654.0, ans=0.125
2023-06-20 11:10:30,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=546654.0, ans=0.125
2023-06-20 11:10:47,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=546714.0, ans=0.125
2023-06-20 11:10:59,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=546774.0, ans=0.05
2023-06-20 11:11:25,730 INFO [train.py:996] (3/4) Epoch 3, batch 30150, loss[loss=0.2848, simple_loss=0.3431, pruned_loss=0.1133, over 21947.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3256, pruned_loss=0.09234, over 4269746.32 frames. ], batch size: 372, lr: 9.98e-03, grad_scale: 32.0
2023-06-20 11:12:00,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=546894.0, ans=0.1
2023-06-20 11:12:21,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.06 vs. limit=6.0
2023-06-20 11:12:58,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=547014.0, ans=0.0
2023-06-20 11:13:43,231 INFO [train.py:996] (3/4) Epoch 3, batch 30200, loss[loss=0.2624, simple_loss=0.3127, pruned_loss=0.1061, over 20114.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3293, pruned_loss=0.09299, over 4271287.91 frames. ], batch size: 702, lr: 9.98e-03, grad_scale: 32.0
2023-06-20 11:13:46,148 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.482e+02 2.831e+02 3.246e+02 4.619e+02, threshold=5.661e+02, percent-clipped=0.0
2023-06-20 11:13:56,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=547134.0, ans=0.125
2023-06-20 11:14:09,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=547194.0, ans=0.0
2023-06-20 11:14:11,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0
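The balancer fields scattered through these records (prob, min_positive, max_positive, min_abs, max_abs) name constraints on per-channel activation statistics: keep the fraction of positive values and the mean absolute value inside given ranges, with "prob"/"ans" being the scheduled constraint strength. A hedged, diagnostic-only reading of what those statistics are (the real Balancer enforces them via gradient modification, which this sketch does not do):

```python
import torch

def balancer_stats(x: torch.Tensor, min_positive=0.05, max_positive=0.95,
                   min_abs=0.2, max_abs=100.0):
    # x: (..., num_channels); collapse all leading dims
    x = x.reshape(-1, x.shape[-1])
    frac_pos = (x > 0).float().mean(dim=0)   # per-channel positive fraction
    mean_abs = x.abs().mean(dim=0)           # per-channel mean magnitude
    return {  # fraction of channels violating each constraint
        "too_negative": (frac_pos < min_positive).float().mean().item(),
        "too_positive": (frac_pos > max_positive).float().mean().item(),
        "too_small": (mean_abs < min_abs).float().mean().item(),
        "too_large": (mean_abs > max_abs).float().mean().item(),
    }
```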
2023-06-20 11:15:24,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=547314.0, ans=0.2
2023-06-20 11:15:57,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=547374.0, ans=0.0
2023-06-20 11:15:58,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=547374.0, ans=0.125
2023-06-20 11:15:59,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547434.0, ans=0.1
2023-06-20 11:16:00,890 INFO [train.py:996] (3/4) Epoch 3, batch 30250, loss[loss=0.3459, simple_loss=0.4269, pruned_loss=0.1325, over 21684.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3377, pruned_loss=0.09583, over 4273334.88 frames. ], batch size: 441, lr: 9.98e-03, grad_scale: 16.0
2023-06-20 11:16:12,312 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=15.0
2023-06-20 11:17:03,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=547554.0, ans=0.1
2023-06-20 11:17:04,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=547614.0, ans=0.0
2023-06-20 11:17:58,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=547674.0, ans=0.2
2023-06-20 11:18:02,148 INFO [train.py:996] (3/4) Epoch 3, batch 30300, loss[loss=0.2217, simple_loss=0.2816, pruned_loss=0.08092, over 21262.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3352, pruned_loss=0.09593, over 4264710.73 frames. ], batch size: 177, lr: 9.97e-03, grad_scale: 16.0
2023-06-20 11:18:06,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.736e+02 3.183e+02 3.962e+02 5.943e+02, threshold=6.366e+02, percent-clipped=1.0
2023-06-20 11:18:19,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=547734.0, ans=0.125
2023-06-20 11:18:49,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=547854.0, ans=0.5
2023-06-20 11:19:14,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=547914.0, ans=0.2
2023-06-20 11:19:24,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=547914.0, ans=0.1
2023-06-20 11:19:44,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=547974.0, ans=0.125
2023-06-20 11:20:01,348 INFO [train.py:996] (3/4) Epoch 3, batch 30350, loss[loss=0.2905, simple_loss=0.3624, pruned_loss=0.1093, over 21855.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3365, pruned_loss=0.09805, over 4269666.96 frames. ], batch size: 372, lr: 9.97e-03, grad_scale: 16.0
2023-06-20 11:20:28,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5
2023-06-20 11:22:57,192 INFO [train.py:996] (3/4) Epoch 3, batch 30400, loss[loss=0.2484, simple_loss=0.278, pruned_loss=0.1094, over 20341.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3312, pruned_loss=0.09639, over 4247330.78 frames. ], batch size: 703, lr: 9.97e-03, grad_scale: 32.0
2023-06-20 11:22:57,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=548334.0, ans=0.2
2023-06-20 11:22:58,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=548334.0, ans=0.0
2023-06-20 11:23:00,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.083e+02 3.599e+02 4.389e+02 8.139e+02, threshold=7.198e+02, percent-clipped=3.0
2023-06-20 11:26:54,478 INFO [train.py:996] (3/4) Epoch 3, batch 30450, loss[loss=0.3319, simple_loss=0.4411, pruned_loss=0.1114, over 19825.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3321, pruned_loss=0.09695, over 4190954.47 frames. ], batch size: 702, lr: 9.97e-03, grad_scale: 32.0
2023-06-20 11:27:33,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=548634.0, ans=0.125
2023-06-20 11:27:33,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=548634.0, ans=0.1
2023-06-20 11:29:15,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=548754.0, ans=0.125
2023-06-20 11:32:21,683 INFO [train.py:996] (3/4) Epoch 4, batch 0, loss[loss=0.3033, simple_loss=0.3403, pruned_loss=0.1332, over 21353.00 frames. ], tot_loss[loss=0.3033, simple_loss=0.3403, pruned_loss=0.1332, over 21353.00 frames. ], batch size: 473, lr: 8.60e-03, grad_scale: 32.0
2023-06-20 11:32:21,684 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-20 11:33:10,764 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2494, simple_loss=0.3589, pruned_loss=0.06994, over 1796401.00 frames.
2023-06-20 11:33:10,765 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-20 11:33:23,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.619e+02 4.575e+02 6.276e+02 9.904e+02 2.096e+03, threshold=1.255e+03, percent-clipped=39.0
2023-06-20 11:34:20,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5
2023-06-20 11:34:53,769 INFO [train.py:996] (3/4) Epoch 4, batch 50, loss[loss=0.2456, simple_loss=0.3373, pruned_loss=0.07701, over 21737.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3234, pruned_loss=0.09181, over 942531.60 frames. ], batch size: 332, lr: 8.60e-03, grad_scale: 32.0
2023-06-20 11:34:54,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=549204.0, ans=0.125
2023-06-20 11:35:30,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=549264.0, ans=0.125
2023-06-20 11:35:56,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0
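At Epoch 4, batch 0 the clipping record spikes (quartiles up to 2.096e+03, threshold=1.255e+03, percent-clipped=39.0), yet the threshold still follows the same rule as within the epoch, twice the median grad norm. The elevated median and 39% clip rate right after the new epoch starts are plausibly the running norm statistics re-warming on fresh, reshuffled batches; that reading is an inference, not something the log states.

```python
# Worked check on the epoch-4, batch-0 record: threshold = 2.0 x median.
assert abs(2.0 * 6.276e+02 - 1.255e+03) < 1.0
```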
2023-06-20 11:35:56,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=15.0
2023-06-20 11:36:23,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=549444.0, ans=0.0
2023-06-20 11:36:57,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=15.0
2023-06-20 11:36:59,301 INFO [train.py:996] (3/4) Epoch 4, batch 100, loss[loss=0.2914, simple_loss=0.3704, pruned_loss=0.1062, over 21674.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3457, pruned_loss=0.09572, over 1683424.78 frames. ], batch size: 389, lr: 8.60e-03, grad_scale: 32.0
2023-06-20 11:37:14,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=549504.0, ans=0.0
2023-06-20 11:37:24,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.415e+02 2.758e+02 3.125e+02 7.692e+02, threshold=5.515e+02, percent-clipped=0.0
2023-06-20 11:38:02,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=549624.0, ans=0.125
2023-06-20 11:38:25,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=549684.0, ans=0.1
2023-06-20 11:38:28,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=549744.0, ans=0.1
2023-06-20 11:38:43,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=549744.0, ans=0.0
2023-06-20 11:38:47,460 INFO [train.py:996] (3/4) Epoch 4, batch 150, loss[loss=0.2662, simple_loss=0.3463, pruned_loss=0.09305, over 21229.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3472, pruned_loss=0.09438, over 2244396.93 frames. ], batch size: 549, lr: 8.59e-03, grad_scale: 32.0
2023-06-20 11:39:11,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=549864.0, ans=0.0
2023-06-20 11:39:17,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0
2023-06-20 11:39:23,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=549924.0, ans=0.125
2023-06-20 11:39:33,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0
2023-06-20 11:40:26,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5
2023-06-20 11:40:46,830 INFO [train.py:996] (3/4) Epoch 4, batch 200, loss[loss=0.2663, simple_loss=0.3349, pruned_loss=0.09886, over 21849.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.344, pruned_loss=0.09506, over 2691847.81 frames. ], batch size: 124, lr: 8.59e-03, grad_scale: 32.0
2023-06-20 11:41:04,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.490e+02 2.754e+02 3.308e+02 4.592e+02, threshold=5.508e+02, percent-clipped=0.0
2023-06-20 11:41:06,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=550164.0, ans=0.125
2023-06-20 11:41:21,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=550224.0, ans=15.0
2023-06-20 11:41:22,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=550224.0, ans=0.125
2023-06-20 11:41:25,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=550224.0, ans=0.125
2023-06-20 11:41:59,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=550284.0, ans=0.125
2023-06-20 11:42:19,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=550344.0, ans=0.95
2023-06-20 11:42:45,749 INFO [train.py:996] (3/4) Epoch 4, batch 250, loss[loss=0.263, simple_loss=0.3536, pruned_loss=0.08621, over 21640.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3411, pruned_loss=0.09415, over 3043260.34 frames. ], batch size: 414, lr: 8.59e-03, grad_scale: 32.0
2023-06-20 11:43:03,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=550404.0, ans=0.125
2023-06-20 11:43:30,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=550464.0, ans=0.125
2023-06-20 11:44:33,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=550584.0, ans=0.0
2023-06-20 11:44:47,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=550644.0, ans=0.0
2023-06-20 11:45:08,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=550704.0, ans=0.0
2023-06-20 11:45:28,959 INFO [train.py:996] (3/4) Epoch 4, batch 300, loss[loss=0.3173, simple_loss=0.3703, pruned_loss=0.1322, over 21429.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3352, pruned_loss=0.09264, over 3307608.74 frames. ], batch size: 471, lr: 8.59e-03, grad_scale: 32.0
2023-06-20 11:45:31,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=550704.0, ans=0.025
2023-06-20 11:45:45,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=550704.0, ans=0.0
2023-06-20 11:45:47,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.545e+02 3.030e+02 3.596e+02 5.664e+02, threshold=6.060e+02, percent-clipped=1.0
2023-06-20 11:46:27,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=550824.0, ans=0.0
2023-06-20 11:47:39,859 INFO [train.py:996] (3/4) Epoch 4, batch 350, loss[loss=0.2235, simple_loss=0.3129, pruned_loss=0.06703, over 21748.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3274, pruned_loss=0.09159, over 3520958.11 frames. ], batch size: 351, lr: 8.59e-03, grad_scale: 32.0
2023-06-20 11:48:06,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0
2023-06-20 11:48:32,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=551064.0, ans=0.1
2023-06-20 11:48:32,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=551064.0, ans=0.125
2023-06-20 11:49:10,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=551184.0, ans=0.125
2023-06-20 11:50:13,571 INFO [train.py:996] (3/4) Epoch 4, batch 400, loss[loss=0.2206, simple_loss=0.2929, pruned_loss=0.07414, over 21662.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3177, pruned_loss=0.08957, over 3682512.33 frames. ], batch size: 298, lr: 8.58e-03, grad_scale: 32.0
2023-06-20 11:50:36,582 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.584e+02 2.891e+02 3.548e+02 6.771e+02, threshold=5.782e+02, percent-clipped=1.0
2023-06-20 11:50:39,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=551364.0, ans=0.035
2023-06-20 11:51:28,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0
2023-06-20 11:52:21,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=551544.0, ans=0.1
2023-06-20 11:52:22,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=551544.0, ans=0.5
2023-06-20 11:52:30,845 INFO [train.py:996] (3/4) Epoch 4, batch 450, loss[loss=0.2103, simple_loss=0.2695, pruned_loss=0.07554, over 21303.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3157, pruned_loss=0.08752, over 3824031.92 frames. ], batch size: 177, lr: 8.58e-03, grad_scale: 32.0
2023-06-20 11:52:54,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=551664.0, ans=0.125
2023-06-20 11:53:37,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=551724.0, ans=0.0
2023-06-20 11:53:51,617 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5
2023-06-20 11:53:53,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=551784.0, ans=0.0
2023-06-20 11:53:55,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=551784.0, ans=0.125
2023-06-20 11:54:29,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=551844.0, ans=0.0
2023-06-20 11:54:31,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=551904.0, ans=0.2
2023-06-20 11:54:32,206 INFO [train.py:996] (3/4) Epoch 4, batch 500, loss[loss=0.2032, simple_loss=0.2986, pruned_loss=0.05392, over 21644.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3177, pruned_loss=0.08627, over 3927553.18 frames. ], batch size: 263, lr: 8.58e-03, grad_scale: 32.0
2023-06-20 11:54:57,395 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.704e+02 3.098e+02 4.611e+02 6.929e+02, threshold=6.196e+02, percent-clipped=8.0
2023-06-20 11:55:23,763 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.21 vs. limit=15.0
2023-06-20 11:55:38,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=552024.0, ans=10.0
2023-06-20 11:56:22,614 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 11:56:23,591 INFO [train.py:996] (3/4) Epoch 4, batch 550, loss[loss=0.2403, simple_loss=0.3149, pruned_loss=0.0829, over 21385.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3192, pruned_loss=0.08604, over 4013195.12 frames. ], batch size: 194, lr: 8.58e-03, grad_scale: 32.0
2023-06-20 11:57:28,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=552324.0, ans=0.125
2023-06-20 11:57:34,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.75 vs. limit=15.0
2023-06-20 11:57:41,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552324.0, ans=0.1
2023-06-20 11:58:22,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=552444.0, ans=0.2
2023-06-20 11:58:45,244 INFO [train.py:996] (3/4) Epoch 4, batch 600, loss[loss=0.2269, simple_loss=0.2943, pruned_loss=0.07975, over 21813.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3233, pruned_loss=0.08733, over 4076250.66 frames. ], batch size: 118, lr: 8.57e-03, grad_scale: 32.0
2023-06-20 11:58:58,258 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.831e+02 3.316e+02 4.076e+02 6.310e+02, threshold=6.631e+02, percent-clipped=1.0
2023-06-20 11:59:48,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=552684.0, ans=0.125
2023-06-20 11:59:51,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=552684.0, ans=0.0
2023-06-20 11:59:54,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0
2023-06-20 12:00:29,237 INFO [train.py:996] (3/4) Epoch 4, batch 650, loss[loss=0.2244, simple_loss=0.2875, pruned_loss=0.08061, over 21683.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3235, pruned_loss=0.08734, over 4130645.11 frames. ], batch size: 333, lr: 8.57e-03, grad_scale: 16.0
2023-06-20 12:00:58,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=552864.0, ans=0.125
2023-06-20 12:01:06,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=552864.0, ans=0.0
2023-06-20 12:02:45,253 INFO [train.py:996] (3/4) Epoch 4, batch 700, loss[loss=0.2208, simple_loss=0.3012, pruned_loss=0.07016, over 21761.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3211, pruned_loss=0.08687, over 4161181.89 frames. ], batch size: 351, lr: 8.57e-03, grad_scale: 16.0
2023-06-20 12:02:50,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.55 vs. limit=15.0
2023-06-20 12:02:52,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=553104.0, ans=0.0
2023-06-20 12:02:59,440 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.784e+02 3.467e+02 4.888e+02 7.822e+02, threshold=6.935e+02, percent-clipped=3.0
2023-06-20 12:04:28,800 INFO [train.py:996] (3/4) Epoch 4, batch 750, loss[loss=0.2335, simple_loss=0.302, pruned_loss=0.08253, over 21727.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3207, pruned_loss=0.08731, over 4189021.44 frames. ], batch size: 298, lr: 8.57e-03, grad_scale: 16.0
2023-06-20 12:04:35,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=553404.0, ans=0.125
2023-06-20 12:05:07,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=553464.0, ans=0.2
2023-06-20 12:05:16,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553524.0, ans=0.1
2023-06-20 12:05:46,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=553644.0, ans=0.125
2023-06-20 12:06:12,195 INFO [train.py:996] (3/4) Epoch 4, batch 800, loss[loss=0.2669, simple_loss=0.3533, pruned_loss=0.0902, over 21414.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3186, pruned_loss=0.08791, over 4198284.41 frames. ], batch size: 548, lr: 8.56e-03, grad_scale: 32.0
2023-06-20 12:06:37,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=553704.0, ans=0.125
2023-06-20 12:06:43,595 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.671e+02 3.153e+02 3.774e+02 5.879e+02, threshold=6.307e+02, percent-clipped=0.0
2023-06-20 12:07:29,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553824.0, ans=0.1
2023-06-20 12:07:33,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0
2023-06-20 12:07:38,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553884.0, ans=0.1
2023-06-20 12:08:14,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=553944.0, ans=0.125
2023-06-20 12:08:29,752 INFO [train.py:996] (3/4) Epoch 4, batch 850, loss[loss=0.2169, simple_loss=0.2743, pruned_loss=0.0798, over 21196.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3163, pruned_loss=0.08771, over 4215053.25 frames. ], batch size: 159, lr: 8.56e-03, grad_scale: 32.0
2023-06-20 12:09:33,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=554184.0, ans=0.0
2023-06-20 12:10:06,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554304.0, ans=0.1
2023-06-20 12:10:07,103 INFO [train.py:996] (3/4) Epoch 4, batch 900, loss[loss=0.2236, simple_loss=0.2958, pruned_loss=0.07573, over 21301.00 frames. ], tot_loss[loss=0.245, simple_loss=0.315, pruned_loss=0.08756, over 4229085.26 frames. ], batch size: 159, lr: 8.56e-03, grad_scale: 32.0
2023-06-20 12:10:36,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0
2023-06-20 12:10:42,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.695e+02 3.057e+02 3.508e+02 5.893e+02, threshold=6.115e+02, percent-clipped=0.0
2023-06-20 12:10:44,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=554364.0, ans=0.125
2023-06-20 12:10:50,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0
2023-06-20 12:12:05,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=554544.0, ans=0.2
2023-06-20 12:12:05,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=554544.0, ans=0.0
2023-06-20 12:12:17,115 INFO [train.py:996] (3/4) Epoch 4, batch 950, loss[loss=0.1963, simple_loss=0.2863, pruned_loss=0.05318, over 21771.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3139, pruned_loss=0.08706, over 4240095.68 frames. ], batch size: 282, lr: 8.56e-03, grad_scale: 32.0
2023-06-20 12:12:17,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=554604.0, ans=0.0
2023-06-20 12:12:39,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=554664.0, ans=0.2
2023-06-20 12:13:10,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=554724.0, ans=0.04949747468305833
2023-06-20 12:13:22,826 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 12:13:53,070 INFO [train.py:996] (3/4) Epoch 4, batch 1000, loss[loss=0.2508, simple_loss=0.3264, pruned_loss=0.08759, over 21895.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.313, pruned_loss=0.08679, over 4247164.62 frames. ], batch size: 316, lr: 8.56e-03, grad_scale: 32.0
2023-06-20 12:13:57,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=554904.0, ans=0.2
2023-06-20 12:14:14,186 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.396e+02 2.779e+02 3.341e+02 4.374e+02, threshold=5.558e+02, percent-clipped=0.0
2023-06-20 12:14:45,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0
2023-06-20 12:15:16,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=555084.0, ans=0.125
2023-06-20 12:15:40,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0
2023-06-20 12:15:57,392 INFO [train.py:996] (3/4) Epoch 4, batch 1050, loss[loss=0.3148, simple_loss=0.4177, pruned_loss=0.1059, over 20831.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3157, pruned_loss=0.08829, over 4257253.63 frames. ], batch size: 608, lr: 8.55e-03, grad_scale: 32.0
2023-06-20 12:16:32,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=15.0
2023-06-20 12:17:08,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=555384.0, ans=0.1
2023-06-20 12:18:00,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=555444.0, ans=0.2
2023-06-20 12:18:12,961 INFO [train.py:996] (3/4) Epoch 4, batch 1100, loss[loss=0.2162, simple_loss=0.3004, pruned_loss=0.06601, over 21801.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3179, pruned_loss=0.08827, over 4258890.56 frames. ], batch size: 247, lr: 8.55e-03, grad_scale: 16.0
2023-06-20 12:18:29,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.724e+02 3.414e+02 3.990e+02 8.036e+02, threshold=6.829e+02, percent-clipped=6.0
2023-06-20 12:19:51,992 INFO [train.py:996] (3/4) Epoch 4, batch 1150, loss[loss=0.1892, simple_loss=0.2745, pruned_loss=0.05193, over 21401.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3185, pruned_loss=0.0882, over 4268000.04 frames. ], batch size: 211, lr: 8.55e-03, grad_scale: 16.0
2023-06-20 12:19:55,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=555804.0, ans=0.125
2023-06-20 12:20:46,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=555864.0, ans=0.125
2023-06-20 12:22:03,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5
2023-06-20 12:22:03,439 INFO [train.py:996] (3/4) Epoch 4, batch 1200, loss[loss=0.2571, simple_loss=0.3332, pruned_loss=0.09051, over 21741.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3197, pruned_loss=0.08915, over 4275771.03 frames. ], batch size: 282, lr: 8.55e-03, grad_scale: 32.0
2023-06-20 12:22:12,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0
2023-06-20 12:22:32,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=556164.0, ans=0.125
2023-06-20 12:22:33,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5
2023-06-20 12:22:33,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.663e+02 2.998e+02 3.388e+02 6.421e+02, threshold=5.997e+02, percent-clipped=0.0
2023-06-20 12:22:38,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=556164.0, ans=0.125
2023-06-20 12:23:13,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=556224.0, ans=0.1
2023-06-20 12:24:08,859 INFO [train.py:996] (3/4) Epoch 4, batch 1250, loss[loss=0.2531, simple_loss=0.3188, pruned_loss=0.09375, over 21888.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3229, pruned_loss=0.09121, over 4282742.47 frames. ], batch size: 118, lr: 8.54e-03, grad_scale: 32.0
2023-06-20 12:24:45,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=556464.0, ans=0.125
2023-06-20 12:25:06,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=556524.0, ans=0.0
2023-06-20 12:26:17,531 INFO [train.py:996] (3/4) Epoch 4, batch 1300, loss[loss=0.2466, simple_loss=0.3279, pruned_loss=0.08268, over 21632.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3229, pruned_loss=0.09149, over 4285395.68 frames. ], batch size: 230, lr: 8.54e-03, grad_scale: 32.0
2023-06-20 12:26:43,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.594e+02 2.933e+02 3.629e+02 6.395e+02, threshold=5.867e+02, percent-clipped=1.0
2023-06-20 12:28:25,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=557004.0, ans=0.04949747468305833
2023-06-20 12:28:26,530 INFO [train.py:996] (3/4) Epoch 4, batch 1350, loss[loss=0.267, simple_loss=0.329, pruned_loss=0.1025, over 21855.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3243, pruned_loss=0.09217, over 4289428.73 frames. ], batch size: 124, lr: 8.54e-03, grad_scale: 32.0
2023-06-20 12:28:35,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=557004.0, ans=0.0
2023-06-20 12:29:43,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=557184.0, ans=0.0
2023-06-20 12:30:31,699 INFO [train.py:996] (3/4) Epoch 4, batch 1400, loss[loss=0.2268, simple_loss=0.2846, pruned_loss=0.08448, over 21897.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3213, pruned_loss=0.0913, over 4277377.31 frames. ], batch size: 373, lr: 8.54e-03, grad_scale: 32.0
2023-06-20 12:30:33,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=557304.0, ans=0.125
2023-06-20 12:30:58,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.685e+02 2.934e+02 3.988e+02 7.934e+02, threshold=5.867e+02, percent-clipped=7.0
2023-06-20 12:32:45,058 INFO [train.py:996] (3/4) Epoch 4, batch 1450, loss[loss=0.2825, simple_loss=0.3456, pruned_loss=0.1098, over 21579.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3213, pruned_loss=0.09167, over 4267471.50 frames. ], batch size: 471, lr: 8.54e-03, grad_scale: 32.0
2023-06-20 12:33:19,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0
2023-06-20 12:34:16,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=557844.0, ans=0.025
2023-06-20 12:34:22,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0
2023-06-20 12:34:43,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=557844.0, ans=0.125
2023-06-20 12:34:53,736 INFO [train.py:996] (3/4) Epoch 4, batch 1500, loss[loss=0.3009, simple_loss=0.3745, pruned_loss=0.1136, over 21562.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3245, pruned_loss=0.09321, over 4272719.00 frames. ], batch size: 471, lr: 8.53e-03, grad_scale: 32.0
2023-06-20 12:35:05,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=557904.0, ans=0.1
2023-06-20 12:35:16,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=557964.0, ans=0.2
2023-06-20 12:35:18,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 2.667e+02 3.053e+02 3.442e+02 4.744e+02, threshold=6.106e+02, percent-clipped=0.0
2023-06-20 12:36:06,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=558024.0, ans=0.125
2023-06-20 12:36:19,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0
2023-06-20 12:36:22,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0
2023-06-20 12:36:25,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=558084.0, ans=0.125
2023-06-20 12:36:51,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=558144.0, ans=0.125
2023-06-20 12:37:17,133 INFO [train.py:996] (3/4) Epoch 4, batch 1550, loss[loss=0.2145, simple_loss=0.3005, pruned_loss=0.06429, over 21791.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.321, pruned_loss=0.09114, over 4275917.82 frames. ], batch size: 371, lr: 8.53e-03, grad_scale: 32.0
2023-06-20 12:37:25,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0
2023-06-20 12:37:41,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=558264.0, ans=0.0
2023-06-20 12:38:09,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0
limit=15.0 2023-06-20 12:38:28,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=558384.0, ans=0.2 2023-06-20 12:38:45,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=558444.0, ans=0.1 2023-06-20 12:39:12,244 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:39:13,417 INFO [train.py:996] (3/4) Epoch 4, batch 1600, loss[loss=0.2532, simple_loss=0.3249, pruned_loss=0.09074, over 21315.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3193, pruned_loss=0.08951, over 4282051.95 frames. ], batch size: 176, lr: 8.53e-03, grad_scale: 32.0 2023-06-20 12:39:15,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=558504.0, ans=0.95 2023-06-20 12:39:29,431 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.678e+02 3.179e+02 3.660e+02 5.340e+02, threshold=6.358e+02, percent-clipped=0.0 2023-06-20 12:40:21,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=558684.0, ans=0.125 2023-06-20 12:41:19,057 INFO [train.py:996] (3/4) Epoch 4, batch 1650, loss[loss=0.2199, simple_loss=0.2888, pruned_loss=0.07547, over 21297.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3167, pruned_loss=0.08789, over 4266606.37 frames. ], batch size: 159, lr: 8.53e-03, grad_scale: 32.0 2023-06-20 12:41:21,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-20 12:42:23,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=558924.0, ans=0.0 2023-06-20 12:42:28,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-20 12:42:30,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=558924.0, ans=0.0 2023-06-20 12:42:38,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=558984.0, ans=0.2 2023-06-20 12:43:10,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=559044.0, ans=0.0 2023-06-20 12:43:21,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=559044.0, ans=0.125 2023-06-20 12:43:30,334 INFO [train.py:996] (3/4) Epoch 4, batch 1700, loss[loss=0.2472, simple_loss=0.316, pruned_loss=0.08918, over 21389.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3206, pruned_loss=0.08987, over 4273751.18 frames. 
], batch size: 131, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:43:38,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=559104.0, ans=0.0 2023-06-20 12:43:46,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=559104.0, ans=0.5 2023-06-20 12:43:58,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.589e+02 2.821e+02 3.325e+02 4.601e+02, threshold=5.642e+02, percent-clipped=0.0 2023-06-20 12:44:41,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=559224.0, ans=0.2 2023-06-20 12:45:26,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=559344.0, ans=0.2 2023-06-20 12:45:31,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=559344.0, ans=0.125 2023-06-20 12:45:45,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=559344.0, ans=0.125 2023-06-20 12:45:49,023 INFO [train.py:996] (3/4) Epoch 4, batch 1750, loss[loss=0.2384, simple_loss=0.3271, pruned_loss=0.07485, over 21609.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3201, pruned_loss=0.08876, over 4268132.50 frames. ], batch size: 441, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:45:52,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=559404.0, ans=0.1 2023-06-20 12:46:40,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=559464.0, ans=0.125 2023-06-20 12:47:39,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=559584.0, ans=0.125 2023-06-20 12:47:42,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=559644.0, ans=0.125 2023-06-20 12:47:43,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=559644.0, ans=0.125 2023-06-20 12:48:08,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=559644.0, ans=0.125 2023-06-20 12:48:18,427 INFO [train.py:996] (3/4) Epoch 4, batch 1800, loss[loss=0.2635, simple_loss=0.3335, pruned_loss=0.09669, over 21535.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.316, pruned_loss=0.08526, over 4270243.02 frames. ], batch size: 441, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:48:34,542 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.695e+02 3.353e+02 3.851e+02 6.055e+02, threshold=6.706e+02, percent-clipped=1.0 2023-06-20 12:49:29,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=22.5 2023-06-20 12:50:15,943 INFO [train.py:996] (3/4) Epoch 4, batch 1850, loss[loss=0.2525, simple_loss=0.3395, pruned_loss=0.08281, over 21669.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3179, pruned_loss=0.0845, over 4267809.11 frames. 
], batch size: 389, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:50:23,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=560004.0, ans=0.125 2023-06-20 12:50:47,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-20 12:50:48,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=560064.0, ans=0.0 2023-06-20 12:51:01,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=560124.0, ans=0.0 2023-06-20 12:51:20,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=560184.0, ans=0.04949747468305833 2023-06-20 12:52:18,570 INFO [train.py:996] (3/4) Epoch 4, batch 1900, loss[loss=0.2253, simple_loss=0.2981, pruned_loss=0.07624, over 21460.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3171, pruned_loss=0.08434, over 4271141.26 frames. ], batch size: 194, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:52:22,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2023-06-20 12:52:40,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.526e+02 2.932e+02 3.560e+02 6.916e+02, threshold=5.863e+02, percent-clipped=1.0 2023-06-20 12:53:29,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=560484.0, ans=0.125 2023-06-20 12:54:36,915 INFO [train.py:996] (3/4) Epoch 4, batch 1950, loss[loss=0.2271, simple_loss=0.2833, pruned_loss=0.08551, over 21449.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3148, pruned_loss=0.08434, over 4272739.77 frames. ], batch size: 389, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:54:44,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=15.0 2023-06-20 12:55:13,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=560664.0, ans=0.125 2023-06-20 12:55:15,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-20 12:55:16,971 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.61 vs. limit=6.0 2023-06-20 12:56:40,390 INFO [train.py:996] (3/4) Epoch 4, batch 2000, loss[loss=0.2311, simple_loss=0.3096, pruned_loss=0.0763, over 21566.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3107, pruned_loss=0.08216, over 4277431.37 frames. ], batch size: 212, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:57:18,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.659e+02 3.074e+02 3.933e+02 7.372e+02, threshold=6.149e+02, percent-clipped=6.0 2023-06-20 12:57:24,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=560964.0, ans=0.0 2023-06-20 12:58:37,540 INFO [train.py:996] (3/4) Epoch 4, batch 2050, loss[loss=0.2496, simple_loss=0.3292, pruned_loss=0.08496, over 17137.00 frames. 
], tot_loss[loss=0.2368, simple_loss=0.3104, pruned_loss=0.0816, over 4263741.76 frames. ], batch size: 60, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:59:36,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=561324.0, ans=0.1 2023-06-20 12:59:43,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=561324.0, ans=0.125 2023-06-20 13:00:48,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.71 vs. limit=6.0 2023-06-20 13:00:51,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=561504.0, ans=0.0 2023-06-20 13:01:09,993 INFO [train.py:996] (3/4) Epoch 4, batch 2100, loss[loss=0.2641, simple_loss=0.3289, pruned_loss=0.09966, over 20702.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3147, pruned_loss=0.08416, over 4272182.13 frames. ], batch size: 607, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 13:01:37,540 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.531e+02 2.915e+02 3.438e+02 5.326e+02, threshold=5.830e+02, percent-clipped=0.0 2023-06-20 13:02:25,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=561684.0, ans=0.1 2023-06-20 13:02:27,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=561684.0, ans=15.0 2023-06-20 13:03:06,315 INFO [train.py:996] (3/4) Epoch 4, batch 2150, loss[loss=0.2398, simple_loss=0.3027, pruned_loss=0.08842, over 21611.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3141, pruned_loss=0.08541, over 4268824.18 frames. ], batch size: 247, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:03:29,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. 
limit=15.0 2023-06-20 13:03:55,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=561924.0, ans=0.125 2023-06-20 13:04:06,725 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:04:47,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=561984.0, ans=0.2 2023-06-20 13:04:54,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=561984.0, ans=0.125 2023-06-20 13:04:58,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=562044.0, ans=0.2 2023-06-20 13:05:01,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=562044.0, ans=0.125 2023-06-20 13:05:08,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=562044.0, ans=10.0 2023-06-20 13:05:31,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=562104.0, ans=0.125 2023-06-20 13:05:32,864 INFO [train.py:996] (3/4) Epoch 4, batch 2200, loss[loss=0.2415, simple_loss=0.308, pruned_loss=0.08756, over 21787.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3163, pruned_loss=0.08525, over 4270339.95 frames. ], batch size: 112, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:06:00,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.421e+02 2.845e+02 3.269e+02 4.462e+02, threshold=5.690e+02, percent-clipped=0.0 2023-06-20 13:06:02,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=562164.0, ans=0.125 2023-06-20 13:06:06,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=562164.0, ans=0.015 2023-06-20 13:06:57,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=22.5 2023-06-20 13:07:20,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=562344.0, ans=0.125 2023-06-20 13:07:40,841 INFO [train.py:996] (3/4) Epoch 4, batch 2250, loss[loss=0.1995, simple_loss=0.2623, pruned_loss=0.06835, over 21724.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3135, pruned_loss=0.08405, over 4274210.45 frames. ], batch size: 124, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:08:02,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=562404.0, ans=0.2 2023-06-20 13:09:14,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-20 13:09:51,971 INFO [train.py:996] (3/4) Epoch 4, batch 2300, loss[loss=0.243, simple_loss=0.2876, pruned_loss=0.09916, over 21636.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3105, pruned_loss=0.08406, over 4258273.80 frames. 
], batch size: 416, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:10:19,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.850e+02 3.293e+02 4.156e+02 7.467e+02, threshold=6.587e+02, percent-clipped=11.0 2023-06-20 13:11:36,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=22.5 2023-06-20 13:11:47,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.99 vs. limit=15.0 2023-06-20 13:11:49,251 INFO [train.py:996] (3/4) Epoch 4, batch 2350, loss[loss=0.2713, simple_loss=0.3341, pruned_loss=0.1042, over 21915.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3065, pruned_loss=0.08463, over 4252676.09 frames. ], batch size: 372, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:12:50,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=563124.0, ans=0.2 2023-06-20 13:12:51,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=563124.0, ans=0.1 2023-06-20 13:13:16,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=563184.0, ans=0.2 2023-06-20 13:13:22,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=563244.0, ans=0.1 2023-06-20 13:13:24,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=563244.0, ans=0.0 2023-06-20 13:13:27,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-20 13:13:42,854 INFO [train.py:996] (3/4) Epoch 4, batch 2400, loss[loss=0.2694, simple_loss=0.3454, pruned_loss=0.09669, over 21456.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.313, pruned_loss=0.08794, over 4261573.85 frames. ], batch size: 131, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:13:43,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-20 13:14:05,167 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.727e+02 3.231e+02 4.252e+02 7.805e+02, threshold=6.463e+02, percent-clipped=2.0 2023-06-20 13:15:50,102 INFO [train.py:996] (3/4) Epoch 4, batch 2450, loss[loss=0.2186, simple_loss=0.2898, pruned_loss=0.07369, over 21649.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3187, pruned_loss=0.09057, over 4267362.19 frames. ], batch size: 263, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:16:04,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-20 13:16:36,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=563724.0, ans=0.0 2023-06-20 13:17:06,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=563784.0, ans=0.1 2023-06-20 13:17:51,347 INFO [train.py:996] (3/4) Epoch 4, batch 2500, loss[loss=0.2443, simple_loss=0.3218, pruned_loss=0.08345, over 21670.00 frames. 
], tot_loss[loss=0.2455, simple_loss=0.3139, pruned_loss=0.08859, over 4257262.08 frames. ], batch size: 298, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:18:09,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=8.0 2023-06-20 13:18:18,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-20 13:18:25,929 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.483e+02 2.988e+02 3.628e+02 7.218e+02, threshold=5.976e+02, percent-clipped=3.0 2023-06-20 13:18:58,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=564084.0, ans=0.0 2023-06-20 13:19:21,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=564084.0, ans=0.125 2023-06-20 13:20:01,570 INFO [train.py:996] (3/4) Epoch 4, batch 2550, loss[loss=0.2457, simple_loss=0.3078, pruned_loss=0.09182, over 21449.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3134, pruned_loss=0.08669, over 4262564.40 frames. ], batch size: 389, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:20:09,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=564204.0, ans=0.125 2023-06-20 13:21:29,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-20 13:21:45,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=564444.0, ans=0.1 2023-06-20 13:22:14,096 INFO [train.py:996] (3/4) Epoch 4, batch 2600, loss[loss=0.2824, simple_loss=0.3602, pruned_loss=0.1023, over 21412.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3168, pruned_loss=0.08925, over 4265809.89 frames. ], batch size: 131, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:22:25,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=564504.0, ans=0.125 2023-06-20 13:22:30,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=564504.0, ans=0.0 2023-06-20 13:22:35,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.603e+02 3.043e+02 3.665e+02 5.448e+02, threshold=6.087e+02, percent-clipped=0.0 2023-06-20 13:22:53,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.50 vs. limit=15.0 2023-06-20 13:24:04,589 INFO [train.py:996] (3/4) Epoch 4, batch 2650, loss[loss=0.2466, simple_loss=0.3273, pruned_loss=0.08293, over 21832.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3182, pruned_loss=0.09077, over 4271207.20 frames. 
], batch size: 351, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:24:39,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=564864.0, ans=0.5 2023-06-20 13:24:53,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=564924.0, ans=0.0 2023-06-20 13:25:06,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=564924.0, ans=0.125 2023-06-20 13:26:08,581 INFO [train.py:996] (3/4) Epoch 4, batch 2700, loss[loss=0.214, simple_loss=0.2913, pruned_loss=0.06833, over 21813.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3162, pruned_loss=0.08959, over 4275298.01 frames. ], batch size: 333, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:26:37,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=565104.0, ans=0.1 2023-06-20 13:26:50,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=565164.0, ans=10.0 2023-06-20 13:26:51,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.726e+02 3.108e+02 3.967e+02 7.517e+02, threshold=6.217e+02, percent-clipped=3.0 2023-06-20 13:28:02,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=565344.0, ans=0.125 2023-06-20 13:28:36,562 INFO [train.py:996] (3/4) Epoch 4, batch 2750, loss[loss=0.2024, simple_loss=0.252, pruned_loss=0.07645, over 21216.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3174, pruned_loss=0.0909, over 4282200.40 frames. ], batch size: 159, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:29:28,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=565464.0, ans=0.2 2023-06-20 13:29:36,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=565524.0, ans=0.2 2023-06-20 13:30:19,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2023-06-20 13:31:01,576 INFO [train.py:996] (3/4) Epoch 4, batch 2800, loss[loss=0.3924, simple_loss=0.452, pruned_loss=0.1664, over 21426.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3213, pruned_loss=0.09203, over 4277962.75 frames. ], batch size: 507, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:31:03,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=565704.0, ans=0.125 2023-06-20 13:31:24,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.619e+02 3.182e+02 3.699e+02 5.789e+02, threshold=6.364e+02, percent-clipped=0.0 2023-06-20 13:33:03,261 INFO [train.py:996] (3/4) Epoch 4, batch 2850, loss[loss=0.2439, simple_loss=0.3153, pruned_loss=0.08628, over 21719.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3215, pruned_loss=0.09278, over 4284495.10 frames. 
], batch size: 351, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:33:06,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=566004.0, ans=0.2 2023-06-20 13:33:55,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=566124.0, ans=0.125 2023-06-20 13:34:10,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=566124.0, ans=0.125 2023-06-20 13:34:34,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0 2023-06-20 13:35:06,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=566244.0, ans=0.2 2023-06-20 13:35:07,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=566244.0, ans=0.125 2023-06-20 13:35:19,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=566304.0, ans=0.2 2023-06-20 13:35:20,756 INFO [train.py:996] (3/4) Epoch 4, batch 2900, loss[loss=0.2587, simple_loss=0.3535, pruned_loss=0.08191, over 20737.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3179, pruned_loss=0.09149, over 4282752.10 frames. ], batch size: 607, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:35:43,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.738e+02 3.167e+02 3.892e+02 8.808e+02, threshold=6.333e+02, percent-clipped=7.0 2023-06-20 13:37:07,681 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:37:12,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=22.5 2023-06-20 13:37:19,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=566544.0, ans=0.125 2023-06-20 13:37:31,908 INFO [train.py:996] (3/4) Epoch 4, batch 2950, loss[loss=0.2492, simple_loss=0.2895, pruned_loss=0.1045, over 20195.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3197, pruned_loss=0.09161, over 4280214.86 frames. ], batch size: 703, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:37:58,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=566664.0, ans=0.1 2023-06-20 13:38:23,812 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=12.0 2023-06-20 13:39:37,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=566844.0, ans=0.125 2023-06-20 13:39:55,844 INFO [train.py:996] (3/4) Epoch 4, batch 3000, loss[loss=0.286, simple_loss=0.3629, pruned_loss=0.1045, over 21454.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.323, pruned_loss=0.09156, over 4287180.01 frames. 
], batch size: 131, lr: 8.47e-03, grad_scale: 32.0
2023-06-20 13:39:55,844 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-20 13:40:41,245 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.5069, 2.1790, 3.8640, 2.1845], device='cuda:3')
2023-06-20 13:40:43,511 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2581, simple_loss=0.352, pruned_loss=0.08208, over 1796401.00 frames.
2023-06-20 13:40:43,513 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-20 13:41:00,248 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.790e+02 3.201e+02 3.672e+02 6.689e+02, threshold=6.402e+02, percent-clipped=1.0
2023-06-20 13:41:02,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0
2023-06-20 13:41:06,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.85 vs. limit=15.0
2023-06-20 13:41:23,334 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.22 vs. limit=15.0
2023-06-20 13:41:32,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=567024.0, ans=0.125
2023-06-20 13:42:04,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=567144.0, ans=0.1
2023-06-20 13:42:39,648 INFO [train.py:996] (3/4) Epoch 4, batch 3050, loss[loss=0.1961, simple_loss=0.2797, pruned_loss=0.05627, over 21417.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3227, pruned_loss=0.08974, over 4284940.33 frames. ], batch size: 194, lr: 8.46e-03, grad_scale: 32.0
2023-06-20 13:44:38,267 INFO [train.py:996] (3/4) Epoch 4, batch 3100, loss[loss=0.2647, simple_loss=0.3446, pruned_loss=0.09237, over 21773.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3209, pruned_loss=0.08823, over 4283982.22 frames.
], batch size: 414, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:44:41,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=567504.0, ans=0.1 2023-06-20 13:44:44,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=567504.0, ans=0.2 2023-06-20 13:44:44,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=567504.0, ans=0.2 2023-06-20 13:44:50,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=567504.0, ans=0.1 2023-06-20 13:45:06,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.520e+02 2.947e+02 3.627e+02 6.337e+02, threshold=5.895e+02, percent-clipped=0.0 2023-06-20 13:45:19,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=567564.0, ans=0.125 2023-06-20 13:45:19,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=567564.0, ans=0.125 2023-06-20 13:46:18,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=567744.0, ans=0.04949747468305833 2023-06-20 13:46:33,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=567804.0, ans=0.2 2023-06-20 13:46:33,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=567804.0, ans=0.0 2023-06-20 13:46:34,663 INFO [train.py:996] (3/4) Epoch 4, batch 3150, loss[loss=0.2985, simple_loss=0.3654, pruned_loss=0.1158, over 21588.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3229, pruned_loss=0.08861, over 4275024.51 frames. ], batch size: 414, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:46:51,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=12.0 2023-06-20 13:47:05,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=567864.0, ans=0.125 2023-06-20 13:47:23,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=567924.0, ans=0.125 2023-06-20 13:48:40,622 INFO [train.py:996] (3/4) Epoch 4, batch 3200, loss[loss=0.2852, simple_loss=0.3468, pruned_loss=0.1119, over 21563.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.326, pruned_loss=0.08942, over 4278161.23 frames. ], batch size: 471, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:49:09,863 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.444e+02 2.825e+02 3.335e+02 7.265e+02, threshold=5.650e+02, percent-clipped=2.0 2023-06-20 13:49:27,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=568164.0, ans=0.125 2023-06-20 13:49:31,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=568224.0, ans=0.125 2023-06-20 13:50:11,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.18 vs. 
limit=15.0 2023-06-20 13:50:26,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=568344.0, ans=0.04949747468305833 2023-06-20 13:50:34,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=568344.0, ans=0.125 2023-06-20 13:50:50,896 INFO [train.py:996] (3/4) Epoch 4, batch 3250, loss[loss=0.2192, simple_loss=0.2802, pruned_loss=0.07909, over 21465.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3266, pruned_loss=0.09145, over 4283313.25 frames. ], batch size: 230, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:51:17,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=568464.0, ans=0.0 2023-06-20 13:51:25,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=568464.0, ans=0.125 2023-06-20 13:51:38,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=568524.0, ans=0.125 2023-06-20 13:51:46,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=568524.0, ans=0.125 2023-06-20 13:51:49,423 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:52:19,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=568644.0, ans=0.0 2023-06-20 13:52:24,886 INFO [train.py:996] (3/4) Epoch 4, batch 3300, loss[loss=0.2577, simple_loss=0.3215, pruned_loss=0.09695, over 21563.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3226, pruned_loss=0.09202, over 4273775.20 frames. ], batch size: 414, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:52:39,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=568704.0, ans=0.125 2023-06-20 13:52:51,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.805e+02 3.280e+02 3.867e+02 6.254e+02, threshold=6.560e+02, percent-clipped=3.0 2023-06-20 13:53:45,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-20 13:54:16,868 INFO [train.py:996] (3/4) Epoch 4, batch 3350, loss[loss=0.2882, simple_loss=0.3406, pruned_loss=0.1179, over 21391.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3257, pruned_loss=0.09177, over 4268600.87 frames. ], batch size: 159, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:54:53,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=569124.0, ans=0.125 2023-06-20 13:55:00,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=569124.0, ans=0.1 2023-06-20 13:55:02,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. 
limit=15.0 2023-06-20 13:55:16,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=569184.0, ans=0.05 2023-06-20 13:56:09,593 INFO [train.py:996] (3/4) Epoch 4, batch 3400, loss[loss=0.2534, simple_loss=0.34, pruned_loss=0.08339, over 20912.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3238, pruned_loss=0.09216, over 4269030.78 frames. ], batch size: 607, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:56:14,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=569304.0, ans=0.0 2023-06-20 13:56:35,104 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.736e+02 3.073e+02 3.574e+02 7.893e+02, threshold=6.146e+02, percent-clipped=1.0 2023-06-20 13:56:51,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=569424.0, ans=0.2 2023-06-20 13:57:19,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-20 13:57:34,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=569544.0, ans=0.0 2023-06-20 13:57:43,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=569544.0, ans=0.0 2023-06-20 13:57:53,502 INFO [train.py:996] (3/4) Epoch 4, batch 3450, loss[loss=0.3138, simple_loss=0.3841, pruned_loss=0.1217, over 21775.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.319, pruned_loss=0.09133, over 4267003.88 frames. ], batch size: 316, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:57:55,956 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-06-20 13:58:21,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=569664.0, ans=0.125 2023-06-20 13:58:28,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=12.0 2023-06-20 13:58:30,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=569724.0, ans=0.1 2023-06-20 13:59:14,184 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-20 14:00:02,854 INFO [train.py:996] (3/4) Epoch 4, batch 3500, loss[loss=0.3655, simple_loss=0.4056, pruned_loss=0.1627, over 21353.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3287, pruned_loss=0.09534, over 4268690.89 frames. 
], batch size: 507, lr: 8.44e-03, grad_scale: 16.0 2023-06-20 14:00:03,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=569904.0, ans=0.0 2023-06-20 14:00:31,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=569964.0, ans=0.125 2023-06-20 14:00:34,137 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.895e+02 3.377e+02 4.214e+02 8.364e+02, threshold=6.755e+02, percent-clipped=8.0 2023-06-20 14:00:58,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=570024.0, ans=0.0 2023-06-20 14:01:33,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=570084.0, ans=0.2 2023-06-20 14:02:02,765 INFO [train.py:996] (3/4) Epoch 4, batch 3550, loss[loss=0.2265, simple_loss=0.2939, pruned_loss=0.07954, over 21737.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3322, pruned_loss=0.09666, over 4274882.24 frames. ], batch size: 316, lr: 8.44e-03, grad_scale: 16.0 2023-06-20 14:02:48,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-20 14:03:28,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=570384.0, ans=0.125 2023-06-20 14:03:37,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=22.5 2023-06-20 14:03:59,271 INFO [train.py:996] (3/4) Epoch 4, batch 3600, loss[loss=0.2263, simple_loss=0.3057, pruned_loss=0.07344, over 20034.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3253, pruned_loss=0.09531, over 4271726.18 frames. ], batch size: 702, lr: 8.44e-03, grad_scale: 32.0 2023-06-20 14:04:25,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.783e+02 3.214e+02 3.773e+02 6.590e+02, threshold=6.428e+02, percent-clipped=0.0 2023-06-20 14:05:15,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=570684.0, ans=0.1 2023-06-20 14:05:15,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=570684.0, ans=0.0 2023-06-20 14:05:22,452 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:05:55,920 INFO [train.py:996] (3/4) Epoch 4, batch 3650, loss[loss=0.278, simple_loss=0.3552, pruned_loss=0.1004, over 21698.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3264, pruned_loss=0.09531, over 4271592.96 frames. ], batch size: 441, lr: 8.44e-03, grad_scale: 16.0 2023-06-20 14:06:50,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=570924.0, ans=0.95 2023-06-20 14:08:03,200 INFO [train.py:996] (3/4) Epoch 4, batch 3700, loss[loss=0.2432, simple_loss=0.3165, pruned_loss=0.08494, over 21856.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3253, pruned_loss=0.09442, over 4280653.61 frames. ], batch size: 371, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:08:04,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.65 vs. 
limit=12.0 2023-06-20 14:08:21,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=571104.0, ans=0.0 2023-06-20 14:08:41,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.519e+02 3.000e+02 3.598e+02 6.843e+02, threshold=6.000e+02, percent-clipped=1.0 2023-06-20 14:09:16,964 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-20 14:10:11,716 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-20 14:10:13,611 INFO [train.py:996] (3/4) Epoch 4, batch 3750, loss[loss=0.251, simple_loss=0.3237, pruned_loss=0.08917, over 20770.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3234, pruned_loss=0.0936, over 4283250.33 frames. ], batch size: 607, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:11:19,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=571584.0, ans=0.0 2023-06-20 14:11:20,676 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:11:40,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=571644.0, ans=0.125 2023-06-20 14:12:13,301 INFO [train.py:996] (3/4) Epoch 4, batch 3800, loss[loss=0.2413, simple_loss=0.314, pruned_loss=0.08436, over 21940.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3217, pruned_loss=0.09249, over 4287961.20 frames. ], batch size: 316, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:12:45,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.431e+02 2.892e+02 3.588e+02 7.450e+02, threshold=5.785e+02, percent-clipped=3.0 2023-06-20 14:13:07,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=571824.0, ans=0.07 2023-06-20 14:13:09,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-20 14:13:13,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=571824.0, ans=0.125 2023-06-20 14:13:50,989 INFO [train.py:996] (3/4) Epoch 4, batch 3850, loss[loss=0.2313, simple_loss=0.2872, pruned_loss=0.08769, over 21794.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3203, pruned_loss=0.09276, over 4283818.13 frames. ], batch size: 352, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:14:00,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=572004.0, ans=0.1 2023-06-20 14:15:38,373 INFO [train.py:996] (3/4) Epoch 4, batch 3900, loss[loss=0.2446, simple_loss=0.3037, pruned_loss=0.09278, over 21856.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3161, pruned_loss=0.09242, over 4286336.93 frames. 
], batch size: 414, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:15:57,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=572304.0, ans=0.025 2023-06-20 14:16:05,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=572304.0, ans=0.125 2023-06-20 14:16:16,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=572364.0, ans=0.0 2023-06-20 14:16:17,380 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.553e+02 2.993e+02 3.640e+02 7.151e+02, threshold=5.987e+02, percent-clipped=1.0 2023-06-20 14:16:36,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=572424.0, ans=0.1 2023-06-20 14:16:48,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=572484.0, ans=0.125 2023-06-20 14:16:48,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-20 14:17:00,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=572484.0, ans=0.125 2023-06-20 14:17:06,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.03 vs. limit=15.0 2023-06-20 14:17:39,768 INFO [train.py:996] (3/4) Epoch 4, batch 3950, loss[loss=0.1852, simple_loss=0.2546, pruned_loss=0.05793, over 21195.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3185, pruned_loss=0.09113, over 4285353.00 frames. ], batch size: 159, lr: 8.42e-03, grad_scale: 16.0 2023-06-20 14:18:27,426 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:19:47,312 INFO [train.py:996] (3/4) Epoch 4, batch 4000, loss[loss=0.1676, simple_loss=0.2462, pruned_loss=0.04447, over 21584.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3107, pruned_loss=0.08689, over 4281869.17 frames. ], batch size: 230, lr: 8.42e-03, grad_scale: 32.0 2023-06-20 14:19:47,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=572904.0, ans=0.04949747468305833 2023-06-20 14:19:52,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=572904.0, ans=0.125 2023-06-20 14:20:13,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.382e+02 2.703e+02 3.231e+02 4.558e+02, threshold=5.407e+02, percent-clipped=0.0 2023-06-20 14:20:17,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=572964.0, ans=0.125 2023-06-20 14:20:59,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=573144.0, ans=0.125 2023-06-20 14:21:28,560 INFO [train.py:996] (3/4) Epoch 4, batch 4050, loss[loss=0.2225, simple_loss=0.2965, pruned_loss=0.07431, over 21764.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3101, pruned_loss=0.08497, over 4283218.72 frames. 
], batch size: 247, lr: 8.42e-03, grad_scale: 32.0 2023-06-20 14:22:16,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=573264.0, ans=0.125 2023-06-20 14:22:28,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=573324.0, ans=0.125 2023-06-20 14:22:35,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=573324.0, ans=0.125 2023-06-20 14:22:49,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.41 vs. limit=22.5 2023-06-20 14:23:32,508 INFO [train.py:996] (3/4) Epoch 4, batch 4100, loss[loss=0.2204, simple_loss=0.2971, pruned_loss=0.07185, over 21625.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3112, pruned_loss=0.08547, over 4290706.52 frames. ], batch size: 230, lr: 8.42e-03, grad_scale: 32.0 2023-06-20 14:24:01,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-20 14:24:14,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.729e+02 3.089e+02 3.698e+02 6.441e+02, threshold=6.178e+02, percent-clipped=7.0 2023-06-20 14:24:15,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=573564.0, ans=0.125 2023-06-20 14:24:20,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-20 14:24:54,558 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.11 vs. limit=10.0 2023-06-20 14:24:58,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=573684.0, ans=0.125 2023-06-20 14:25:28,426 INFO [train.py:996] (3/4) Epoch 4, batch 4150, loss[loss=0.1824, simple_loss=0.2751, pruned_loss=0.04483, over 21360.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3093, pruned_loss=0.08206, over 4284986.28 frames. ], batch size: 194, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:25:38,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=573804.0, ans=0.1 2023-06-20 14:27:29,741 INFO [train.py:996] (3/4) Epoch 4, batch 4200, loss[loss=0.2974, simple_loss=0.377, pruned_loss=0.1089, over 21539.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3108, pruned_loss=0.08218, over 4290182.14 frames. ], batch size: 441, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:27:36,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=574104.0, ans=0.0 2023-06-20 14:27:49,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=574164.0, ans=0.125 2023-06-20 14:27:50,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.48 vs. 
limit=15.0 2023-06-20 14:27:56,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 2.492e+02 2.868e+02 3.338e+02 5.959e+02, threshold=5.736e+02, percent-clipped=0.0 2023-06-20 14:28:55,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=574284.0, ans=0.0 2023-06-20 14:29:20,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=574344.0, ans=0.125 2023-06-20 14:29:23,353 INFO [train.py:996] (3/4) Epoch 4, batch 4250, loss[loss=0.3183, simple_loss=0.3858, pruned_loss=0.1253, over 21450.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3169, pruned_loss=0.0842, over 4287103.64 frames. ], batch size: 471, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:29:31,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=574404.0, ans=0.125 2023-06-20 14:30:43,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=574584.0, ans=0.2 2023-06-20 14:30:47,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=574584.0, ans=0.0 2023-06-20 14:31:33,601 INFO [train.py:996] (3/4) Epoch 4, batch 4300, loss[loss=0.2195, simple_loss=0.3058, pruned_loss=0.06658, over 21420.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3235, pruned_loss=0.08643, over 4279924.28 frames. ], batch size: 131, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:31:55,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=574704.0, ans=0.125 2023-06-20 14:32:18,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.663e+02 3.157e+02 3.844e+02 7.898e+02, threshold=6.314e+02, percent-clipped=3.0 2023-06-20 14:32:19,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=574764.0, ans=0.2 2023-06-20 14:32:42,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=574824.0, ans=0.2 2023-06-20 14:32:49,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-20 14:32:50,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=574824.0, ans=0.125 2023-06-20 14:33:14,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=574944.0, ans=0.0 2023-06-20 14:33:35,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2023-06-20 14:33:35,740 INFO [train.py:996] (3/4) Epoch 4, batch 4350, loss[loss=0.1881, simple_loss=0.2529, pruned_loss=0.06164, over 21692.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3213, pruned_loss=0.08497, over 4275535.21 frames. ], batch size: 232, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:35:27,075 INFO [train.py:996] (3/4) Epoch 4, batch 4400, loss[loss=0.2233, simple_loss=0.3115, pruned_loss=0.06752, over 21283.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3169, pruned_loss=0.08423, over 4272875.24 frames. 
], batch size: 176, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:36:06,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.625e+02 3.087e+02 3.539e+02 6.162e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-20 14:36:26,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=15.0 2023-06-20 14:37:37,828 INFO [train.py:996] (3/4) Epoch 4, batch 4450, loss[loss=0.263, simple_loss=0.3454, pruned_loss=0.09026, over 21613.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3239, pruned_loss=0.08634, over 4268579.47 frames. ], batch size: 230, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:37:49,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=575604.0, ans=0.0 2023-06-20 14:38:50,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=575784.0, ans=0.125 2023-06-20 14:39:22,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=575844.0, ans=0.125 2023-06-20 14:39:35,197 INFO [train.py:996] (3/4) Epoch 4, batch 4500, loss[loss=0.2489, simple_loss=0.3272, pruned_loss=0.08533, over 21257.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3248, pruned_loss=0.08855, over 4276840.46 frames. ], batch size: 176, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:39:36,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.49 vs. limit=6.0 2023-06-20 14:39:37,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=575904.0, ans=0.125 2023-06-20 14:40:07,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.623e+02 2.937e+02 3.558e+02 5.301e+02, threshold=5.874e+02, percent-clipped=0.0 2023-06-20 14:41:22,671 INFO [train.py:996] (3/4) Epoch 4, batch 4550, loss[loss=0.2597, simple_loss=0.3336, pruned_loss=0.09285, over 21704.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3278, pruned_loss=0.08884, over 4275901.38 frames. ], batch size: 298, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:42:40,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=576384.0, ans=0.125 2023-06-20 14:43:01,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=576444.0, ans=10.0 2023-06-20 14:43:23,255 INFO [train.py:996] (3/4) Epoch 4, batch 4600, loss[loss=0.2388, simple_loss=0.3167, pruned_loss=0.0804, over 21725.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.33, pruned_loss=0.0904, over 4279184.46 frames. 
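
The ScheduledFloat entries above ("name=..., batch_count=..., ans=...") record hyperparameters (skip rates, balancer probabilities, bypass scales) whose value is a function of the global batch count. A self-contained sketch of a piecewise-linear schedule of that flavor; the schedule points below are made up for illustration and are not the recipe's:

    def scheduled_float(batch_count, points):
        """points: sorted (batch_count, value) pairs; linear in between."""
        x0, y0 = points[0]
        if batch_count <= x0:
            return y0
        for x1, y1 in points[1:]:
            if batch_count <= x1:
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
            x0, y0 = x1, y1
        return y0  # past the last point, hold the final value

    # e.g. a skip rate decaying from 0.1 to 0.0 over the first 20k batches:
    print(scheduled_float(573264.0, [(0.0, 0.1), (20000.0, 0.0)]))  # -> 0.0
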
], batch size: 414, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:43:47,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=576564.0, ans=0.125 2023-06-20 14:43:47,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=576564.0, ans=0.0 2023-06-20 14:43:49,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.820e+02 3.441e+02 4.090e+02 6.361e+02, threshold=6.882e+02, percent-clipped=1.0 2023-06-20 14:44:12,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=576624.0, ans=0.125 2023-06-20 14:44:40,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=576744.0, ans=0.125 2023-06-20 14:44:42,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=576744.0, ans=0.125 2023-06-20 14:44:53,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.97 vs. limit=6.0 2023-06-20 14:44:59,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-20 14:45:01,217 INFO [train.py:996] (3/4) Epoch 4, batch 4650, loss[loss=0.2576, simple_loss=0.3353, pruned_loss=0.09, over 19885.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3232, pruned_loss=0.08871, over 4279293.84 frames. ], batch size: 702, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:45:23,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=12.0 2023-06-20 14:45:24,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=576864.0, ans=0.125 2023-06-20 14:45:35,858 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:45:49,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=576924.0, ans=0.1 2023-06-20 14:45:52,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=576924.0, ans=0.0 2023-06-20 14:46:08,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=576984.0, ans=0.0 2023-06-20 14:46:36,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=577104.0, ans=0.125 2023-06-20 14:46:36,961 INFO [train.py:996] (3/4) Epoch 4, batch 4700, loss[loss=0.2076, simple_loss=0.2701, pruned_loss=0.07255, over 21594.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3148, pruned_loss=0.08638, over 4282554.67 frames. 
], batch size: 247, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:47:09,025 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.385e+02 2.859e+02 3.485e+02 5.797e+02, threshold=5.718e+02, percent-clipped=0.0 2023-06-20 14:47:18,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=577224.0, ans=0.125 2023-06-20 14:47:33,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=577284.0, ans=0.5 2023-06-20 14:48:14,347 INFO [train.py:996] (3/4) Epoch 4, batch 4750, loss[loss=0.242, simple_loss=0.2976, pruned_loss=0.09324, over 21470.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3087, pruned_loss=0.08598, over 4274481.01 frames. ], batch size: 144, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:48:29,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=577404.0, ans=0.1 2023-06-20 14:48:36,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=577464.0, ans=0.2 2023-06-20 14:48:58,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=577524.0, ans=10.0 2023-06-20 14:50:08,454 INFO [train.py:996] (3/4) Epoch 4, batch 4800, loss[loss=0.2355, simple_loss=0.2888, pruned_loss=0.09104, over 21512.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3088, pruned_loss=0.08597, over 4276218.14 frames. ], batch size: 194, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:50:08,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=577704.0, ans=0.125 2023-06-20 14:50:26,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=577704.0, ans=0.09899494936611666 2023-06-20 14:50:45,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.623e+02 2.992e+02 3.457e+02 8.061e+02, threshold=5.984e+02, percent-clipped=2.0 2023-06-20 14:51:08,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=577824.0, ans=0.2 2023-06-20 14:51:20,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=577884.0, ans=0.1 2023-06-20 14:52:00,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=577944.0, ans=0.1 2023-06-20 14:52:02,774 INFO [train.py:996] (3/4) Epoch 4, batch 4850, loss[loss=0.2355, simple_loss=0.3475, pruned_loss=0.06175, over 21289.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3089, pruned_loss=0.08486, over 4275682.72 frames. ], batch size: 548, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:52:51,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.20 vs. 
limit=22.5 2023-06-20 14:53:16,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=578184.0, ans=0.125 2023-06-20 14:53:29,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=578244.0, ans=0.07 2023-06-20 14:53:41,184 INFO [train.py:996] (3/4) Epoch 4, batch 4900, loss[loss=0.3001, simple_loss=0.368, pruned_loss=0.1161, over 21477.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3111, pruned_loss=0.08688, over 4282957.41 frames. ], batch size: 471, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:54:13,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.490e+02 2.860e+02 3.460e+02 5.530e+02, threshold=5.721e+02, percent-clipped=0.0 2023-06-20 14:54:23,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=578424.0, ans=0.1 2023-06-20 14:55:07,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=578544.0, ans=0.0 2023-06-20 14:55:25,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=578604.0, ans=0.125 2023-06-20 14:55:27,079 INFO [train.py:996] (3/4) Epoch 4, batch 4950, loss[loss=0.2644, simple_loss=0.3558, pruned_loss=0.08648, over 21447.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3137, pruned_loss=0.08508, over 4276085.29 frames. ], batch size: 471, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:55:54,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=578664.0, ans=0.1 2023-06-20 14:56:40,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=578784.0, ans=0.125 2023-06-20 14:56:46,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=578844.0, ans=0.0 2023-06-20 14:56:50,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=578844.0, ans=0.2 2023-06-20 14:56:56,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=578844.0, ans=0.125 2023-06-20 14:57:04,932 INFO [train.py:996] (3/4) Epoch 4, batch 5000, loss[loss=0.2122, simple_loss=0.2906, pruned_loss=0.0669, over 21521.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.313, pruned_loss=0.08168, over 4280417.49 frames. ], batch size: 212, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:57:24,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.67 vs. 
limit=22.5 2023-06-20 14:57:31,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.437e+02 2.852e+02 3.653e+02 5.289e+02, threshold=5.704e+02, percent-clipped=0.0 2023-06-20 14:57:36,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=578964.0, ans=0.1 2023-06-20 14:58:10,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=579084.0, ans=0.2 2023-06-20 14:58:36,499 INFO [train.py:996] (3/4) Epoch 4, batch 5050, loss[loss=0.2426, simple_loss=0.3097, pruned_loss=0.08773, over 21887.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3157, pruned_loss=0.08472, over 4288787.79 frames. ], batch size: 351, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:58:54,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=579204.0, ans=0.1 2023-06-20 14:59:21,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=579324.0, ans=0.2 2023-06-20 14:59:42,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.14 vs. limit=15.0 2023-06-20 14:59:50,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=579384.0, ans=0.125 2023-06-20 15:00:01,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=579444.0, ans=0.125 2023-06-20 15:00:18,920 INFO [train.py:996] (3/4) Epoch 4, batch 5100, loss[loss=0.255, simple_loss=0.3205, pruned_loss=0.09471, over 21437.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3139, pruned_loss=0.08513, over 4284563.92 frames. ], batch size: 131, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:00:35,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.82 vs. limit=22.5 2023-06-20 15:00:45,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.522e+02 2.846e+02 3.197e+02 5.068e+02, threshold=5.691e+02, percent-clipped=0.0 2023-06-20 15:00:47,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=579564.0, ans=0.125 2023-06-20 15:01:17,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2023-06-20 15:01:40,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=12.0 2023-06-20 15:01:57,152 INFO [train.py:996] (3/4) Epoch 4, batch 5150, loss[loss=0.2359, simple_loss=0.299, pruned_loss=0.08634, over 21900.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3136, pruned_loss=0.08617, over 4289574.06 frames. 
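
The Whitening entries above compare a per-module "metric" against a limit; the metric behaves like a whiteness score that is 1.0 when the channel covariance is a multiple of the identity and grows as energy concentrates in a few directions. One plausible way to compute such a score is sketched below; this is an editorial guess at the flavor of the check in scaling.py, not its actual formula:

    import torch

    def whitening_metric(x, num_groups=1):
        """x: (frames, channels). Returns ~1.0 for white features."""
        scores = []
        for g in x.chunk(num_groups, dim=-1):
            g = g - g.mean(dim=0)
            cov = (g.T @ g) / g.shape[0]          # channel covariance
            c = cov.shape[0]
            num = (cov * cov).sum()               # trace(cov^2) = sum of eig^2
            den = cov.diagonal().sum() ** 2 / c   # trace(cov)^2 / c
            scores.append(num / den)
        return torch.stack(scores).mean()

    print(whitening_metric(torch.randn(1000, 256)))  # close to 1.0

The logged metric-vs.-limit pairs let one monitor how close each module runs to its constraint.
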
], batch size: 316, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:02:10,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=579804.0, ans=0.1 2023-06-20 15:02:29,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=579864.0, ans=0.1 2023-06-20 15:02:52,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-20 15:03:10,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=579984.0, ans=0.0 2023-06-20 15:03:35,258 INFO [train.py:996] (3/4) Epoch 4, batch 5200, loss[loss=0.2519, simple_loss=0.3561, pruned_loss=0.07386, over 21781.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3141, pruned_loss=0.08631, over 4280895.47 frames. ], batch size: 332, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:04:01,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.678e+02 3.286e+02 3.852e+02 6.243e+02, threshold=6.571e+02, percent-clipped=1.0 2023-06-20 15:05:19,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=580344.0, ans=0.0 2023-06-20 15:05:21,823 INFO [train.py:996] (3/4) Epoch 4, batch 5250, loss[loss=0.2202, simple_loss=0.3027, pruned_loss=0.06884, over 21601.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3178, pruned_loss=0.08506, over 4271513.74 frames. ], batch size: 230, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:05:27,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-20 15:05:42,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=580464.0, ans=0.1 2023-06-20 15:05:42,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=580464.0, ans=0.2 2023-06-20 15:05:54,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=580464.0, ans=0.1 2023-06-20 15:05:58,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.80 vs. limit=15.0 2023-06-20 15:06:09,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=580524.0, ans=0.1 2023-06-20 15:06:09,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-20 15:06:49,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=580644.0, ans=0.125 2023-06-20 15:06:59,109 INFO [train.py:996] (3/4) Epoch 4, batch 5300, loss[loss=0.2588, simple_loss=0.3179, pruned_loss=0.09986, over 21368.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3182, pruned_loss=0.08555, over 4271560.98 frames. 
], batch size: 144, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:07:25,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.445e+02 2.866e+02 3.418e+02 4.898e+02, threshold=5.732e+02, percent-clipped=0.0 2023-06-20 15:07:58,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=580824.0, ans=0.125 2023-06-20 15:08:35,818 INFO [train.py:996] (3/4) Epoch 4, batch 5350, loss[loss=0.2523, simple_loss=0.32, pruned_loss=0.09227, over 21780.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3198, pruned_loss=0.08688, over 4281375.73 frames. ], batch size: 112, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:08:38,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=22.5 2023-06-20 15:08:40,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=581004.0, ans=0.0 2023-06-20 15:09:28,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=581124.0, ans=0.125 2023-06-20 15:09:28,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=581124.0, ans=0.0 2023-06-20 15:09:46,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=581124.0, ans=0.2 2023-06-20 15:09:51,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=581184.0, ans=0.0 2023-06-20 15:10:26,849 INFO [train.py:996] (3/4) Epoch 4, batch 5400, loss[loss=0.2237, simple_loss=0.2966, pruned_loss=0.07538, over 21397.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3187, pruned_loss=0.08847, over 4279173.05 frames. ], batch size: 194, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:10:59,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 2.501e+02 3.019e+02 3.777e+02 8.074e+02, threshold=6.038e+02, percent-clipped=3.0 2023-06-20 15:10:59,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=581364.0, ans=0.0 2023-06-20 15:11:46,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=581484.0, ans=0.125 2023-06-20 15:12:15,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=581544.0, ans=0.95 2023-06-20 15:12:18,264 INFO [train.py:996] (3/4) Epoch 4, batch 5450, loss[loss=0.2574, simple_loss=0.3512, pruned_loss=0.08177, over 21019.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3179, pruned_loss=0.08597, over 4282511.39 frames. ], batch size: 143, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:12:20,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=581604.0, ans=0.125 2023-06-20 15:12:33,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.73 vs. 
limit=22.5 2023-06-20 15:12:42,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=581664.0, ans=0.1 2023-06-20 15:13:56,313 INFO [train.py:996] (3/4) Epoch 4, batch 5500, loss[loss=0.2124, simple_loss=0.313, pruned_loss=0.05587, over 21727.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3229, pruned_loss=0.0827, over 4272400.59 frames. ], batch size: 332, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:14:28,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-20 15:14:28,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 2.356e+02 2.658e+02 3.357e+02 7.374e+02, threshold=5.315e+02, percent-clipped=2.0 2023-06-20 15:14:31,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=22.5 2023-06-20 15:14:39,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=581964.0, ans=0.04949747468305833 2023-06-20 15:14:49,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=22.5 2023-06-20 15:15:24,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-20 15:15:41,308 INFO [train.py:996] (3/4) Epoch 4, batch 5550, loss[loss=0.2769, simple_loss=0.3766, pruned_loss=0.08856, over 21161.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3221, pruned_loss=0.07996, over 4276487.63 frames. ], batch size: 548, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:15:56,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=582204.0, ans=0.125 2023-06-20 15:15:56,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=582204.0, ans=0.125 2023-06-20 15:16:08,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=582264.0, ans=0.05 2023-06-20 15:16:34,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=582324.0, ans=0.125 2023-06-20 15:16:58,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=582384.0, ans=0.125 2023-06-20 15:16:58,704 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.09 vs. limit=15.0 2023-06-20 15:17:01,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=582384.0, ans=0.1 2023-06-20 15:17:04,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=582444.0, ans=0.0 2023-06-20 15:17:04,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=582444.0, ans=0.0 2023-06-20 15:17:26,386 INFO [train.py:996] (3/4) Epoch 4, batch 5600, loss[loss=0.2561, simple_loss=0.3479, pruned_loss=0.08217, over 21785.00 frames. 
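
The per-batch losses above are consistent with a fixed combination of the two transducer losses once warm-up is over: loss = 0.5 * simple_loss + pruned_loss (for the batch 5600 sample above, 0.5 * 0.3479 + 0.08217 = 0.2561). A sketch of that combination; the warm-up ramp for early batches is an assumption, not the recipe's exact code:

    def combine_losses(simple_loss, pruned_loss, batch_idx_train,
                       simple_loss_scale=0.5, warm_step=2000):
        if batch_idx_train >= warm_step:
            s, p = simple_loss_scale, 1.0
        else:  # assumed ramp: lean on the simple loss early in training
            frac = batch_idx_train / warm_step
            s = 1.0 - frac * (1.0 - simple_loss_scale)
            p = 0.1 + 0.9 * frac
        return s * simple_loss + p * pruned_loss

    print(combine_losses(0.3479, 0.08217, batch_idx_train=500_000))  # ~0.2561
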
], tot_loss[loss=0.2366, simple_loss=0.3197, pruned_loss=0.07679, over 4276759.27 frames. ], batch size: 316, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:17:35,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=582504.0, ans=0.125 2023-06-20 15:17:54,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 2.189e+02 2.611e+02 3.161e+02 5.286e+02, threshold=5.221e+02, percent-clipped=0.0 2023-06-20 15:18:19,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=582624.0, ans=0.125 2023-06-20 15:18:33,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=582684.0, ans=0.0 2023-06-20 15:19:00,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=582744.0, ans=0.2 2023-06-20 15:19:02,457 INFO [train.py:996] (3/4) Epoch 4, batch 5650, loss[loss=0.2579, simple_loss=0.3257, pruned_loss=0.09509, over 21862.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.323, pruned_loss=0.07998, over 4277940.19 frames. ], batch size: 371, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:19:14,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=582804.0, ans=0.125 2023-06-20 15:19:22,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=582864.0, ans=0.0 2023-06-20 15:19:26,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=582864.0, ans=0.125 2023-06-20 15:20:03,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=582984.0, ans=0.1 2023-06-20 15:20:07,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-20 15:20:28,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=583044.0, ans=0.125 2023-06-20 15:20:41,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=583044.0, ans=0.125 2023-06-20 15:20:44,352 INFO [train.py:996] (3/4) Epoch 4, batch 5700, loss[loss=0.2858, simple_loss=0.3554, pruned_loss=0.1081, over 21491.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3221, pruned_loss=0.08256, over 4283863.15 frames. ], batch size: 471, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:21:12,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.371e+02 2.953e+02 3.348e+02 5.178e+02, threshold=5.907e+02, percent-clipped=0.0 2023-06-20 15:21:15,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=583164.0, ans=0.2 2023-06-20 15:22:13,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=583284.0, ans=0.0 2023-06-20 15:22:29,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.95 vs. 
limit=15.0 2023-06-20 15:22:33,077 INFO [train.py:996] (3/4) Epoch 4, batch 5750, loss[loss=0.2078, simple_loss=0.2916, pruned_loss=0.06205, over 21250.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3173, pruned_loss=0.0792, over 4287292.33 frames. ], batch size: 159, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:22:40,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=583404.0, ans=0.125 2023-06-20 15:24:04,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=583644.0, ans=0.125 2023-06-20 15:24:12,029 INFO [train.py:996] (3/4) Epoch 4, batch 5800, loss[loss=0.2666, simple_loss=0.357, pruned_loss=0.08804, over 21708.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3174, pruned_loss=0.07763, over 4289002.17 frames. ], batch size: 351, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:24:38,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=583764.0, ans=0.125 2023-06-20 15:24:46,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.366e+02 2.813e+02 3.634e+02 6.586e+02, threshold=5.626e+02, percent-clipped=4.0 2023-06-20 15:25:04,338 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:26:02,447 INFO [train.py:996] (3/4) Epoch 4, batch 5850, loss[loss=0.175, simple_loss=0.2533, pruned_loss=0.04836, over 21109.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3141, pruned_loss=0.07382, over 4289411.43 frames. ], batch size: 143, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:26:19,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-20 15:26:21,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584064.0, ans=0.1 2023-06-20 15:26:28,059 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:27:03,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=584184.0, ans=0.0 2023-06-20 15:27:04,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584184.0, ans=0.1 2023-06-20 15:27:12,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=584184.0, ans=0.2 2023-06-20 15:27:32,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-20 15:27:39,555 INFO [train.py:996] (3/4) Epoch 4, batch 5900, loss[loss=0.1907, simple_loss=0.2744, pruned_loss=0.05345, over 21783.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3059, pruned_loss=0.0683, over 4291042.57 frames. ], batch size: 298, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:28:10,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 2.239e+02 2.663e+02 3.390e+02 4.720e+02, threshold=5.325e+02, percent-clipped=0.0 2023-06-20 15:28:20,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.61 vs. 
limit=12.0 2023-06-20 15:28:40,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=584424.0, ans=0.125 2023-06-20 15:28:40,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=584424.0, ans=0.125 2023-06-20 15:29:26,563 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.19 vs. limit=22.5 2023-06-20 15:29:26,768 INFO [train.py:996] (3/4) Epoch 4, batch 5950, loss[loss=0.2616, simple_loss=0.3054, pruned_loss=0.1089, over 21540.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.306, pruned_loss=0.07282, over 4291555.38 frames. ], batch size: 441, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:30:05,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=584664.0, ans=0.125 2023-06-20 15:31:13,852 INFO [train.py:996] (3/4) Epoch 4, batch 6000, loss[loss=0.2204, simple_loss=0.2833, pruned_loss=0.07875, over 21881.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3028, pruned_loss=0.0759, over 4276656.72 frames. ], batch size: 98, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:31:13,853 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 15:32:19,089 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2612, simple_loss=0.3595, pruned_loss=0.08138, over 1796401.00 frames. 2023-06-20 15:32:19,090 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-20 15:32:51,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=584964.0, ans=0.0 2023-06-20 15:32:52,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.636e+02 3.112e+02 3.720e+02 8.461e+02, threshold=6.223e+02, percent-clipped=3.0 2023-06-20 15:33:59,945 INFO [train.py:996] (3/4) Epoch 4, batch 6050, loss[loss=0.2148, simple_loss=0.2676, pruned_loss=0.08103, over 21621.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2975, pruned_loss=0.07734, over 4276242.85 frames. ], batch size: 247, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:34:26,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=585204.0, ans=0.2 2023-06-20 15:35:35,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=585384.0, ans=0.0 2023-06-20 15:35:52,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.89 vs. limit=15.0 2023-06-20 15:35:59,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=585444.0, ans=0.125 2023-06-20 15:36:00,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=585444.0, ans=0.04949747468305833 2023-06-20 15:36:02,800 INFO [train.py:996] (3/4) Epoch 4, batch 6100, loss[loss=0.264, simple_loss=0.3296, pruned_loss=0.09922, over 21768.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2968, pruned_loss=0.07578, over 4281060.90 frames. 
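
The validation block above also reports peak GPU memory ("Maximum memory allocated so far is 23918MB"). That figure comes from PyTorch's allocator statistics; a minimal reproduction of such a line (the wrapper function is illustrative, the torch call is standard):

    import torch

    def log_peak_memory(device="cuda:3"):
        peak_mb = torch.cuda.max_memory_allocated(torch.device(device)) // (1024 * 1024)
        print(f"Maximum memory allocated so far is {peak_mb}MB")
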
], batch size: 389, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:36:38,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=585564.0, ans=0.0 2023-06-20 15:36:43,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.490e+02 2.900e+02 3.284e+02 4.880e+02, threshold=5.799e+02, percent-clipped=0.0 2023-06-20 15:37:38,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.68 vs. limit=22.5 2023-06-20 15:37:40,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=585684.0, ans=0.125 2023-06-20 15:38:11,426 INFO [train.py:996] (3/4) Epoch 4, batch 6150, loss[loss=0.2395, simple_loss=0.3076, pruned_loss=0.08566, over 21642.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3001, pruned_loss=0.07864, over 4277175.30 frames. ], batch size: 391, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:38:42,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=585864.0, ans=0.0 2023-06-20 15:38:51,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=585924.0, ans=0.125 2023-06-20 15:39:00,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=585924.0, ans=0.0 2023-06-20 15:39:17,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=585984.0, ans=0.125 2023-06-20 15:39:24,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0 2023-06-20 15:40:11,442 INFO [train.py:996] (3/4) Epoch 4, batch 6200, loss[loss=0.2556, simple_loss=0.3223, pruned_loss=0.09448, over 21336.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3032, pruned_loss=0.07964, over 4280619.20 frames. ], batch size: 159, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:40:27,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=586104.0, ans=0.125 2023-06-20 15:40:39,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.393e+02 2.716e+02 3.174e+02 4.783e+02, threshold=5.432e+02, percent-clipped=0.0 2023-06-20 15:42:09,652 INFO [train.py:996] (3/4) Epoch 4, batch 6250, loss[loss=0.2859, simple_loss=0.3795, pruned_loss=0.09617, over 21520.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3069, pruned_loss=0.07862, over 4273900.55 frames. 
], batch size: 471, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 15:42:54,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=586464.0, ans=0.125 2023-06-20 15:43:19,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=586584.0, ans=0.0 2023-06-20 15:43:19,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=586584.0, ans=0.04949747468305833 2023-06-20 15:43:21,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=586584.0, ans=0.125 2023-06-20 15:44:16,073 INFO [train.py:996] (3/4) Epoch 4, batch 6300, loss[loss=0.256, simple_loss=0.3197, pruned_loss=0.09611, over 21849.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3109, pruned_loss=0.07883, over 4272083.48 frames. ], batch size: 124, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 15:44:48,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.254e+02 2.717e+02 3.393e+02 6.074e+02, threshold=5.434e+02, percent-clipped=2.0 2023-06-20 15:44:57,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=586824.0, ans=0.1 2023-06-20 15:45:53,424 INFO [train.py:996] (3/4) Epoch 4, batch 6350, loss[loss=0.2463, simple_loss=0.3106, pruned_loss=0.09102, over 21930.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3162, pruned_loss=0.08401, over 4275921.83 frames. ], batch size: 351, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 15:45:54,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=587004.0, ans=0.0 2023-06-20 15:46:30,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=587064.0, ans=0.125 2023-06-20 15:46:36,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=587064.0, ans=0.125 2023-06-20 15:48:04,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=587244.0, ans=0.125 2023-06-20 15:48:08,583 INFO [train.py:996] (3/4) Epoch 4, batch 6400, loss[loss=0.2816, simple_loss=0.3869, pruned_loss=0.08817, over 19725.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3241, pruned_loss=0.08809, over 4275788.99 frames. ], batch size: 703, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 15:48:09,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=587304.0, ans=0.2 2023-06-20 15:48:09,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-20 15:48:29,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=587364.0, ans=0.125 2023-06-20 15:48:36,209 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.822e+02 3.385e+02 3.962e+02 5.879e+02, threshold=6.771e+02, percent-clipped=4.0 2023-06-20 15:48:38,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. 
limit=15.0 2023-06-20 15:48:41,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=587364.0, ans=0.125 2023-06-20 15:48:41,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=587364.0, ans=0.0 2023-06-20 15:48:56,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=587424.0, ans=0.125 2023-06-20 15:50:00,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-20 15:50:10,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=587604.0, ans=0.0 2023-06-20 15:50:16,753 INFO [train.py:996] (3/4) Epoch 4, batch 6450, loss[loss=0.228, simple_loss=0.3031, pruned_loss=0.07644, over 21330.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3255, pruned_loss=0.08772, over 4280902.81 frames. ], batch size: 176, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 15:50:17,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=587604.0, ans=0.125 2023-06-20 15:50:18,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=587604.0, ans=0.125 2023-06-20 15:50:31,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-20 15:50:50,439 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-06-20 15:51:18,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.14 vs. limit=10.0 2023-06-20 15:51:25,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=587784.0, ans=0.0 2023-06-20 15:51:41,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-20 15:51:50,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=587844.0, ans=0.0 2023-06-20 15:51:52,946 INFO [train.py:996] (3/4) Epoch 4, batch 6500, loss[loss=0.2012, simple_loss=0.2648, pruned_loss=0.06883, over 21606.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3173, pruned_loss=0.08569, over 4272545.11 frames. ], batch size: 247, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 15:52:02,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=587904.0, ans=0.125 2023-06-20 15:52:19,375 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.385e+02 2.722e+02 3.333e+02 5.165e+02, threshold=5.444e+02, percent-clipped=0.0 2023-06-20 15:52:33,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.80 vs. 
limit=15.0 2023-06-20 15:52:34,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-20 15:53:00,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=588024.0, ans=0.125 2023-06-20 15:53:12,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=588084.0, ans=0.0 2023-06-20 15:53:42,936 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:53:49,882 INFO [train.py:996] (3/4) Epoch 4, batch 6550, loss[loss=0.2752, simple_loss=0.3374, pruned_loss=0.1065, over 21752.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.316, pruned_loss=0.0846, over 4274446.19 frames. ], batch size: 441, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 15:53:59,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=588204.0, ans=0.125 2023-06-20 15:54:12,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=588264.0, ans=0.0 2023-06-20 15:54:43,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=588324.0, ans=0.125 2023-06-20 15:54:49,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=588384.0, ans=0.125 2023-06-20 15:55:24,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=588444.0, ans=0.125 2023-06-20 15:55:39,397 INFO [train.py:996] (3/4) Epoch 4, batch 6600, loss[loss=0.2057, simple_loss=0.2709, pruned_loss=0.07025, over 21831.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3102, pruned_loss=0.0838, over 4278486.51 frames. ], batch size: 107, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 15:55:56,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.49 vs. limit=6.0 2023-06-20 15:56:09,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.413e+02 2.653e+02 3.142e+02 5.278e+02, threshold=5.306e+02, percent-clipped=0.0 2023-06-20 15:56:25,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=588564.0, ans=0.04949747468305833 2023-06-20 15:56:28,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=588624.0, ans=0.125 2023-06-20 15:56:47,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=588684.0, ans=0.1 2023-06-20 15:57:12,506 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.13 vs. limit=5.0 2023-06-20 15:57:31,176 INFO [train.py:996] (3/4) Epoch 4, batch 6650, loss[loss=0.2146, simple_loss=0.2773, pruned_loss=0.07596, over 21746.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3047, pruned_loss=0.08048, over 4271967.92 frames. 
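
Each "tot_loss[..., over N frames.]" above is an aggregate over millions of frames, i.e. a frame-weighted average rather than a plain mean over batches: batches contribute in proportion to how many acoustic frames they contain. A compact sketch of that bookkeeping (not icefall's MetricsTracker verbatim):

    class RunningLoss:
        """Frame-weighted running average of a loss."""
        def __init__(self):
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss, num_frames):
            self.loss_sum += loss * num_frames
            self.frames += num_frames

        @property
        def avg(self):
            return self.loss_sum / max(self.frames, 1.0)

    t = RunningLoss()
    t.update(0.2146, 21746.0)
    print(f"tot_loss[loss={t.avg:.4f}, over {t.frames:.2f} frames. ]")
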
], batch size: 112, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 15:58:17,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=588924.0, ans=0.0 2023-06-20 15:58:35,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=588984.0, ans=0.125 2023-06-20 15:58:47,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-20 15:58:52,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=589044.0, ans=0.0 2023-06-20 15:59:07,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.48 vs. limit=6.0 2023-06-20 15:59:13,814 INFO [train.py:996] (3/4) Epoch 4, batch 6700, loss[loss=0.2169, simple_loss=0.2809, pruned_loss=0.07649, over 21850.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2997, pruned_loss=0.07981, over 4268434.09 frames. ], batch size: 107, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 15:59:54,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.379e+02 2.739e+02 3.210e+02 5.153e+02, threshold=5.478e+02, percent-clipped=0.0 2023-06-20 15:59:55,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-20 16:00:59,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=589344.0, ans=10.0 2023-06-20 16:01:03,391 INFO [train.py:996] (3/4) Epoch 4, batch 6750, loss[loss=0.2375, simple_loss=0.2967, pruned_loss=0.08914, over 21865.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2976, pruned_loss=0.07896, over 4264204.71 frames. ], batch size: 283, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 16:01:13,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-20 16:02:39,945 INFO [train.py:996] (3/4) Epoch 4, batch 6800, loss[loss=0.2309, simple_loss=0.2888, pruned_loss=0.0865, over 21597.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2993, pruned_loss=0.08108, over 4257409.55 frames. ], batch size: 414, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 16:02:52,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=15.0 2023-06-20 16:03:00,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.68 vs. limit=12.0 2023-06-20 16:03:02,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.637e+02 2.976e+02 3.448e+02 7.542e+02, threshold=5.952e+02, percent-clipped=4.0 2023-06-20 16:03:31,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=589824.0, ans=0.2 2023-06-20 16:03:48,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=589884.0, ans=0.1 2023-06-20 16:04:16,749 INFO [train.py:996] (3/4) Epoch 4, batch 6850, loss[loss=0.2232, simple_loss=0.2816, pruned_loss=0.08239, over 21674.00 frames. 
], tot_loss[loss=0.2319, simple_loss=0.2983, pruned_loss=0.08273, over 4261863.90 frames. ], batch size: 263, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 16:04:20,198 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:04:25,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-06-20 16:04:27,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=590004.0, ans=0.125 2023-06-20 16:04:52,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=590124.0, ans=0.2 2023-06-20 16:04:57,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=590124.0, ans=0.125 2023-06-20 16:04:59,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=590124.0, ans=0.125 2023-06-20 16:05:00,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=590124.0, ans=0.2 2023-06-20 16:05:14,080 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:05:55,773 INFO [train.py:996] (3/4) Epoch 4, batch 6900, loss[loss=0.2283, simple_loss=0.2955, pruned_loss=0.08053, over 21790.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3004, pruned_loss=0.08255, over 4270677.02 frames. ], batch size: 112, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 16:06:47,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 2.466e+02 2.910e+02 3.426e+02 7.067e+02, threshold=5.820e+02, percent-clipped=2.0 2023-06-20 16:07:25,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=590484.0, ans=0.125 2023-06-20 16:07:50,460 INFO [train.py:996] (3/4) Epoch 4, batch 6950, loss[loss=0.2608, simple_loss=0.3426, pruned_loss=0.08945, over 21808.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3023, pruned_loss=0.0804, over 4273045.26 frames. ], batch size: 118, lr: 8.29e-03, grad_scale: 32.0 2023-06-20 16:07:55,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-20 16:08:30,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=590664.0, ans=0.04949747468305833 2023-06-20 16:09:17,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.19 vs. limit=15.0 2023-06-20 16:09:22,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=590844.0, ans=15.0 2023-06-20 16:09:28,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-20 16:09:31,658 INFO [train.py:996] (3/4) Epoch 4, batch 7000, loss[loss=0.34, simple_loss=0.3801, pruned_loss=0.15, over 21290.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3062, pruned_loss=0.08404, over 4276295.56 frames. 
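
Every training entry above carries "grad_scale: 32.0", the current dynamic loss scale of mixed-precision training: fp16 gradients are kept in range by scaling the loss up before backward and unscaling before the optimizer step. A minimal standard-PyTorch sketch (model, optimizer and loss_fn are placeholders, not the recipe's objects):

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

    def training_step(model, optimizer, features, targets, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(features), targets)
        scaler.scale(loss).backward()  # scaled backward keeps fp16 grads finite
        scaler.step(optimizer)         # unscales; skips the step on overflow
        scaler.update()                # grows/shrinks the scale over time
        return loss.detach(), scaler.get_scale()
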
], batch size: 507, lr: 8.29e-03, grad_scale: 32.0 2023-06-20 16:09:54,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=590904.0, ans=0.0 2023-06-20 16:10:04,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.556e+02 2.969e+02 3.627e+02 5.556e+02, threshold=5.939e+02, percent-clipped=0.0 2023-06-20 16:10:08,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=590964.0, ans=0.125 2023-06-20 16:10:51,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=591084.0, ans=15.0 2023-06-20 16:11:03,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-20 16:11:31,287 INFO [train.py:996] (3/4) Epoch 4, batch 7050, loss[loss=0.2218, simple_loss=0.3095, pruned_loss=0.06708, over 21662.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.304, pruned_loss=0.08274, over 4271777.43 frames. ], batch size: 414, lr: 8.29e-03, grad_scale: 32.0 2023-06-20 16:11:53,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=591204.0, ans=0.125 2023-06-20 16:12:14,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.64 vs. limit=15.0 2023-06-20 16:12:55,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=591444.0, ans=0.125 2023-06-20 16:13:35,494 INFO [train.py:996] (3/4) Epoch 4, batch 7100, loss[loss=0.3216, simple_loss=0.3706, pruned_loss=0.1363, over 21408.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3096, pruned_loss=0.08386, over 4271737.85 frames. ], batch size: 471, lr: 8.29e-03, grad_scale: 32.0 2023-06-20 16:13:38,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=591504.0, ans=0.0 2023-06-20 16:13:57,793 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.233e+02 2.662e+02 3.259e+02 5.350e+02, threshold=5.324e+02, percent-clipped=0.0 2023-06-20 16:15:03,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=591744.0, ans=0.125 2023-06-20 16:15:18,612 INFO [train.py:996] (3/4) Epoch 4, batch 7150, loss[loss=0.2597, simple_loss=0.3239, pruned_loss=0.09774, over 21773.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3049, pruned_loss=0.08096, over 4278754.16 frames. ], batch size: 247, lr: 8.29e-03, grad_scale: 32.0 2023-06-20 16:15:20,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=591804.0, ans=0.0 2023-06-20 16:15:35,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=591864.0, ans=0.125 2023-06-20 16:15:52,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=591924.0, ans=0.2 2023-06-20 16:17:04,855 INFO [train.py:996] (3/4) Epoch 4, batch 7200, loss[loss=0.3096, simple_loss=0.3628, pruned_loss=0.1281, over 21780.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3088, pruned_loss=0.08397, over 4274765.02 frames. 
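
The learning rate in these entries drifts down smoothly (8.42e-03 near batch 4100, 8.28e-03 by batch 7200) as a function of both the batch counter and the epoch. icefall's Eden scheduler uses an inverse-fourth-root decay in each; the sketch below reconstructs that shape from memory, so treat the exact form, and especially how batch and epoch are counted, as assumptions rather than a way to reproduce the logged values:

    def eden_lr(base_lr, batch, epoch, lr_batches, lr_epochs):
        """lr == base_lr at batch=epoch=0, decaying smoothly as both grow."""
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor
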
], batch size: 441, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 16:17:27,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 2.565e+02 2.895e+02 3.560e+02 7.126e+02, threshold=5.790e+02, percent-clipped=7.0 2023-06-20 16:17:49,801 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2023-06-20 16:18:40,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=592404.0, ans=0.125 2023-06-20 16:18:41,922 INFO [train.py:996] (3/4) Epoch 4, batch 7250, loss[loss=0.2183, simple_loss=0.2797, pruned_loss=0.07839, over 21744.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3057, pruned_loss=0.08347, over 4259204.49 frames. ], batch size: 300, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 16:18:55,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=592464.0, ans=0.0 2023-06-20 16:19:24,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=592524.0, ans=0.125 2023-06-20 16:19:33,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=592524.0, ans=0.125 2023-06-20 16:19:35,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=592524.0, ans=0.125 2023-06-20 16:19:40,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=592584.0, ans=0.125 2023-06-20 16:20:04,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=592644.0, ans=0.035 2023-06-20 16:20:06,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=592644.0, ans=0.125 2023-06-20 16:20:19,815 INFO [train.py:996] (3/4) Epoch 4, batch 7300, loss[loss=0.2164, simple_loss=0.2768, pruned_loss=0.07802, over 21578.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2991, pruned_loss=0.08243, over 4262596.29 frames. ], batch size: 298, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 16:20:53,513 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.476e+02 2.833e+02 3.500e+02 5.020e+02, threshold=5.666e+02, percent-clipped=0.0 2023-06-20 16:22:03,810 INFO [train.py:996] (3/4) Epoch 4, batch 7350, loss[loss=0.2056, simple_loss=0.2639, pruned_loss=0.07361, over 21752.00 frames. ], tot_loss[loss=0.231, simple_loss=0.296, pruned_loss=0.08304, over 4255823.54 frames. ], batch size: 300, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 16:22:50,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=593124.0, ans=0.0 2023-06-20 16:23:19,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=593184.0, ans=0.125 2023-06-20 16:23:20,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.98 vs. limit=10.0 2023-06-20 16:23:56,341 INFO [train.py:996] (3/4) Epoch 4, batch 7400, loss[loss=0.1768, simple_loss=0.2092, pruned_loss=0.07223, over 16704.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3023, pruned_loss=0.08571, over 4250450.02 frames. 
], batch size: 60, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 16:24:12,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=593304.0, ans=0.125 2023-06-20 16:24:24,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.858e+02 3.325e+02 3.971e+02 5.288e+02, threshold=6.650e+02, percent-clipped=0.0 2023-06-20 16:25:12,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-20 16:25:23,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=593544.0, ans=0.125 2023-06-20 16:25:33,488 INFO [train.py:996] (3/4) Epoch 4, batch 7450, loss[loss=0.2108, simple_loss=0.2756, pruned_loss=0.07297, over 21574.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3013, pruned_loss=0.08436, over 4256874.73 frames. ], batch size: 247, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 16:26:29,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=593724.0, ans=0.125 2023-06-20 16:27:13,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=593844.0, ans=0.125 2023-06-20 16:27:16,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=593844.0, ans=0.0 2023-06-20 16:27:34,237 INFO [train.py:996] (3/4) Epoch 4, batch 7500, loss[loss=0.2798, simple_loss=0.3865, pruned_loss=0.08651, over 21302.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.308, pruned_loss=0.0862, over 4257419.21 frames. ], batch size: 549, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 16:28:09,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.540e+02 2.828e+02 3.392e+02 7.342e+02, threshold=5.655e+02, percent-clipped=3.0 2023-06-20 16:28:12,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=593964.0, ans=0.125 2023-06-20 16:28:25,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-20 16:28:51,856 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:29:13,482 INFO [train.py:996] (3/4) Epoch 4, batch 7550, loss[loss=0.2257, simple_loss=0.3209, pruned_loss=0.06525, over 21607.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3163, pruned_loss=0.08498, over 4260520.46 frames. ], batch size: 263, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 16:29:54,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=594324.0, ans=10.0 2023-06-20 16:30:06,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=594324.0, ans=0.0 2023-06-20 16:30:11,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=594384.0, ans=0.125 2023-06-20 16:30:24,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. 
limit=6.0 2023-06-20 16:30:26,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=594384.0, ans=0.0 2023-06-20 16:30:29,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=594444.0, ans=0.125 2023-06-20 16:30:37,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=594444.0, ans=0.125 2023-06-20 16:30:49,997 INFO [train.py:996] (3/4) Epoch 4, batch 7600, loss[loss=0.2454, simple_loss=0.3082, pruned_loss=0.09136, over 21800.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3154, pruned_loss=0.08334, over 4270178.92 frames. ], batch size: 282, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 16:30:55,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.54 vs. limit=10.0 2023-06-20 16:30:56,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-20 16:30:59,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=594504.0, ans=0.125 2023-06-20 16:31:08,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=594564.0, ans=0.125 2023-06-20 16:31:17,500 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.434e+02 2.853e+02 3.264e+02 4.790e+02, threshold=5.707e+02, percent-clipped=0.0 2023-06-20 16:31:39,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=594624.0, ans=0.125 2023-06-20 16:31:51,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=594684.0, ans=0.125 2023-06-20 16:32:11,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=594744.0, ans=0.0 2023-06-20 16:32:25,976 INFO [train.py:996] (3/4) Epoch 4, batch 7650, loss[loss=0.241, simple_loss=0.3007, pruned_loss=0.09065, over 21408.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3132, pruned_loss=0.08522, over 4279722.40 frames. ], batch size: 159, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 16:32:28,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=594804.0, ans=0.125 2023-06-20 16:33:21,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=594984.0, ans=0.125 2023-06-20 16:33:25,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. 
limit=15.0 2023-06-20 16:33:36,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=594984.0, ans=0.0 2023-06-20 16:33:42,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=595044.0, ans=0.0 2023-06-20 16:33:51,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=595044.0, ans=0.1 2023-06-20 16:34:00,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=595044.0, ans=0.05 2023-06-20 16:34:04,134 INFO [train.py:996] (3/4) Epoch 4, batch 7700, loss[loss=0.3033, simple_loss=0.3637, pruned_loss=0.1215, over 21780.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3159, pruned_loss=0.08847, over 4282475.35 frames. ], batch size: 441, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 16:34:40,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.502e+02 2.863e+02 3.403e+02 5.125e+02, threshold=5.726e+02, percent-clipped=0.0 2023-06-20 16:35:02,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-20 16:36:10,693 INFO [train.py:996] (3/4) Epoch 4, batch 7750, loss[loss=0.3077, simple_loss=0.4044, pruned_loss=0.1055, over 21860.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3224, pruned_loss=0.08916, over 4283985.66 frames. ], batch size: 372, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 16:36:18,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=595404.0, ans=0.125 2023-06-20 16:36:34,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-20 16:37:33,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=595584.0, ans=0.125 2023-06-20 16:38:19,755 INFO [train.py:996] (3/4) Epoch 4, batch 7800, loss[loss=0.323, simple_loss=0.3784, pruned_loss=0.1338, over 21395.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3228, pruned_loss=0.08873, over 4283266.41 frames. ], batch size: 507, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 16:38:49,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.887e+02 3.418e+02 4.293e+02 7.867e+02, threshold=6.836e+02, percent-clipped=4.0 2023-06-20 16:38:51,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=595764.0, ans=0.0 2023-06-20 16:39:06,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=595824.0, ans=0.125 2023-06-20 16:39:15,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-20 16:39:58,218 INFO [train.py:996] (3/4) Epoch 4, batch 7850, loss[loss=0.2251, simple_loss=0.2851, pruned_loss=0.08256, over 21793.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.315, pruned_loss=0.08768, over 4273396.97 frames. 
], batch size: 112, lr: 8.26e-03, grad_scale: 16.0 2023-06-20 16:40:44,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=596124.0, ans=0.0 2023-06-20 16:41:48,965 INFO [train.py:996] (3/4) Epoch 4, batch 7900, loss[loss=0.2076, simple_loss=0.257, pruned_loss=0.07911, over 20016.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3098, pruned_loss=0.08646, over 4262633.95 frames. ], batch size: 704, lr: 8.26e-03, grad_scale: 16.0 2023-06-20 16:42:35,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.748e+02 3.368e+02 4.194e+02 8.125e+02, threshold=6.737e+02, percent-clipped=4.0 2023-06-20 16:43:56,409 INFO [train.py:996] (3/4) Epoch 4, batch 7950, loss[loss=0.2426, simple_loss=0.311, pruned_loss=0.08712, over 21344.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.317, pruned_loss=0.0876, over 4268445.52 frames. ], batch size: 159, lr: 8.25e-03, grad_scale: 16.0 2023-06-20 16:44:06,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=596604.0, ans=0.125 2023-06-20 16:44:07,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=596604.0, ans=0.025 2023-06-20 16:44:13,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.26 vs. limit=6.0 2023-06-20 16:45:55,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-20 16:45:56,308 INFO [train.py:996] (3/4) Epoch 4, batch 8000, loss[loss=0.2379, simple_loss=0.3064, pruned_loss=0.08468, over 21601.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3215, pruned_loss=0.08991, over 4265578.74 frames. ], batch size: 112, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 16:46:00,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-20 16:46:27,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.568e+02 2.923e+02 3.494e+02 7.833e+02, threshold=5.846e+02, percent-clipped=1.0 2023-06-20 16:46:43,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=597024.0, ans=0.0 2023-06-20 16:47:51,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=597144.0, ans=0.125 2023-06-20 16:48:21,258 INFO [train.py:996] (3/4) Epoch 4, batch 8050, loss[loss=0.2108, simple_loss=0.2814, pruned_loss=0.07005, over 21466.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3233, pruned_loss=0.08997, over 4255253.15 frames. 
], batch size: 211, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 16:48:33,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=597204.0, ans=0.1 2023-06-20 16:48:41,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=597264.0, ans=0.0 2023-06-20 16:49:06,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=597324.0, ans=0.0 2023-06-20 16:49:10,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=597324.0, ans=0.125 2023-06-20 16:49:22,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=597384.0, ans=0.125 2023-06-20 16:49:36,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=597384.0, ans=0.125 2023-06-20 16:49:39,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=597444.0, ans=0.2 2023-06-20 16:49:44,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.61 vs. limit=12.0 2023-06-20 16:49:50,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=597444.0, ans=0.125 2023-06-20 16:49:50,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-20 16:50:00,271 INFO [train.py:996] (3/4) Epoch 4, batch 8100, loss[loss=0.3016, simple_loss=0.3526, pruned_loss=0.1253, over 21605.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.324, pruned_loss=0.0916, over 4267825.64 frames. ], batch size: 471, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 16:50:37,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=597564.0, ans=0.0 2023-06-20 16:50:38,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 2.976e+02 3.712e+02 5.223e+02 1.010e+03, threshold=7.424e+02, percent-clipped=11.0 2023-06-20 16:52:09,556 INFO [train.py:996] (3/4) Epoch 4, batch 8150, loss[loss=0.2481, simple_loss=0.3493, pruned_loss=0.07342, over 21823.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.33, pruned_loss=0.09223, over 4259047.37 frames. 
], batch size: 372, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 16:52:29,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=597864.0, ans=0.2 2023-06-20 16:52:38,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=597864.0, ans=0.125 2023-06-20 16:52:50,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=597924.0, ans=0.5 2023-06-20 16:52:59,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=597924.0, ans=0.1 2023-06-20 16:53:08,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=597984.0, ans=0.1 2023-06-20 16:53:38,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=598044.0, ans=0.125 2023-06-20 16:53:50,585 INFO [train.py:996] (3/4) Epoch 4, batch 8200, loss[loss=0.2299, simple_loss=0.2898, pruned_loss=0.08498, over 21439.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3209, pruned_loss=0.08911, over 4255567.85 frames. ], batch size: 389, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 16:54:36,858 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.510e+02 2.906e+02 3.709e+02 6.115e+02, threshold=5.811e+02, percent-clipped=0.0 2023-06-20 16:55:33,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=598344.0, ans=0.125 2023-06-20 16:55:40,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=598344.0, ans=0.0 2023-06-20 16:55:54,976 INFO [train.py:996] (3/4) Epoch 4, batch 8250, loss[loss=0.3332, simple_loss=0.3976, pruned_loss=0.1344, over 21542.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3219, pruned_loss=0.08934, over 4265208.44 frames. ], batch size: 508, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 16:56:13,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=598404.0, ans=0.125 2023-06-20 16:56:14,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=598464.0, ans=0.1 2023-06-20 16:56:30,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=598464.0, ans=0.0 2023-06-20 16:57:14,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=598584.0, ans=0.125 2023-06-20 16:57:36,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=598644.0, ans=0.125 2023-06-20 16:57:38,849 INFO [train.py:996] (3/4) Epoch 4, batch 8300, loss[loss=0.2289, simple_loss=0.3102, pruned_loss=0.07383, over 21714.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3192, pruned_loss=0.08606, over 4266172.87 frames. ], batch size: 298, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 16:58:09,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.21 vs. 
limit=15.0 2023-06-20 16:58:09,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.374e+02 2.802e+02 3.318e+02 5.012e+02, threshold=5.604e+02, percent-clipped=0.0 2023-06-20 16:58:10,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=598764.0, ans=0.125 2023-06-20 16:58:16,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=598824.0, ans=0.0 2023-06-20 16:58:22,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=598824.0, ans=0.125 2023-06-20 16:58:22,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=598824.0, ans=0.0 2023-06-20 16:58:38,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=598884.0, ans=0.125 2023-06-20 16:59:13,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=598944.0, ans=0.2 2023-06-20 16:59:15,924 INFO [train.py:996] (3/4) Epoch 4, batch 8350, loss[loss=0.2317, simple_loss=0.3, pruned_loss=0.08171, over 21632.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3173, pruned_loss=0.08375, over 4275640.56 frames. ], batch size: 247, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 16:59:43,589 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:00:13,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=599184.0, ans=0.1 2023-06-20 17:00:54,169 INFO [train.py:996] (3/4) Epoch 4, batch 8400, loss[loss=0.1955, simple_loss=0.2684, pruned_loss=0.06129, over 21273.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.314, pruned_loss=0.08062, over 4275387.48 frames. ], batch size: 131, lr: 8.23e-03, grad_scale: 32.0 2023-06-20 17:01:10,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=599304.0, ans=0.125 2023-06-20 17:01:24,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.360e+02 2.679e+02 3.034e+02 4.807e+02, threshold=5.358e+02, percent-clipped=0.0 2023-06-20 17:02:38,953 INFO [train.py:996] (3/4) Epoch 4, batch 8450, loss[loss=0.2835, simple_loss=0.3251, pruned_loss=0.121, over 21662.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3138, pruned_loss=0.08104, over 4281696.67 frames. ], batch size: 508, lr: 8.23e-03, grad_scale: 32.0 2023-06-20 17:03:06,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=599664.0, ans=0.125 2023-06-20 17:03:13,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=599664.0, ans=0.125 2023-06-20 17:03:37,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=599724.0, ans=0.1 2023-06-20 17:03:49,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=599724.0, ans=0.02 2023-06-20 17:04:00,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. 
limit=15.0 2023-06-20 17:04:34,224 INFO [train.py:996] (3/4) Epoch 4, batch 8500, loss[loss=0.248, simple_loss=0.3061, pruned_loss=0.09493, over 21658.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3105, pruned_loss=0.08253, over 4287030.84 frames. ], batch size: 332, lr: 8.23e-03, grad_scale: 32.0 2023-06-20 17:04:42,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=599904.0, ans=0.1 2023-06-20 17:04:54,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=599964.0, ans=0.0 2023-06-20 17:05:09,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 2.562e+02 2.810e+02 3.301e+02 4.951e+02, threshold=5.621e+02, percent-clipped=0.0 2023-06-20 17:05:52,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=600084.0, ans=0.125 2023-06-20 17:05:58,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=600144.0, ans=0.2 2023-06-20 17:06:10,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=600144.0, ans=0.2 2023-06-20 17:06:11,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=600144.0, ans=0.1 2023-06-20 17:06:13,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=600144.0, ans=0.125 2023-06-20 17:06:17,598 INFO [train.py:996] (3/4) Epoch 4, batch 8550, loss[loss=0.2995, simple_loss=0.3969, pruned_loss=0.1011, over 21275.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3156, pruned_loss=0.08509, over 4290153.79 frames. ], batch size: 548, lr: 8.23e-03, grad_scale: 32.0 2023-06-20 17:07:07,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.42 vs. limit=5.0 2023-06-20 17:07:15,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=600384.0, ans=0.125 2023-06-20 17:07:36,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=600384.0, ans=0.125 2023-06-20 17:08:03,048 INFO [train.py:996] (3/4) Epoch 4, batch 8600, loss[loss=0.3074, simple_loss=0.3681, pruned_loss=0.1233, over 21433.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3219, pruned_loss=0.08743, over 4288126.90 frames. ], batch size: 471, lr: 8.23e-03, grad_scale: 32.0 2023-06-20 17:08:37,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=600564.0, ans=0.1 2023-06-20 17:08:38,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=600564.0, ans=0.125 2023-06-20 17:08:40,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. 
limit=15.0 2023-06-20 17:08:49,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.757e+02 3.185e+02 4.135e+02 6.803e+02, threshold=6.371e+02, percent-clipped=9.0 2023-06-20 17:08:51,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-20 17:08:58,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=600624.0, ans=0.0 2023-06-20 17:09:12,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=600624.0, ans=0.125 2023-06-20 17:09:16,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=600684.0, ans=0.125 2023-06-20 17:09:37,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=600684.0, ans=0.125 2023-06-20 17:10:02,025 INFO [train.py:996] (3/4) Epoch 4, batch 8650, loss[loss=0.217, simple_loss=0.3244, pruned_loss=0.05477, over 21237.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3281, pruned_loss=0.08787, over 4285333.26 frames. ], batch size: 548, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 17:10:39,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=600924.0, ans=0.125 2023-06-20 17:10:42,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=600924.0, ans=0.1 2023-06-20 17:11:38,698 INFO [train.py:996] (3/4) Epoch 4, batch 8700, loss[loss=0.213, simple_loss=0.2769, pruned_loss=0.07456, over 21802.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3188, pruned_loss=0.08405, over 4281989.10 frames. ], batch size: 118, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 17:12:09,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 2.376e+02 2.840e+02 3.647e+02 6.545e+02, threshold=5.680e+02, percent-clipped=1.0 2023-06-20 17:12:57,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=601284.0, ans=0.125 2023-06-20 17:13:24,748 INFO [train.py:996] (3/4) Epoch 4, batch 8750, loss[loss=0.3067, simple_loss=0.3525, pruned_loss=0.1304, over 21586.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3148, pruned_loss=0.08478, over 4285856.32 frames. ], batch size: 471, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 17:13:27,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=601404.0, ans=0.1 2023-06-20 17:13:53,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=601464.0, ans=0.125 2023-06-20 17:14:21,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-20 17:14:47,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=601644.0, ans=0.125 2023-06-20 17:15:02,057 INFO [train.py:996] (3/4) Epoch 4, batch 8800, loss[loss=0.2484, simple_loss=0.3293, pruned_loss=0.08375, over 21512.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3226, pruned_loss=0.08798, over 4289246.29 frames. 
], batch size: 194, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 17:15:41,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=601764.0, ans=0.07 2023-06-20 17:15:42,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.769e+02 3.106e+02 3.590e+02 5.947e+02, threshold=6.211e+02, percent-clipped=1.0 2023-06-20 17:16:19,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=601824.0, ans=0.0 2023-06-20 17:16:31,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-20 17:16:37,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=601884.0, ans=0.125 2023-06-20 17:16:49,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=601944.0, ans=0.125 2023-06-20 17:16:53,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-20 17:16:59,659 INFO [train.py:996] (3/4) Epoch 4, batch 8850, loss[loss=0.2945, simple_loss=0.3361, pruned_loss=0.1265, over 21349.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3293, pruned_loss=0.09035, over 4287649.57 frames. ], batch size: 508, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 17:17:37,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=602064.0, ans=0.0 2023-06-20 17:18:19,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=602184.0, ans=0.1 2023-06-20 17:18:22,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=602184.0, ans=0.125 2023-06-20 17:18:40,457 INFO [train.py:996] (3/4) Epoch 4, batch 8900, loss[loss=0.2181, simple_loss=0.2824, pruned_loss=0.07691, over 21600.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3235, pruned_loss=0.08863, over 4287960.23 frames. ], batch size: 298, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 17:19:30,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.687e+02 3.012e+02 3.727e+02 9.034e+02, threshold=6.025e+02, percent-clipped=1.0 2023-06-20 17:20:09,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=602424.0, ans=0.0 2023-06-20 17:20:11,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=602484.0, ans=0.0 2023-06-20 17:20:34,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=602544.0, ans=0.1 2023-06-20 17:20:55,166 INFO [train.py:996] (3/4) Epoch 4, batch 8950, loss[loss=0.2041, simple_loss=0.254, pruned_loss=0.0771, over 20760.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3241, pruned_loss=0.08832, over 4283447.05 frames. 
], batch size: 609, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 17:21:01,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=602604.0, ans=0.125 2023-06-20 17:22:02,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=602784.0, ans=0.125 2023-06-20 17:22:10,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-20 17:22:22,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-20 17:22:24,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=602844.0, ans=0.125 2023-06-20 17:22:38,302 INFO [train.py:996] (3/4) Epoch 4, batch 9000, loss[loss=0.2142, simple_loss=0.281, pruned_loss=0.07372, over 21057.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3192, pruned_loss=0.08797, over 4277392.08 frames. ], batch size: 143, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 17:22:38,303 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 17:23:29,200 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2733, simple_loss=0.3656, pruned_loss=0.09047, over 1796401.00 frames. 2023-06-20 17:23:29,201 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-20 17:23:31,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=602904.0, ans=0.2 2023-06-20 17:23:42,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-20 17:23:55,091 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.920e+02 3.392e+02 4.044e+02 7.869e+02, threshold=6.783e+02, percent-clipped=2.0 2023-06-20 17:23:55,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=602964.0, ans=0.125 2023-06-20 17:25:06,920 INFO [train.py:996] (3/4) Epoch 4, batch 9050, loss[loss=0.184, simple_loss=0.2633, pruned_loss=0.05238, over 21556.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3135, pruned_loss=0.08403, over 4273914.35 frames. ], batch size: 212, lr: 8.21e-03, grad_scale: 16.0 2023-06-20 17:25:18,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=603204.0, ans=0.0 2023-06-20 17:25:18,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=603204.0, ans=0.0 2023-06-20 17:25:50,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.61 vs. limit=15.0 2023-06-20 17:25:51,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=603324.0, ans=0.2 2023-06-20 17:26:16,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=12.0 2023-06-20 17:26:51,030 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. 
limit=10.0 2023-06-20 17:27:03,497 INFO [train.py:996] (3/4) Epoch 4, batch 9100, loss[loss=0.2565, simple_loss=0.3475, pruned_loss=0.08276, over 21664.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3208, pruned_loss=0.08754, over 4268373.38 frames. ], batch size: 441, lr: 8.21e-03, grad_scale: 16.0 2023-06-20 17:27:57,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.581e+02 3.087e+02 3.555e+02 5.100e+02, threshold=6.174e+02, percent-clipped=0.0 2023-06-20 17:28:27,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=603684.0, ans=0.1 2023-06-20 17:28:42,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=603684.0, ans=0.1 2023-06-20 17:28:44,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=603684.0, ans=0.2 2023-06-20 17:29:02,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=603744.0, ans=0.125 2023-06-20 17:29:08,292 INFO [train.py:996] (3/4) Epoch 4, batch 9150, loss[loss=0.264, simple_loss=0.3663, pruned_loss=0.08086, over 21218.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3225, pruned_loss=0.08468, over 4270070.46 frames. ], batch size: 548, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 17:29:16,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-20 17:29:51,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=603924.0, ans=0.0 2023-06-20 17:29:51,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=603924.0, ans=0.125 2023-06-20 17:30:26,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.55 vs. limit=22.5 2023-06-20 17:30:57,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=604044.0, ans=0.125 2023-06-20 17:31:03,050 INFO [train.py:996] (3/4) Epoch 4, batch 9200, loss[loss=0.2145, simple_loss=0.3036, pruned_loss=0.06269, over 21731.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.323, pruned_loss=0.08307, over 4263672.55 frames. ], batch size: 298, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:31:47,388 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.497e+02 2.858e+02 3.581e+02 5.930e+02, threshold=5.716e+02, percent-clipped=0.0 2023-06-20 17:32:01,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=8.0 2023-06-20 17:32:31,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=604344.0, ans=0.125 2023-06-20 17:32:36,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=604344.0, ans=10.0 2023-06-20 17:32:54,612 INFO [train.py:996] (3/4) Epoch 4, batch 9250, loss[loss=0.2251, simple_loss=0.2869, pruned_loss=0.08167, over 21912.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3264, pruned_loss=0.08687, over 4266151.07 frames. 
], batch size: 113, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:33:17,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-20 17:34:37,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=604704.0, ans=0.0 2023-06-20 17:34:38,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.12 vs. limit=22.5 2023-06-20 17:34:38,708 INFO [train.py:996] (3/4) Epoch 4, batch 9300, loss[loss=0.2087, simple_loss=0.278, pruned_loss=0.06969, over 21881.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3198, pruned_loss=0.0861, over 4274208.40 frames. ], batch size: 107, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:35:22,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.268e+02 3.884e+02 4.664e+02 7.347e+02, threshold=7.768e+02, percent-clipped=7.0 2023-06-20 17:35:26,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=604824.0, ans=0.0 2023-06-20 17:35:44,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=604824.0, ans=0.1 2023-06-20 17:35:58,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=604884.0, ans=0.125 2023-06-20 17:36:18,034 INFO [train.py:996] (3/4) Epoch 4, batch 9350, loss[loss=0.2554, simple_loss=0.3337, pruned_loss=0.08852, over 21566.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3261, pruned_loss=0.08779, over 4276130.50 frames. ], batch size: 263, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:37:05,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=605124.0, ans=0.2 2023-06-20 17:38:01,966 INFO [train.py:996] (3/4) Epoch 4, batch 9400, loss[loss=0.2267, simple_loss=0.2903, pruned_loss=0.08155, over 21134.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3286, pruned_loss=0.08925, over 4277281.89 frames. ], batch size: 143, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:38:09,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=605304.0, ans=0.07 2023-06-20 17:38:35,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.508e+02 2.883e+02 3.302e+02 5.601e+02, threshold=5.767e+02, percent-clipped=0.0 2023-06-20 17:38:43,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=605424.0, ans=0.125 2023-06-20 17:38:49,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=605424.0, ans=0.125 2023-06-20 17:38:53,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=605484.0, ans=0.125 2023-06-20 17:39:45,258 INFO [train.py:996] (3/4) Epoch 4, batch 9450, loss[loss=0.2058, simple_loss=0.2718, pruned_loss=0.06993, over 21725.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3193, pruned_loss=0.08776, over 4282679.27 frames. 
], batch size: 300, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:40:06,505 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:40:40,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=605784.0, ans=0.1 2023-06-20 17:41:18,636 INFO [train.py:996] (3/4) Epoch 4, batch 9500, loss[loss=0.2499, simple_loss=0.3167, pruned_loss=0.09153, over 21186.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3122, pruned_loss=0.08553, over 4262129.28 frames. ], batch size: 143, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:41:53,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.513e+02 2.842e+02 3.687e+02 6.250e+02, threshold=5.685e+02, percent-clipped=3.0 2023-06-20 17:41:59,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=606024.0, ans=0.125 2023-06-20 17:42:00,820 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:42:09,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=606084.0, ans=0.125 2023-06-20 17:42:36,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=606084.0, ans=0.0 2023-06-20 17:43:05,080 INFO [train.py:996] (3/4) Epoch 4, batch 9550, loss[loss=0.2932, simple_loss=0.3653, pruned_loss=0.1106, over 21195.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3165, pruned_loss=0.08704, over 4267774.97 frames. ], batch size: 143, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:43:29,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=606264.0, ans=0.125 2023-06-20 17:44:11,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=606384.0, ans=0.0 2023-06-20 17:44:51,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-20 17:44:55,266 INFO [train.py:996] (3/4) Epoch 4, batch 9600, loss[loss=0.2683, simple_loss=0.3372, pruned_loss=0.09971, over 21423.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3209, pruned_loss=0.09032, over 4272269.26 frames. ], batch size: 211, lr: 8.19e-03, grad_scale: 32.0 2023-06-20 17:45:05,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.93 vs. 
limit=15.0 2023-06-20 17:45:07,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=606504.0, ans=15.0 2023-06-20 17:45:15,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=606564.0, ans=0.125 2023-06-20 17:45:29,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.596e+02 2.958e+02 3.550e+02 7.860e+02, threshold=5.916e+02, percent-clipped=4.0 2023-06-20 17:45:37,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=606624.0, ans=0.0 2023-06-20 17:46:11,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=606744.0, ans=0.035 2023-06-20 17:46:32,677 INFO [train.py:996] (3/4) Epoch 4, batch 9650, loss[loss=0.2847, simple_loss=0.3374, pruned_loss=0.1159, over 21503.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3198, pruned_loss=0.08988, over 4273514.17 frames. ], batch size: 508, lr: 8.18e-03, grad_scale: 32.0 2023-06-20 17:46:54,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=606864.0, ans=0.0 2023-06-20 17:47:02,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=606864.0, ans=0.0 2023-06-20 17:47:53,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=606984.0, ans=0.2 2023-06-20 17:48:14,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=607044.0, ans=0.0 2023-06-20 17:48:18,427 INFO [train.py:996] (3/4) Epoch 4, batch 9700, loss[loss=0.2355, simple_loss=0.321, pruned_loss=0.07504, over 21636.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3226, pruned_loss=0.08979, over 4274739.73 frames. ], batch size: 263, lr: 8.18e-03, grad_scale: 32.0 2023-06-20 17:48:32,207 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=15.0 2023-06-20 17:48:46,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=607164.0, ans=0.07 2023-06-20 17:48:52,037 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.445e+02 2.765e+02 3.359e+02 4.931e+02, threshold=5.531e+02, percent-clipped=0.0 2023-06-20 17:50:13,092 INFO [train.py:996] (3/4) Epoch 4, batch 9750, loss[loss=0.2268, simple_loss=0.2879, pruned_loss=0.08281, over 15416.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3168, pruned_loss=0.08836, over 4272050.09 frames. 
], batch size: 60, lr: 8.18e-03, grad_scale: 32.0 2023-06-20 17:50:38,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=607464.0, ans=0.1 2023-06-20 17:50:40,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=607464.0, ans=0.0 2023-06-20 17:51:03,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=607584.0, ans=0.125 2023-06-20 17:51:06,364 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-06-20 17:51:43,075 INFO [train.py:996] (3/4) Epoch 4, batch 9800, loss[loss=0.2465, simple_loss=0.3282, pruned_loss=0.08243, over 21891.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3171, pruned_loss=0.0888, over 4274963.17 frames. ], batch size: 124, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 17:52:05,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=607764.0, ans=0.125 2023-06-20 17:52:18,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.616e+02 2.976e+02 3.497e+02 5.489e+02, threshold=5.952e+02, percent-clipped=0.0 2023-06-20 17:52:49,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.01 vs. limit=10.0 2023-06-20 17:53:19,582 INFO [train.py:996] (3/4) Epoch 4, batch 9850, loss[loss=0.2245, simple_loss=0.2814, pruned_loss=0.08381, over 21243.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3133, pruned_loss=0.08787, over 4278810.40 frames. ], batch size: 159, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 17:53:46,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=608064.0, ans=0.0 2023-06-20 17:54:15,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=608184.0, ans=0.0 2023-06-20 17:54:23,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=608184.0, ans=0.125 2023-06-20 17:54:45,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0 2023-06-20 17:54:48,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-20 17:54:57,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=608244.0, ans=0.125 2023-06-20 17:54:59,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=608304.0, ans=0.125 2023-06-20 17:55:00,047 INFO [train.py:996] (3/4) Epoch 4, batch 9900, loss[loss=0.247, simple_loss=0.3429, pruned_loss=0.07554, over 20740.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3091, pruned_loss=0.08686, over 4267418.67 frames. 
], batch size: 607, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 17:55:38,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.623e+02 2.955e+02 3.593e+02 6.748e+02, threshold=5.910e+02, percent-clipped=3.0 2023-06-20 17:55:47,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=608424.0, ans=0.1 2023-06-20 17:56:00,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=608484.0, ans=0.125 2023-06-20 17:56:26,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=608544.0, ans=0.125 2023-06-20 17:56:41,656 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:56:45,771 INFO [train.py:996] (3/4) Epoch 4, batch 9950, loss[loss=0.2521, simple_loss=0.3112, pruned_loss=0.0965, over 19955.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3127, pruned_loss=0.08938, over 4269902.85 frames. ], batch size: 702, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 17:57:02,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=608604.0, ans=0.125 2023-06-20 17:57:34,456 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-20 17:57:49,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=608784.0, ans=0.04949747468305833 2023-06-20 17:58:07,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.86 vs. limit=6.0 2023-06-20 17:58:36,072 INFO [train.py:996] (3/4) Epoch 4, batch 10000, loss[loss=0.2095, simple_loss=0.2746, pruned_loss=0.07219, over 21580.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3084, pruned_loss=0.08855, over 4263106.33 frames. ], batch size: 230, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 17:58:43,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-20 17:59:06,186 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.693e+02 3.176e+02 3.942e+02 5.803e+02, threshold=6.352e+02, percent-clipped=0.0 2023-06-20 17:59:08,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=609024.0, ans=10.0 2023-06-20 18:00:33,110 INFO [train.py:996] (3/4) Epoch 4, batch 10050, loss[loss=0.1948, simple_loss=0.2728, pruned_loss=0.0584, over 21663.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3106, pruned_loss=0.08936, over 4268927.55 frames. ], batch size: 332, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 18:00:40,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.32 vs. 
2023-06-20 18:00:50,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=609264.0, ans=0.05
2023-06-20 18:01:04,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=609324.0, ans=0.125
2023-06-20 18:02:11,240 INFO [train.py:996] (3/4) Epoch 4, batch 10100, loss[loss=0.2149, simple_loss=0.2792, pruned_loss=0.07529, over 21627.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3075, pruned_loss=0.08689, over 4261918.30 frames. ], batch size: 230, lr: 8.17e-03, grad_scale: 32.0
2023-06-20 18:02:17,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=609504.0, ans=0.0
2023-06-20 18:02:22,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.75 vs. limit=15.0
2023-06-20 18:02:26,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=609564.0, ans=0.125
2023-06-20 18:02:48,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.616e+02 3.061e+02 3.552e+02 5.046e+02, threshold=6.121e+02, percent-clipped=0.0
2023-06-20 18:03:33,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=609684.0, ans=0.04949747468305833
2023-06-20 18:03:33,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=609684.0, ans=0.125
2023-06-20 18:03:46,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=609744.0, ans=0.1
2023-06-20 18:03:51,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=609744.0, ans=0.1
2023-06-20 18:03:54,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=609744.0, ans=0.0
2023-06-20 18:03:56,768 INFO [train.py:996] (3/4) Epoch 4, batch 10150, loss[loss=0.243, simple_loss=0.3192, pruned_loss=0.08344, over 21261.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3135, pruned_loss=0.08854, over 4270294.12 frames. ], batch size: 176, lr: 8.16e-03, grad_scale: 32.0
2023-06-20 18:04:03,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=609804.0, ans=0.125
2023-06-20 18:04:12,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=609864.0, ans=15.0
2023-06-20 18:04:13,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=609864.0, ans=0.125
2023-06-20 18:04:19,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=609864.0, ans=0.2
2023-06-20 18:05:10,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=609984.0, ans=0.2
2023-06-20 18:05:30,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=610044.0, ans=0.2
2023-06-20 18:05:34,319 INFO [train.py:996] (3/4) Epoch 4, batch 10200, loss[loss=0.2518, simple_loss=0.3306, pruned_loss=0.08655, over 21702.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3138, pruned_loss=0.08646, over 4275758.89 frames. ], batch size: 415, lr: 8.16e-03, grad_scale: 32.0
2023-06-20 18:06:09,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.303e+02 2.650e+02 3.100e+02 6.273e+02, threshold=5.301e+02, percent-clipped=1.0
2023-06-20 18:06:10,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=610224.0, ans=0.0
2023-06-20 18:06:35,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=610284.0, ans=0.125
2023-06-20 18:06:52,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.70 vs. limit=22.5
2023-06-20 18:06:55,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=610344.0, ans=0.2
2023-06-20 18:07:06,499 INFO [train.py:996] (3/4) Epoch 4, batch 10250, loss[loss=0.2602, simple_loss=0.3389, pruned_loss=0.09077, over 21604.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3077, pruned_loss=0.08065, over 4279651.54 frames. ], batch size: 389, lr: 8.16e-03, grad_scale: 32.0
2023-06-20 18:07:13,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0
2023-06-20 18:08:02,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=610524.0, ans=0.125
2023-06-20 18:08:28,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=610584.0, ans=0.125
2023-06-20 18:08:46,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=610644.0, ans=0.0
2023-06-20 18:08:53,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=610644.0, ans=0.0
2023-06-20 18:08:58,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=610704.0, ans=0.1
2023-06-20 18:08:59,161 INFO [train.py:996] (3/4) Epoch 4, batch 10300, loss[loss=0.2758, simple_loss=0.3662, pruned_loss=0.09275, over 21907.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3121, pruned_loss=0.08225, over 4278230.24 frames. ], batch size: 372, lr: 8.16e-03, grad_scale: 32.0
2023-06-20 18:10:07,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 2.456e+02 2.901e+02 3.440e+02 5.624e+02, threshold=5.802e+02, percent-clipped=3.0
2023-06-20 18:10:09,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=610824.0, ans=0.2
2023-06-20 18:10:38,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=610944.0, ans=0.125
2023-06-20 18:11:11,074 INFO [train.py:996] (3/4) Epoch 4, batch 10350, loss[loss=0.2435, simple_loss=0.3185, pruned_loss=0.08423, over 21819.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3132, pruned_loss=0.08329, over 4267616.00 frames. ], batch size: 372, lr: 8.16e-03, grad_scale: 32.0
2023-06-20 18:11:12,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.95 vs. limit=6.0
2023-06-20 18:11:13,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=611004.0, ans=0.0
2023-06-20 18:11:41,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0
2023-06-20 18:11:57,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=611124.0, ans=0.1
2023-06-20 18:12:54,575 INFO [train.py:996] (3/4) Epoch 4, batch 10400, loss[loss=0.2244, simple_loss=0.302, pruned_loss=0.07338, over 21686.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3058, pruned_loss=0.081, over 4266060.43 frames. ], batch size: 391, lr: 8.15e-03, grad_scale: 32.0
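The recurring ScheduledFloat lines record hyper-parameters (dropout probabilities, skip rates, balancer limits) that are functions of batch_count rather than constants. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints with clamping outside the range; the breakpoints in the usage example are made up for illustration:

```python
class ScheduledFloat:
    """A float-valued hyper-parameter scheduled on batch_count (sketch)."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.points = sorted(points)
        self.batch_count = 0.0

    def __float__(self):
        pts = self.points
        if self.batch_count <= pts[0][0]:
            return float(pts[0][1])
        if self.batch_count >= pts[-1][0]:
            return float(pts[-1][1])
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= self.batch_count <= x1:
                t = (self.batch_count - x0) / (x1 - x0)
                return float(y0 + t * (y1 - y0))

# hypothetical usage: a conv_skip_rate that decays from 0.2 to 0.0
conv_skip_rate = ScheduledFloat((0.0, 0.2), (20000.0, 0.0))
conv_skip_rate.batch_count = 611124.0
print(float(conv_skip_rate))  # -> 0.0, matching ans=0.0 in the log above
```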
2023-06-20 18:13:06,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=611304.0, ans=0.2
2023-06-20 18:13:32,407 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 18:13:34,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=611364.0, ans=0.0
2023-06-20 18:13:36,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.614e+02 3.049e+02 3.745e+02 5.860e+02, threshold=6.098e+02, percent-clipped=1.0
2023-06-20 18:13:48,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=611424.0, ans=0.0
2023-06-20 18:14:27,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=611544.0, ans=0.07
2023-06-20 18:14:33,864 INFO [train.py:996] (3/4) Epoch 4, batch 10450, loss[loss=0.2582, simple_loss=0.3368, pruned_loss=0.08982, over 21613.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3109, pruned_loss=0.08448, over 4259546.50 frames. ], batch size: 263, lr: 8.15e-03, grad_scale: 32.0
2023-06-20 18:14:34,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=611604.0, ans=0.125
2023-06-20 18:15:00,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=611664.0, ans=0.125
2023-06-20 18:15:05,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=611664.0, ans=0.125
2023-06-20 18:15:05,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=611664.0, ans=0.0
2023-06-20 18:15:42,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5
2023-06-20 18:16:15,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=611784.0, ans=0.125
2023-06-20 18:16:49,366 INFO [train.py:996] (3/4) Epoch 4, batch 10500, loss[loss=0.2963, simple_loss=0.3992, pruned_loss=0.09674, over 19846.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3098, pruned_loss=0.08337, over 4257274.59 frames. ], batch size: 703, lr: 8.15e-03, grad_scale: 32.0
2023-06-20 18:16:52,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=611904.0, ans=0.0
2023-06-20 18:17:04,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=611904.0, ans=0.05
2023-06-20 18:17:08,037 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0
2023-06-20 18:17:25,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.554e+02 2.960e+02 3.444e+02 4.861e+02, threshold=5.921e+02, percent-clipped=0.0
2023-06-20 18:18:04,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=612084.0, ans=15.0
2023-06-20 18:18:05,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=612144.0, ans=0.125
2023-06-20 18:18:27,192 INFO [train.py:996] (3/4) Epoch 4, batch 10550, loss[loss=0.2226, simple_loss=0.2769, pruned_loss=0.08415, over 14886.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3073, pruned_loss=0.08411, over 4247408.76 frames. ], batch size: 60, lr: 8.15e-03, grad_scale: 32.0
2023-06-20 18:18:46,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612264.0, ans=0.1
2023-06-20 18:18:51,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=612264.0, ans=0.125
2023-06-20 18:19:06,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612324.0, ans=0.1
2023-06-20 18:19:08,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612324.0, ans=0.1
2023-06-20 18:20:06,500 INFO [train.py:996] (3/4) Epoch 4, batch 10600, loss[loss=0.2063, simple_loss=0.2614, pruned_loss=0.07559, over 21658.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3027, pruned_loss=0.08212, over 4245808.50 frames. ], batch size: 282, lr: 8.15e-03, grad_scale: 32.0
2023-06-20 18:20:16,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=612504.0, ans=0.125
2023-06-20 18:20:28,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=612564.0, ans=0.0
2023-06-20 18:20:38,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=612564.0, ans=0.0
2023-06-20 18:20:40,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=612564.0, ans=0.0
2023-06-20 18:20:42,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.468e+02 2.792e+02 3.375e+02 4.680e+02, threshold=5.585e+02, percent-clipped=0.0
2023-06-20 18:20:57,190 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0
2023-06-20 18:20:58,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=612624.0, ans=0.125
2023-06-20 18:21:11,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=612684.0, ans=0.125
2023-06-20 18:21:32,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=612744.0, ans=0.125
2023-06-20 18:21:51,937 INFO [train.py:996] (3/4) Epoch 4, batch 10650, loss[loss=0.1759, simple_loss=0.2564, pruned_loss=0.04773, over 21395.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3044, pruned_loss=0.0814, over 4251674.72 frames. ], batch size: 211, lr: 8.14e-03, grad_scale: 32.0
2023-06-20 18:22:35,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=612864.0, ans=12.0
2023-06-20 18:23:08,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=612984.0, ans=0.0
2023-06-20 18:23:43,992 INFO [train.py:996] (3/4) Epoch 4, batch 10700, loss[loss=0.2665, simple_loss=0.3358, pruned_loss=0.09859, over 21598.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3041, pruned_loss=0.08103, over 4242816.52 frames. ], batch size: 263, lr: 8.14e-03, grad_scale: 32.0
2023-06-20 18:23:58,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=613164.0, ans=0.0
2023-06-20 18:24:25,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.621e+02 3.333e+02 3.923e+02 6.693e+02, threshold=6.666e+02, percent-clipped=4.0
2023-06-20 18:24:56,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=613284.0, ans=0.125
2023-06-20 18:25:22,501 INFO [train.py:996] (3/4) Epoch 4, batch 10750, loss[loss=0.2428, simple_loss=0.3127, pruned_loss=0.08644, over 20669.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3145, pruned_loss=0.08579, over 4255257.86 frames. ], batch size: 607, lr: 8.14e-03, grad_scale: 32.0
2023-06-20 18:25:27,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=613404.0, ans=0.125
2023-06-20 18:25:29,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=613404.0, ans=0.04949747468305833
2023-06-20 18:26:02,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=613524.0, ans=0.125
2023-06-20 18:26:26,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=613584.0, ans=0.0
2023-06-20 18:27:06,303 INFO [train.py:996] (3/4) Epoch 4, batch 10800, loss[loss=0.254, simple_loss=0.3226, pruned_loss=0.09268, over 21903.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3198, pruned_loss=0.08644, over 4260430.89 frames. ], batch size: 316, lr: 8.14e-03, grad_scale: 32.0
2023-06-20 18:27:29,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=613704.0, ans=0.125
2023-06-20 18:27:44,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=613704.0, ans=0.125
2023-06-20 18:28:07,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.553e+02 2.986e+02 3.356e+02 5.834e+02, threshold=5.972e+02, percent-clipped=0.0
2023-06-20 18:28:20,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=613824.0, ans=0.125
2023-06-20 18:29:04,045 INFO [train.py:996] (3/4) Epoch 4, batch 10850, loss[loss=0.2642, simple_loss=0.322, pruned_loss=0.1032, over 21356.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3189, pruned_loss=0.0865, over 4259517.75 frames. ], batch size: 471, lr: 8.14e-03, grad_scale: 32.0
2023-06-20 18:29:58,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=614124.0, ans=0.0
2023-06-20 18:30:48,125 INFO [train.py:996] (3/4) Epoch 4, batch 10900, loss[loss=0.2324, simple_loss=0.3115, pruned_loss=0.07671, over 21591.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3119, pruned_loss=0.08442, over 4258223.33 frames. ], batch size: 414, lr: 8.13e-03, grad_scale: 32.0
2023-06-20 18:30:51,865 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 18:30:58,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=614304.0, ans=0.125
2023-06-20 18:31:33,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.427e+02 2.861e+02 3.280e+02 5.229e+02, threshold=5.723e+02, percent-clipped=0.0
2023-06-20 18:32:30,226 INFO [train.py:996] (3/4) Epoch 4, batch 10950, loss[loss=0.2067, simple_loss=0.2817, pruned_loss=0.0658, over 21754.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3085, pruned_loss=0.08205, over 4254788.35 frames. ], batch size: 351, lr: 8.13e-03, grad_scale: 32.0
2023-06-20 18:33:04,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=614724.0, ans=0.05
2023-06-20 18:33:22,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=614724.0, ans=0.0
2023-06-20 18:34:03,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0
2023-06-20 18:34:07,234 INFO [train.py:996] (3/4) Epoch 4, batch 11000, loss[loss=0.213, simple_loss=0.2865, pruned_loss=0.06982, over 21502.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3061, pruned_loss=0.08259, over 4263054.29 frames. ], batch size: 212, lr: 8.13e-03, grad_scale: 16.0
2023-06-20 18:34:51,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0
2023-06-20 18:35:02,763 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.436e+02 2.740e+02 3.123e+02 4.405e+02, threshold=5.481e+02, percent-clipped=0.0
2023-06-20 18:35:56,979 INFO [train.py:996] (3/4) Epoch 4, batch 11050, loss[loss=0.2156, simple_loss=0.2718, pruned_loss=0.07969, over 21775.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3046, pruned_loss=0.08402, over 4266828.27 frames. ], batch size: 351, lr: 8.13e-03, grad_scale: 16.0
2023-06-20 18:36:01,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=615204.0, ans=0.1
2023-06-20 18:36:55,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=615384.0, ans=0.125
2023-06-20 18:37:07,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=615444.0, ans=0.0
2023-06-20 18:37:19,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=615444.0, ans=0.0
2023-06-20 18:37:33,925 INFO [train.py:996] (3/4) Epoch 4, batch 11100, loss[loss=0.2871, simple_loss=0.3361, pruned_loss=0.119, over 21351.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3025, pruned_loss=0.08395, over 4266632.39 frames. ], batch size: 471, lr: 8.13e-03, grad_scale: 16.0
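The Whitening lines compare a per-module statistic against a limit (metric=X vs. limit=Y); the statistic measures how far the channel covariance of a module's output is from being proportional to the identity, i.e. how "un-white" the features are. One plausible way to compute such a metric, assuming it is the ratio of the mean squared covariance eigenvalue to the squared mean eigenvalue (1.0 for perfectly white features); this formula is an assumption for illustration, not taken from scaling.py:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels) activations from one module.

    Assumed metric: mean(eig^2) / mean(eig)^2 over the per-group channel
    covariance eigenvalues; equals 1.0 when the covariance is a multiple
    of identity and grows as variance concentrates in fewer directions.
    """
    num_channels = x.shape[1]
    assert num_channels % num_groups == 0
    # split channels into groups: (num_groups, num_frames, channels_per_group)
    x = x.reshape(-1, num_groups, num_channels // num_groups).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / x.shape[1]
    eigs = torch.linalg.eigvalsh(cov)   # real eigenvalues, symmetric matrix
    metric = (eigs ** 2).mean() / (eigs.mean() ** 2)
    return metric.item()

# a log line like the ones above would then be emitted when
# whitening_metric(x) exceeds the module's (scheduled) whitening limit.
```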
2023-06-20 18:37:40,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=615504.0, ans=0.125
2023-06-20 18:37:54,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5
2023-06-20 18:38:08,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=615624.0, ans=0.1
2023-06-20 18:38:11,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.701e+02 3.050e+02 3.889e+02 7.267e+02, threshold=6.099e+02, percent-clipped=1.0
2023-06-20 18:38:31,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0
2023-06-20 18:38:51,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=615744.0, ans=0.07
2023-06-20 18:39:08,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=615744.0, ans=0.1
2023-06-20 18:39:11,204 INFO [train.py:996] (3/4) Epoch 4, batch 11150, loss[loss=0.2378, simple_loss=0.3172, pruned_loss=0.0792, over 21770.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.2994, pruned_loss=0.08311, over 4252284.42 frames. ], batch size: 371, lr: 8.12e-03, grad_scale: 16.0
2023-06-20 18:39:26,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=615864.0, ans=0.1
2023-06-20 18:39:33,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.33 vs. limit=10.0
2023-06-20 18:40:03,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0
2023-06-20 18:40:06,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0
2023-06-20 18:40:15,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=615984.0, ans=0.125
2023-06-20 18:40:43,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0
2023-06-20 18:40:43,580 INFO [train.py:996] (3/4) Epoch 4, batch 11200, loss[loss=0.2185, simple_loss=0.2761, pruned_loss=0.08043, over 21329.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2972, pruned_loss=0.0826, over 4255689.84 frames. ], batch size: 211, lr: 8.12e-03, grad_scale: 32.0
2023-06-20 18:40:48,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=616104.0, ans=0.0
2023-06-20 18:41:20,884 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.501e+02 3.007e+02 3.463e+02 5.262e+02, threshold=6.015e+02, percent-clipped=0.0
2023-06-20 18:41:47,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=616284.0, ans=0.0
2023-06-20 18:41:50,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=616284.0, ans=0.1
2023-06-20 18:42:19,908 INFO [train.py:996] (3/4) Epoch 4, batch 11250, loss[loss=0.2409, simple_loss=0.3254, pruned_loss=0.07819, over 21796.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.2973, pruned_loss=0.0822, over 4260305.83 frames. ], batch size: 124, lr: 8.12e-03, grad_scale: 32.0
2023-06-20 18:42:23,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=616404.0, ans=0.125
2023-06-20 18:42:27,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=616404.0, ans=0.0
2023-06-20 18:42:36,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=616464.0, ans=0.2
2023-06-20 18:42:39,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=616464.0, ans=0.0
2023-06-20 18:42:49,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=616524.0, ans=0.1
2023-06-20 18:43:46,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=616644.0, ans=0.1
2023-06-20 18:43:55,851 INFO [train.py:996] (3/4) Epoch 4, batch 11300, loss[loss=0.2022, simple_loss=0.2785, pruned_loss=0.06294, over 21813.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2994, pruned_loss=0.08219, over 4272658.32 frames. ], batch size: 118, lr: 8.12e-03, grad_scale: 32.0
2023-06-20 18:43:59,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0
2023-06-20 18:44:05,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=616704.0, ans=0.125
2023-06-20 18:44:32,986 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 2.391e+02 2.715e+02 3.266e+02 4.844e+02, threshold=5.429e+02, percent-clipped=0.0
2023-06-20 18:45:06,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=616944.0, ans=0.125
2023-06-20 18:45:32,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=617004.0, ans=0.125
2023-06-20 18:45:33,082 INFO [train.py:996] (3/4) Epoch 4, batch 11350, loss[loss=0.2742, simple_loss=0.3453, pruned_loss=0.1016, over 21889.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3014, pruned_loss=0.08209, over 4263426.13 frames. ], batch size: 372, lr: 8.12e-03, grad_scale: 32.0
2023-06-20 18:45:51,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=617064.0, ans=0.125
2023-06-20 18:46:40,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=617124.0, ans=0.1
2023-06-20 18:47:05,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=617184.0, ans=0.125
2023-06-20 18:47:18,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=617244.0, ans=0.2
2023-06-20 18:47:27,855 INFO [train.py:996] (3/4) Epoch 4, batch 11400, loss[loss=0.2222, simple_loss=0.3023, pruned_loss=0.07105, over 21410.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3101, pruned_loss=0.08634, over 4267836.04 frames. ], batch size: 194, lr: 8.11e-03, grad_scale: 32.0
2023-06-20 18:47:28,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=617304.0, ans=0.1
2023-06-20 18:47:59,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0
2023-06-20 18:48:09,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.547e+02 3.001e+02 3.571e+02 5.767e+02, threshold=6.003e+02, percent-clipped=1.0
2023-06-20 18:48:42,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=617484.0, ans=0.125
2023-06-20 18:48:43,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.97 vs. limit=10.0
2023-06-20 18:49:04,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=617544.0, ans=0.125
2023-06-20 18:49:20,891 INFO [train.py:996] (3/4) Epoch 4, batch 11450, loss[loss=0.2308, simple_loss=0.3137, pruned_loss=0.07396, over 21637.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3116, pruned_loss=0.08504, over 4263961.37 frames. ], batch size: 389, lr: 8.11e-03, grad_scale: 32.0
2023-06-20 18:49:24,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=617604.0, ans=0.0
2023-06-20 18:49:31,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=617604.0, ans=0.1
2023-06-20 18:49:42,677 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 18:49:55,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=617664.0, ans=0.125
2023-06-20 18:50:30,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=617784.0, ans=0.025
2023-06-20 18:51:09,824 INFO [train.py:996] (3/4) Epoch 4, batch 11500, loss[loss=0.2591, simple_loss=0.3285, pruned_loss=0.09481, over 21470.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3148, pruned_loss=0.08615, over 4268948.04 frames. ], batch size: 211, lr: 8.11e-03, grad_scale: 32.0
2023-06-20 18:51:35,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=617904.0, ans=0.125
2023-06-20 18:51:52,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=618024.0, ans=0.0
2023-06-20 18:51:54,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.475e+02 2.866e+02 3.336e+02 5.251e+02, threshold=5.732e+02, percent-clipped=0.0
2023-06-20 18:51:57,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0
2023-06-20 18:52:58,001 INFO [train.py:996] (3/4) Epoch 4, batch 11550, loss[loss=0.1727, simple_loss=0.2337, pruned_loss=0.05585, over 16725.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.32, pruned_loss=0.08609, over 4266246.93 frames. ], batch size: 60, lr: 8.11e-03, grad_scale: 32.0
2023-06-20 18:53:30,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5
2023-06-20 18:53:54,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=618264.0, ans=0.125
2023-06-20 18:54:06,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=618324.0, ans=0.125
2023-06-20 18:54:08,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5
2023-06-20 18:54:26,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=618384.0, ans=0.1
2023-06-20 18:54:56,637 INFO [train.py:996] (3/4) Epoch 4, batch 11600, loss[loss=0.2534, simple_loss=0.3479, pruned_loss=0.07948, over 21657.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3322, pruned_loss=0.08791, over 4266989.59 frames. ], batch size: 263, lr: 8.11e-03, grad_scale: 32.0
2023-06-20 18:55:09,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=618504.0, ans=0.0
2023-06-20 18:55:40,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.611e+02 3.024e+02 3.614e+02 6.438e+02, threshold=6.048e+02, percent-clipped=2.0
2023-06-20 18:55:49,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=618624.0, ans=0.125
2023-06-20 18:56:34,569 INFO [train.py:996] (3/4) Epoch 4, batch 11650, loss[loss=0.2442, simple_loss=0.3261, pruned_loss=0.08118, over 21193.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3376, pruned_loss=0.08822, over 4266274.98 frames. ], batch size: 159, lr: 8.10e-03, grad_scale: 32.0
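Each train.py:996 line reports both the current batch's loss (over its own frame count) and tot_loss over roughly 4.27M frames; the frame total fluctuates rather than growing without bound, so tot_loss behaves like a frame-weighted running average over a decaying window. A minimal sketch of that kind of tracker, assuming exponential forgetting (the decay factor is a made-up illustration, not the recipe's actual bookkeeping):

```python
class RunningLoss:
    """Frame-weighted running average with exponential forgetting (sketch)."""

    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_sum = 0.0    # decayed sum of loss * frames
        self.frame_sum = 0.0   # decayed sum of frames

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames
        self.frame_sum = self.decay * self.frame_sum + num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frame_sum, 1.0)

tracker = RunningLoss()
tracker.update(loss=0.2442, num_frames=21193.0)
print(f"tot_loss[loss={tracker.value:.4f}, over {tracker.frame_sum:.2f} frames.]")
```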
2023-06-20 18:56:36,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=618804.0, ans=0.125
2023-06-20 18:56:52,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=618804.0, ans=0.125
2023-06-20 18:57:53,631 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 18:58:10,738 INFO [train.py:996] (3/4) Epoch 4, batch 11700, loss[loss=0.205, simple_loss=0.264, pruned_loss=0.07293, over 21609.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3299, pruned_loss=0.08817, over 4268621.52 frames. ], batch size: 247, lr: 8.10e-03, grad_scale: 32.0
2023-06-20 18:58:18,958 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=22.5
2023-06-20 18:58:52,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.548e+02 2.868e+02 3.351e+02 5.665e+02, threshold=5.736e+02, percent-clipped=0.0
2023-06-20 18:59:48,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=619344.0, ans=0.04949747468305833
2023-06-20 18:59:52,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=619344.0, ans=0.125
2023-06-20 18:59:56,747 INFO [train.py:996] (3/4) Epoch 4, batch 11750, loss[loss=0.2829, simple_loss=0.3785, pruned_loss=0.09371, over 19689.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.321, pruned_loss=0.08743, over 4249810.71 frames. ], batch size: 702, lr: 8.10e-03, grad_scale: 32.0
2023-06-20 19:01:04,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=619584.0, ans=0.2
2023-06-20 19:01:13,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=619584.0, ans=0.125
2023-06-20 19:01:15,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0
2023-06-20 19:01:31,463 INFO [train.py:996] (3/4) Epoch 4, batch 11800, loss[loss=0.2248, simple_loss=0.2898, pruned_loss=0.0799, over 21852.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3222, pruned_loss=0.08923, over 4260228.23 frames. ], batch size: 107, lr: 8.10e-03, grad_scale: 32.0
2023-06-20 19:01:38,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=619704.0, ans=0.125
2023-06-20 19:02:00,613 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 19:02:06,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=619764.0, ans=0.125
2023-06-20 19:02:13,519 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.675e+02 3.120e+02 4.087e+02 6.326e+02, threshold=6.239e+02, percent-clipped=4.0
2023-06-20 19:02:18,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=619824.0, ans=0.125
2023-06-20 19:02:41,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=619884.0, ans=0.1
2023-06-20 19:02:51,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=619944.0, ans=0.0
2023-06-20 19:03:08,261 INFO [train.py:996] (3/4) Epoch 4, batch 11850, loss[loss=0.2141, simple_loss=0.2957, pruned_loss=0.06627, over 21158.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3228, pruned_loss=0.08831, over 4268621.35 frames. ], batch size: 143, lr: 8.10e-03, grad_scale: 32.0
2023-06-20 19:03:16,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.52 vs. limit=22.5
2023-06-20 19:03:59,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=620124.0, ans=0.0
2023-06-20 19:04:11,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=620124.0, ans=0.0
2023-06-20 19:04:11,892 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.96 vs. limit=22.5
2023-06-20 19:04:16,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=620184.0, ans=0.125
2023-06-20 19:04:23,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=620184.0, ans=0.125
2023-06-20 19:04:58,308 INFO [train.py:996] (3/4) Epoch 4, batch 11900, loss[loss=0.209, simple_loss=0.294, pruned_loss=0.06202, over 21584.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3222, pruned_loss=0.08608, over 4267750.07 frames. ], batch size: 230, lr: 8.09e-03, grad_scale: 32.0
2023-06-20 19:05:00,896 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0
2023-06-20 19:05:01,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=620304.0, ans=0.125
2023-06-20 19:05:14,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=620364.0, ans=0.125
2023-06-20 19:05:36,513 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 2.326e+02 2.650e+02 3.050e+02 4.543e+02, threshold=5.300e+02, percent-clipped=0.0
2023-06-20 19:05:55,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=620484.0, ans=0.2
2023-06-20 19:05:57,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=15.0
2023-06-20 19:06:20,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0
2023-06-20 19:06:37,624 INFO [train.py:996] (3/4) Epoch 4, batch 11950, loss[loss=0.2717, simple_loss=0.3635, pruned_loss=0.08994, over 21632.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3212, pruned_loss=0.08186, over 4268802.47 frames. ], batch size: 441, lr: 8.09e-03, grad_scale: 32.0
2023-06-20 19:07:41,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=620784.0, ans=0.025
2023-06-20 19:07:56,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=620844.0, ans=0.0
2023-06-20 19:08:03,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=620844.0, ans=0.125
2023-06-20 19:08:12,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.07 vs. limit=10.0
2023-06-20 19:08:14,764 INFO [train.py:996] (3/4) Epoch 4, batch 12000, loss[loss=0.2143, simple_loss=0.2763, pruned_loss=0.07612, over 21418.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3185, pruned_loss=0.08054, over 4258266.99 frames. ], batch size: 211, lr: 8.09e-03, grad_scale: 32.0
2023-06-20 19:08:14,765 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-20 19:08:59,184 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7528, 2.4519, 2.6413, 2.9443, 2.4094, 2.3369, 2.8925, 2.8518], device='cuda:3')
2023-06-20 19:09:03,789 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2647, simple_loss=0.362, pruned_loss=0.08364, over 1796401.00 frames.
2023-06-20 19:09:03,790 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-20 19:09:32,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=620964.0, ans=0.125
2023-06-20 19:09:47,055 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.456e+02 3.065e+02 4.141e+02 8.942e+02, threshold=6.129e+02, percent-clipped=11.0
2023-06-20 19:10:31,615 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=22.5
2023-06-20 19:10:42,509 INFO [train.py:996] (3/4) Epoch 4, batch 12050, loss[loss=0.2505, simple_loss=0.3229, pruned_loss=0.08905, over 21841.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3176, pruned_loss=0.08284, over 4264880.75 frames. ], batch size: 351, lr: 8.09e-03, grad_scale: 32.0
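At batch 12000 the trainer pauses to compute a validation loss over a fixed dev set (here 1796401.00 frames) and reports peak GPU memory before resuming. A minimal sketch of that periodic check; model, dev_loader and compute_loss are hypothetical placeholders, and the interval value is an assumption:

```python
import torch

def maybe_validate(model, dev_loader, compute_loss, batch_idx,
                   valid_interval=3000):
    """Run a full pass over the dev set every `valid_interval` batches.

    Sketch with placeholder helpers; the interval is consistent with
    validation firing at batch 12000 here, but is an assumption.
    """
    if batch_idx % valid_interval != 0:
        return
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    print(f"validation: loss={tot_loss / tot_frames:.4f}, "
          f"over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is "
          f"{torch.cuda.max_memory_allocated() // (1024 * 1024)}MB")
```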
], tot_loss[loss=0.2416, simple_loss=0.3176, pruned_loss=0.08284, over 4264880.75 frames. ], batch size: 351, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:11:14,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=621264.0, ans=0.0 2023-06-20 19:11:26,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=621324.0, ans=0.2 2023-06-20 19:12:00,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=621384.0, ans=0.1 2023-06-20 19:12:17,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=621444.0, ans=0.125 2023-06-20 19:12:21,129 INFO [train.py:996] (3/4) Epoch 4, batch 12100, loss[loss=0.2672, simple_loss=0.3356, pruned_loss=0.09936, over 21321.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3222, pruned_loss=0.0862, over 4267576.60 frames. ], batch size: 143, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:12:30,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=621504.0, ans=0.125 2023-06-20 19:12:43,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=621504.0, ans=15.0 2023-06-20 19:13:09,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.714e+02 2.937e+02 3.590e+02 6.628e+02, threshold=5.874e+02, percent-clipped=1.0 2023-06-20 19:13:42,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-20 19:13:49,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=621744.0, ans=0.0 2023-06-20 19:14:10,378 INFO [train.py:996] (3/4) Epoch 4, batch 12150, loss[loss=0.2207, simple_loss=0.3003, pruned_loss=0.07059, over 21381.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3234, pruned_loss=0.08604, over 4268236.82 frames. ], batch size: 211, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:14:23,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=621864.0, ans=0.125 2023-06-20 19:15:10,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=621984.0, ans=0.125 2023-06-20 19:15:20,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=622044.0, ans=0.1 2023-06-20 19:15:47,744 INFO [train.py:996] (3/4) Epoch 4, batch 12200, loss[loss=0.2311, simple_loss=0.2827, pruned_loss=0.08979, over 21284.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3192, pruned_loss=0.08521, over 4274386.77 frames. ], batch size: 160, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:16:07,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=622164.0, ans=0.125 2023-06-20 19:16:19,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. 
2023-06-20 19:16:30,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.592e+02 3.097e+02 3.968e+02 7.788e+02, threshold=6.193e+02, percent-clipped=3.0
2023-06-20 19:16:39,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=622224.0, ans=0.1
2023-06-20 19:16:40,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=622224.0, ans=0.0
2023-06-20 19:16:52,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=622284.0, ans=0.125
2023-06-20 19:17:03,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=622344.0, ans=0.125
2023-06-20 19:17:06,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=622344.0, ans=0.0
2023-06-20 19:17:25,142 INFO [train.py:996] (3/4) Epoch 4, batch 12250, loss[loss=0.1575, simple_loss=0.2281, pruned_loss=0.04345, over 15936.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3126, pruned_loss=0.08284, over 4262476.68 frames. ], batch size: 60, lr: 8.08e-03, grad_scale: 32.0
2023-06-20 19:17:26,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=622404.0, ans=0.125
2023-06-20 19:18:08,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=622524.0, ans=0.125
2023-06-20 19:19:02,396 INFO [train.py:996] (3/4) Epoch 4, batch 12300, loss[loss=0.1484, simple_loss=0.2188, pruned_loss=0.03899, over 21283.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3032, pruned_loss=0.07621, over 4259489.21 frames. ], batch size: 159, lr: 8.08e-03, grad_scale: 32.0
2023-06-20 19:19:45,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 2.123e+02 2.583e+02 3.049e+02 4.453e+02, threshold=5.165e+02, percent-clipped=0.0
2023-06-20 19:20:42,923 INFO [train.py:996] (3/4) Epoch 4, batch 12350, loss[loss=0.2357, simple_loss=0.3033, pruned_loss=0.084, over 21844.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.309, pruned_loss=0.07768, over 4263591.76 frames. ], batch size: 282, lr: 8.08e-03, grad_scale: 32.0
2023-06-20 19:21:26,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0
2023-06-20 19:22:10,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=623244.0, ans=0.025
2023-06-20 19:22:18,913 INFO [train.py:996] (3/4) Epoch 4, batch 12400, loss[loss=0.2609, simple_loss=0.3198, pruned_loss=0.101, over 21317.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3107, pruned_loss=0.0814, over 4271076.63 frames. ], batch size: 176, lr: 8.08e-03, grad_scale: 32.0
2023-06-20 19:22:39,046 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.76 vs. limit=10.0
2023-06-20 19:22:53,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=623364.0, ans=0.1
2023-06-20 19:23:01,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.547e+02 2.868e+02 3.412e+02 7.340e+02, threshold=5.736e+02, percent-clipped=3.0
2023-06-20 19:23:04,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0
2023-06-20 19:23:45,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=623544.0, ans=0.125
2023-06-20 19:23:48,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=623544.0, ans=0.0
2023-06-20 19:23:48,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=623544.0, ans=0.125
2023-06-20 19:23:52,203 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=15.0
2023-06-20 19:23:57,373 INFO [train.py:996] (3/4) Epoch 4, batch 12450, loss[loss=0.2764, simple_loss=0.3488, pruned_loss=0.102, over 21947.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3152, pruned_loss=0.08444, over 4275400.88 frames. ], batch size: 316, lr: 8.07e-03, grad_scale: 32.0
2023-06-20 19:24:19,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=623604.0, ans=0.95
2023-06-20 19:24:26,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=623664.0, ans=0.1
2023-06-20 19:24:26,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=623664.0, ans=0.0
2023-06-20 19:24:58,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=623724.0, ans=0.125
2023-06-20 19:25:13,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=623784.0, ans=0.125
2023-06-20 19:25:20,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=623844.0, ans=0.125
2023-06-20 19:25:47,472 INFO [train.py:996] (3/4) Epoch 4, batch 12500, loss[loss=0.2742, simple_loss=0.3672, pruned_loss=0.09063, over 21434.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3249, pruned_loss=0.08831, over 4276145.01 frames. ], batch size: 194, lr: 8.07e-03, grad_scale: 32.0
2023-06-20 19:26:06,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=623964.0, ans=0.0
2023-06-20 19:26:23,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=624024.0, ans=0.2
2023-06-20 19:26:28,712 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.938e+02 3.235e+02 3.820e+02 6.603e+02, threshold=6.470e+02, percent-clipped=1.0
2023-06-20 19:26:43,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=624024.0, ans=0.5
2023-06-20 19:26:57,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=624084.0, ans=0.0
2023-06-20 19:26:58,109 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0
2023-06-20 19:27:43,432 INFO [train.py:996] (3/4) Epoch 4, batch 12550, loss[loss=0.2537, simple_loss=0.3261, pruned_loss=0.09066, over 21645.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3301, pruned_loss=0.09028, over 4279779.76 frames. ], batch size: 263, lr: 8.07e-03, grad_scale: 32.0
2023-06-20 19:28:18,589 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0
2023-06-20 19:29:19,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=624444.0, ans=0.125
2023-06-20 19:29:31,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=624444.0, ans=0.1
2023-06-20 19:29:34,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=624444.0, ans=0.2
2023-06-20 19:29:38,562 INFO [train.py:996] (3/4) Epoch 4, batch 12600, loss[loss=0.2239, simple_loss=0.3138, pruned_loss=0.06701, over 21844.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3287, pruned_loss=0.08784, over 4275464.11 frames. ], batch size: 372, lr: 8.07e-03, grad_scale: 32.0
2023-06-20 19:30:29,137 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.456e+02 2.778e+02 3.111e+02 4.488e+02, threshold=5.555e+02, percent-clipped=0.0
2023-06-20 19:30:29,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=624624.0, ans=0.125
2023-06-20 19:30:45,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=624624.0, ans=0.0
2023-06-20 19:30:58,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0
2023-06-20 19:31:04,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0
2023-06-20 19:31:17,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=624744.0, ans=0.1
2023-06-20 19:31:23,401 INFO [train.py:996] (3/4) Epoch 4, batch 12650, loss[loss=0.2479, simple_loss=0.3603, pruned_loss=0.06773, over 20794.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3194, pruned_loss=0.083, over 4270133.62 frames. ], batch size: 608, lr: 8.07e-03, grad_scale: 32.0
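The grad_scale value in the train.py lines (32.0 dropping to 16.0 and back over the course of this log) is the dynamic loss scale used for fp16 training: it is halved when overflowing gradients appear and grown again after a stretch of stable steps. A minimal sketch with PyTorch's stock GradScaler, which implements the same backoff/growth behavior; model, optimizer, batch and loss_fn are placeholders, and the constructor arguments here are illustrative rather than the recipe's settings:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=32.0, growth_factor=2.0,
                    backoff_factor=0.5, growth_interval=2000)

def train_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with autocast():                  # fp16 forward pass
        loss = loss_fn(model, batch)  # placeholder loss computation
    scaler.scale(loss).backward()     # scaled backward to avoid underflow
    scaler.step(optimizer)            # skips the update on inf/nan gradients
    scaler.update()                   # halves the scale on overflow, else grows
    return loss.detach(), scaler.get_scale()  # second value logged as grad_scale
```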
], tot_loss[loss=0.2427, simple_loss=0.3194, pruned_loss=0.083, over 4270133.62 frames. ], batch size: 608, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:31:28,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=624804.0, ans=0.125 2023-06-20 19:31:30,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=624804.0, ans=0.0 2023-06-20 19:31:36,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=624804.0, ans=0.05 2023-06-20 19:31:57,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=624864.0, ans=0.2 2023-06-20 19:32:43,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=624984.0, ans=0.0 2023-06-20 19:32:50,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=625044.0, ans=0.125 2023-06-20 19:32:53,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.71 vs. limit=15.0 2023-06-20 19:32:55,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=625044.0, ans=0.125 2023-06-20 19:33:06,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=15.0 2023-06-20 19:33:06,885 INFO [train.py:996] (3/4) Epoch 4, batch 12700, loss[loss=0.3015, simple_loss=0.3563, pruned_loss=0.1233, over 21425.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3194, pruned_loss=0.08558, over 4275737.76 frames. ], batch size: 471, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 19:33:51,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.673e+02 3.135e+02 3.667e+02 6.631e+02, threshold=6.269e+02, percent-clipped=1.0 2023-06-20 19:34:05,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=625224.0, ans=0.2 2023-06-20 19:34:34,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=625344.0, ans=0.125 2023-06-20 19:34:43,234 INFO [train.py:996] (3/4) Epoch 4, batch 12750, loss[loss=0.2485, simple_loss=0.3258, pruned_loss=0.08562, over 21811.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3196, pruned_loss=0.08604, over 4277976.94 frames. ], batch size: 415, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 19:34:55,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=625404.0, ans=0.07 2023-06-20 19:34:58,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=625404.0, ans=0.125 2023-06-20 19:36:06,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=625584.0, ans=0.125 2023-06-20 19:36:34,895 INFO [train.py:996] (3/4) Epoch 4, batch 12800, loss[loss=0.2895, simple_loss=0.3578, pruned_loss=0.1106, over 21301.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3192, pruned_loss=0.08684, over 4284938.11 frames. 
], batch size: 144, lr: 8.06e-03, grad_scale: 32.0 2023-06-20 19:36:43,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-20 19:37:03,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=625764.0, ans=0.0 2023-06-20 19:37:23,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.419e+02 2.676e+02 3.175e+02 5.760e+02, threshold=5.353e+02, percent-clipped=0.0 2023-06-20 19:38:37,383 INFO [train.py:996] (3/4) Epoch 4, batch 12850, loss[loss=0.2097, simple_loss=0.3036, pruned_loss=0.05786, over 21639.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3225, pruned_loss=0.08814, over 4288641.93 frames. ], batch size: 263, lr: 8.06e-03, grad_scale: 32.0 2023-06-20 19:38:48,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=626004.0, ans=0.125 2023-06-20 19:39:37,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=626184.0, ans=0.05 2023-06-20 19:40:16,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=626244.0, ans=0.125 2023-06-20 19:40:22,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=626304.0, ans=0.125 2023-06-20 19:40:23,450 INFO [train.py:996] (3/4) Epoch 4, batch 12900, loss[loss=0.1991, simple_loss=0.2725, pruned_loss=0.06286, over 21793.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3206, pruned_loss=0.08512, over 4285482.69 frames. ], batch size: 118, lr: 8.06e-03, grad_scale: 32.0 2023-06-20 19:40:47,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=626364.0, ans=0.1 2023-06-20 19:41:02,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.262e+02 2.706e+02 3.188e+02 4.993e+02, threshold=5.411e+02, percent-clipped=0.0 2023-06-20 19:41:45,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=626484.0, ans=0.125 2023-06-20 19:42:13,988 INFO [train.py:996] (3/4) Epoch 4, batch 12950, loss[loss=0.2412, simple_loss=0.313, pruned_loss=0.08468, over 21741.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3195, pruned_loss=0.08261, over 4275874.42 frames. ], batch size: 332, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:42:17,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=626604.0, ans=0.125 2023-06-20 19:42:18,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-20 19:42:25,720 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.57 vs. limit=15.0 2023-06-20 19:42:43,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. 
limit=15.0 2023-06-20 19:42:54,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=626724.0, ans=0.1 2023-06-20 19:43:20,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=626784.0, ans=0.07 2023-06-20 19:43:33,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=626844.0, ans=0.09899494936611666 2023-06-20 19:43:48,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-20 19:43:52,076 INFO [train.py:996] (3/4) Epoch 4, batch 13000, loss[loss=0.2923, simple_loss=0.444, pruned_loss=0.07031, over 19709.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3207, pruned_loss=0.08248, over 4272307.37 frames. ], batch size: 702, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:43:53,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=626904.0, ans=0.1 2023-06-20 19:44:30,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 2.289e+02 2.689e+02 3.180e+02 4.201e+02, threshold=5.379e+02, percent-clipped=0.0 2023-06-20 19:44:35,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=627024.0, ans=0.2 2023-06-20 19:45:29,788 INFO [train.py:996] (3/4) Epoch 4, batch 13050, loss[loss=0.2438, simple_loss=0.3117, pruned_loss=0.08798, over 21530.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3165, pruned_loss=0.08059, over 4267717.87 frames. ], batch size: 131, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:45:31,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=627204.0, ans=0.1 2023-06-20 19:45:41,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=627204.0, ans=0.1 2023-06-20 19:45:48,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.43 vs. limit=22.5 2023-06-20 19:45:58,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0 2023-06-20 19:46:22,727 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=8.0 2023-06-20 19:46:35,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=627324.0, ans=0.125 2023-06-20 19:47:31,044 INFO [train.py:996] (3/4) Epoch 4, batch 13100, loss[loss=0.2722, simple_loss=0.3506, pruned_loss=0.09683, over 21395.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3187, pruned_loss=0.08116, over 4276585.76 frames. 
], batch size: 131, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:47:42,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=627504.0, ans=0.05 2023-06-20 19:47:53,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=627564.0, ans=0.0 2023-06-20 19:48:05,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=627564.0, ans=0.125 2023-06-20 19:48:07,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=15.0 2023-06-20 19:48:16,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 2.575e+02 2.978e+02 3.531e+02 5.580e+02, threshold=5.955e+02, percent-clipped=1.0 2023-06-20 19:48:27,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-20 19:49:09,612 INFO [train.py:996] (3/4) Epoch 4, batch 13150, loss[loss=0.2202, simple_loss=0.243, pruned_loss=0.09873, over 20244.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3195, pruned_loss=0.08413, over 4280738.30 frames. ], batch size: 710, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:49:24,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=627864.0, ans=0.5 2023-06-20 19:50:26,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=627984.0, ans=0.125 2023-06-20 19:50:39,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=627984.0, ans=0.125 2023-06-20 19:51:09,810 INFO [train.py:996] (3/4) Epoch 4, batch 13200, loss[loss=0.2345, simple_loss=0.3002, pruned_loss=0.0844, over 21826.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3172, pruned_loss=0.08424, over 4280178.02 frames. ], batch size: 282, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:51:14,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=628104.0, ans=0.125 2023-06-20 19:51:25,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. limit=12.0 2023-06-20 19:51:25,532 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=22.5 2023-06-20 19:51:27,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=628104.0, ans=0.0 2023-06-20 19:51:27,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=628104.0, ans=0.125 2023-06-20 19:51:29,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.98 vs. 
limit=10.0 2023-06-20 19:51:54,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.426e+02 2.747e+02 3.167e+02 4.367e+02, threshold=5.495e+02, percent-clipped=0.0 2023-06-20 19:52:17,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=628284.0, ans=0.0 2023-06-20 19:52:26,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. limit=10.0 2023-06-20 19:52:48,959 INFO [train.py:996] (3/4) Epoch 4, batch 13250, loss[loss=0.2505, simple_loss=0.3184, pruned_loss=0.09129, over 21784.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3178, pruned_loss=0.08592, over 4285102.05 frames. ], batch size: 112, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:53:51,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=628524.0, ans=0.125 2023-06-20 19:53:57,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=628584.0, ans=0.0 2023-06-20 19:54:01,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=628584.0, ans=0.0 2023-06-20 19:54:27,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=628644.0, ans=0.125 2023-06-20 19:54:38,767 INFO [train.py:996] (3/4) Epoch 4, batch 13300, loss[loss=0.2821, simple_loss=0.362, pruned_loss=0.1011, over 21601.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3213, pruned_loss=0.08596, over 4286389.60 frames. ], batch size: 414, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:54:44,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-20 19:54:48,895 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:54:49,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-06-20 19:55:12,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-20 19:55:18,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=628764.0, ans=0.125 2023-06-20 19:55:23,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.499e+02 2.765e+02 3.137e+02 6.068e+02, threshold=5.530e+02, percent-clipped=1.0 2023-06-20 19:55:24,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=628824.0, ans=0.0 2023-06-20 19:56:17,444 INFO [train.py:996] (3/4) Epoch 4, batch 13350, loss[loss=0.2423, simple_loss=0.3178, pruned_loss=0.08335, over 20689.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3251, pruned_loss=0.08871, over 4289947.42 frames. 
], batch size: 607, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:57:47,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=629244.0, ans=0.125 2023-06-20 19:57:54,439 INFO [train.py:996] (3/4) Epoch 4, batch 13400, loss[loss=0.2496, simple_loss=0.3149, pruned_loss=0.09216, over 21428.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3269, pruned_loss=0.0907, over 4292195.53 frames. ], batch size: 211, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:58:26,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=629364.0, ans=0.035 2023-06-20 19:58:36,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=629424.0, ans=0.0 2023-06-20 19:58:38,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.621e+02 2.989e+02 3.422e+02 4.870e+02, threshold=5.978e+02, percent-clipped=0.0 2023-06-20 19:59:37,646 INFO [train.py:996] (3/4) Epoch 4, batch 13450, loss[loss=0.2215, simple_loss=0.2877, pruned_loss=0.07771, over 21661.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3288, pruned_loss=0.09242, over 4284064.04 frames. ], batch size: 247, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 20:00:10,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=629664.0, ans=0.125 2023-06-20 20:00:12,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=629724.0, ans=0.0 2023-06-20 20:00:12,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=629724.0, ans=0.2 2023-06-20 20:00:24,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=629724.0, ans=0.125 2023-06-20 20:00:48,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=629784.0, ans=0.07 2023-06-20 20:01:08,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=629844.0, ans=0.0 2023-06-20 20:01:21,154 INFO [train.py:996] (3/4) Epoch 4, batch 13500, loss[loss=0.2401, simple_loss=0.3124, pruned_loss=0.08385, over 21561.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3183, pruned_loss=0.08904, over 4283208.34 frames. ], batch size: 441, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 20:01:32,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=629904.0, ans=0.05 2023-06-20 20:01:41,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-20 20:01:55,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.636e+02 2.928e+02 3.414e+02 6.823e+02, threshold=5.856e+02, percent-clipped=1.0 2023-06-20 20:02:15,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=630024.0, ans=0.125 2023-06-20 20:02:23,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.38 vs. 
limit=22.5 2023-06-20 20:02:30,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=630084.0, ans=0.125 2023-06-20 20:02:36,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-20 20:02:38,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=630144.0, ans=0.125 2023-06-20 20:03:00,264 INFO [train.py:996] (3/4) Epoch 4, batch 13550, loss[loss=0.2971, simple_loss=0.3912, pruned_loss=0.1015, over 21764.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3231, pruned_loss=0.08889, over 4284990.74 frames. ], batch size: 351, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 20:03:08,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-20 20:03:49,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=630324.0, ans=0.125 2023-06-20 20:04:14,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=630384.0, ans=0.125 2023-06-20 20:04:29,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=630444.0, ans=0.125 2023-06-20 20:04:37,776 INFO [train.py:996] (3/4) Epoch 4, batch 13600, loss[loss=0.2258, simple_loss=0.2908, pruned_loss=0.08044, over 21272.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3244, pruned_loss=0.08964, over 4290815.75 frames. ], batch size: 176, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 20:04:38,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=630504.0, ans=0.035 2023-06-20 20:04:45,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=630504.0, ans=0.2 2023-06-20 20:05:17,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.746e+02 3.307e+02 4.061e+02 7.657e+02, threshold=6.614e+02, percent-clipped=3.0 2023-06-20 20:05:45,303 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:06:14,063 INFO [train.py:996] (3/4) Epoch 4, batch 13650, loss[loss=0.2067, simple_loss=0.2722, pruned_loss=0.07063, over 21719.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3191, pruned_loss=0.0863, over 4279015.68 frames. ], batch size: 316, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 20:06:25,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=630804.0, ans=0.125 2023-06-20 20:06:28,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=630864.0, ans=0.125 2023-06-20 20:07:52,675 INFO [train.py:996] (3/4) Epoch 4, batch 13700, loss[loss=0.333, simple_loss=0.392, pruned_loss=0.137, over 21494.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3147, pruned_loss=0.08627, over 4274765.67 frames. 
], batch size: 508, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 20:08:05,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=631104.0, ans=0.2 2023-06-20 20:08:11,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=631164.0, ans=0.125 2023-06-20 20:08:33,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=631224.0, ans=0.5 2023-06-20 20:08:39,110 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.840e+02 3.302e+02 4.432e+02 7.267e+02, threshold=6.603e+02, percent-clipped=2.0 2023-06-20 20:08:51,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=631224.0, ans=0.125 2023-06-20 20:09:03,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=631284.0, ans=0.125 2023-06-20 20:09:10,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=631284.0, ans=0.0 2023-06-20 20:09:21,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=631344.0, ans=0.125 2023-06-20 20:09:23,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=631344.0, ans=0.1 2023-06-20 20:09:31,696 INFO [train.py:996] (3/4) Epoch 4, batch 13750, loss[loss=0.1998, simple_loss=0.2994, pruned_loss=0.05008, over 19802.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3137, pruned_loss=0.08579, over 4270900.40 frames. ], batch size: 703, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:10:16,842 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:10:36,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=631524.0, ans=0.0 2023-06-20 20:10:55,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=631584.0, ans=0.0 2023-06-20 20:11:05,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=15.0 2023-06-20 20:11:11,721 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-20 20:11:18,438 INFO [train.py:996] (3/4) Epoch 4, batch 13800, loss[loss=0.2312, simple_loss=0.3118, pruned_loss=0.0753, over 21220.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3177, pruned_loss=0.08413, over 4261273.40 frames. ], batch size: 159, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:11:18,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=631704.0, ans=0.125 2023-06-20 20:11:19,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.40 vs. 
limit=10.0 2023-06-20 20:12:15,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.546e+02 3.323e+02 4.159e+02 7.075e+02, threshold=6.647e+02, percent-clipped=2.0 2023-06-20 20:13:14,494 INFO [train.py:996] (3/4) Epoch 4, batch 13850, loss[loss=0.2526, simple_loss=0.3248, pruned_loss=0.09017, over 21882.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3237, pruned_loss=0.08469, over 4270210.85 frames. ], batch size: 118, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:13:58,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=632124.0, ans=0.0 2023-06-20 20:14:20,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=632184.0, ans=0.0 2023-06-20 20:14:53,349 INFO [train.py:996] (3/4) Epoch 4, batch 13900, loss[loss=0.2582, simple_loss=0.3254, pruned_loss=0.09552, over 21739.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3273, pruned_loss=0.08855, over 4273699.98 frames. ], batch size: 351, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:15:03,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-20 20:15:39,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.652e+02 3.240e+02 3.689e+02 5.944e+02, threshold=6.479e+02, percent-clipped=0.0 2023-06-20 20:15:51,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=632484.0, ans=0.1 2023-06-20 20:16:37,004 INFO [train.py:996] (3/4) Epoch 4, batch 13950, loss[loss=0.2531, simple_loss=0.3119, pruned_loss=0.0971, over 21795.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3268, pruned_loss=0.09071, over 4279733.29 frames. ], batch size: 247, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:16:53,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-20 20:17:33,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.69 vs. limit=22.5 2023-06-20 20:18:13,421 INFO [train.py:996] (3/4) Epoch 4, batch 14000, loss[loss=0.2264, simple_loss=0.3175, pruned_loss=0.06759, over 21366.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3206, pruned_loss=0.08707, over 4272535.28 frames. ], batch size: 548, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:18:49,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=633024.0, ans=0.125 2023-06-20 20:18:53,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-20 20:18:53,727 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.412e+02 2.959e+02 3.758e+02 5.707e+02, threshold=5.918e+02, percent-clipped=0.0 2023-06-20 20:19:34,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=633144.0, ans=0.1 2023-06-20 20:19:50,093 INFO [train.py:996] (3/4) Epoch 4, batch 14050, loss[loss=0.2185, simple_loss=0.2775, pruned_loss=0.07974, over 21955.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3154, pruned_loss=0.08346, over 4272618.80 frames. 
], batch size: 103, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:20:01,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=633204.0, ans=0.125 2023-06-20 20:20:03,183 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:20:28,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=633324.0, ans=0.125 2023-06-20 20:20:32,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=633324.0, ans=0.2 2023-06-20 20:21:25,550 INFO [train.py:996] (3/4) Epoch 4, batch 14100, loss[loss=0.2095, simple_loss=0.267, pruned_loss=0.07599, over 21399.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3104, pruned_loss=0.08316, over 4263297.06 frames. ], batch size: 194, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:21:48,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=633564.0, ans=0.1 2023-06-20 20:21:49,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=633564.0, ans=0.0 2023-06-20 20:22:05,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.333e+02 2.713e+02 3.192e+02 5.959e+02, threshold=5.427e+02, percent-clipped=1.0 2023-06-20 20:22:52,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-20 20:22:55,900 INFO [train.py:996] (3/4) Epoch 4, batch 14150, loss[loss=0.2366, simple_loss=0.315, pruned_loss=0.07907, over 21304.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3148, pruned_loss=0.08471, over 4259801.85 frames. ], batch size: 159, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:22:56,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=633804.0, ans=0.0 2023-06-20 20:23:51,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=633984.0, ans=0.0 2023-06-20 20:24:11,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=634044.0, ans=0.125 2023-06-20 20:24:11,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=634044.0, ans=0.125 2023-06-20 20:24:24,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=634044.0, ans=0.0 2023-06-20 20:24:30,063 INFO [train.py:996] (3/4) Epoch 4, batch 14200, loss[loss=0.2872, simple_loss=0.317, pruned_loss=0.1287, over 21433.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3128, pruned_loss=0.08271, over 4266270.26 frames. ], batch size: 508, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:24:34,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. 
limit=12.0 2023-06-20 20:25:02,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=634164.0, ans=0.015 2023-06-20 20:25:15,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.204e+02 2.467e+02 2.859e+02 4.192e+02, threshold=4.934e+02, percent-clipped=0.0 2023-06-20 20:25:24,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=634284.0, ans=0.125 2023-06-20 20:25:24,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=634284.0, ans=0.1 2023-06-20 20:25:38,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=634284.0, ans=0.0 2023-06-20 20:25:41,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=634344.0, ans=0.0 2023-06-20 20:25:41,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=634344.0, ans=0.2 2023-06-20 20:25:49,975 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-20 20:26:06,334 INFO [train.py:996] (3/4) Epoch 4, batch 14250, loss[loss=0.1874, simple_loss=0.2591, pruned_loss=0.05783, over 21626.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3075, pruned_loss=0.08232, over 4261661.05 frames. ], batch size: 247, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:26:12,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=634404.0, ans=0.125 2023-06-20 20:27:10,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=634584.0, ans=0.1 2023-06-20 20:27:17,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-20 20:27:46,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=634644.0, ans=0.07 2023-06-20 20:27:50,753 INFO [train.py:996] (3/4) Epoch 4, batch 14300, loss[loss=0.2889, simple_loss=0.3797, pruned_loss=0.09902, over 21879.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3097, pruned_loss=0.0823, over 4263717.31 frames. ], batch size: 372, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:28:26,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=634764.0, ans=0.125 2023-06-20 20:28:29,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=634764.0, ans=0.125 2023-06-20 20:28:38,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.547e+02 2.874e+02 3.518e+02 5.819e+02, threshold=5.747e+02, percent-clipped=3.0 2023-06-20 20:29:28,328 INFO [train.py:996] (3/4) Epoch 4, batch 14350, loss[loss=0.2118, simple_loss=0.289, pruned_loss=0.06726, over 21798.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.313, pruned_loss=0.08219, over 4247351.14 frames. 
], batch size: 247, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:30:26,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=635184.0, ans=0.125 2023-06-20 20:31:05,227 INFO [train.py:996] (3/4) Epoch 4, batch 14400, loss[loss=0.2464, simple_loss=0.3014, pruned_loss=0.09569, over 21855.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3115, pruned_loss=0.08379, over 4248866.36 frames. ], batch size: 373, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 20:31:40,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=635364.0, ans=0.1 2023-06-20 20:31:53,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.383e+02 2.685e+02 3.261e+02 5.760e+02, threshold=5.369e+02, percent-clipped=1.0 2023-06-20 20:32:01,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=635484.0, ans=0.125 2023-06-20 20:32:10,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=635484.0, ans=0.125 2023-06-20 20:32:41,713 INFO [train.py:996] (3/4) Epoch 4, batch 14450, loss[loss=0.2382, simple_loss=0.2956, pruned_loss=0.0904, over 21186.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3069, pruned_loss=0.08369, over 4250794.02 frames. ], batch size: 143, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:32:43,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=635604.0, ans=0.0 2023-06-20 20:32:53,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=635604.0, ans=0.0 2023-06-20 20:33:53,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=635844.0, ans=0.125 2023-06-20 20:34:07,436 INFO [train.py:996] (3/4) Epoch 4, batch 14500, loss[loss=0.2372, simple_loss=0.321, pruned_loss=0.07672, over 21775.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.305, pruned_loss=0.08333, over 4258787.63 frames. 
], batch size: 371, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:34:50,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=636024.0, ans=0.2 2023-06-20 20:34:53,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=636024.0, ans=0.1 2023-06-20 20:34:56,072 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.605e+02 3.187e+02 4.339e+02 7.236e+02, threshold=6.375e+02, percent-clipped=9.0 2023-06-20 20:35:06,159 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:35:12,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=636084.0, ans=0.125 2023-06-20 20:35:15,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=636084.0, ans=0.1 2023-06-20 20:35:38,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=636144.0, ans=0.125 2023-06-20 20:35:45,547 INFO [train.py:996] (3/4) Epoch 4, batch 14550, loss[loss=0.286, simple_loss=0.355, pruned_loss=0.1085, over 21515.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3104, pruned_loss=0.08488, over 4264834.66 frames. ], batch size: 414, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:35:48,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=636204.0, ans=0.07 2023-06-20 20:36:25,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=636264.0, ans=0.125 2023-06-20 20:36:40,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=636324.0, ans=0.2 2023-06-20 20:37:05,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=636384.0, ans=0.2 2023-06-20 20:37:14,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=636444.0, ans=0.2 2023-06-20 20:37:15,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=636444.0, ans=0.1 2023-06-20 20:37:30,045 INFO [train.py:996] (3/4) Epoch 4, batch 14600, loss[loss=0.2566, simple_loss=0.349, pruned_loss=0.0821, over 21880.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.316, pruned_loss=0.08861, over 4267641.82 frames. 
], batch size: 316, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:37:34,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=636504.0, ans=0.125 2023-06-20 20:37:53,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=636564.0, ans=0.125 2023-06-20 20:37:56,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=636564.0, ans=0.05 2023-06-20 20:38:12,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.825e+02 3.267e+02 4.052e+02 6.777e+02, threshold=6.533e+02, percent-clipped=1.0 2023-06-20 20:39:00,571 INFO [train.py:996] (3/4) Epoch 4, batch 14650, loss[loss=0.2137, simple_loss=0.2846, pruned_loss=0.07137, over 21823.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.318, pruned_loss=0.0875, over 4270200.77 frames. ], batch size: 102, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:39:37,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=636864.0, ans=0.2 2023-06-20 20:40:35,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-20 20:40:38,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=637044.0, ans=0.125 2023-06-20 20:40:43,438 INFO [train.py:996] (3/4) Epoch 4, batch 14700, loss[loss=0.1955, simple_loss=0.2685, pruned_loss=0.06123, over 21313.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3108, pruned_loss=0.08121, over 4260443.13 frames. ], batch size: 131, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:40:54,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. limit=6.0 2023-06-20 20:41:26,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=637224.0, ans=0.1 2023-06-20 20:41:26,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.902e+02 2.312e+02 2.758e+02 4.493e+02, threshold=4.623e+02, percent-clipped=0.0 2023-06-20 20:42:27,588 INFO [train.py:996] (3/4) Epoch 4, batch 14750, loss[loss=0.3416, simple_loss=0.4022, pruned_loss=0.1405, over 21609.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3169, pruned_loss=0.08478, over 4262847.96 frames. ], batch size: 389, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:44:15,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=637644.0, ans=0.125 2023-06-20 20:44:17,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=637644.0, ans=0.125 2023-06-20 20:44:19,803 INFO [train.py:996] (3/4) Epoch 4, batch 14800, loss[loss=0.2434, simple_loss=0.3073, pruned_loss=0.08974, over 21181.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3297, pruned_loss=0.09047, over 4267875.85 frames. 
], batch size: 159, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:44:49,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=637764.0, ans=0.0 2023-06-20 20:44:49,981 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:44:56,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=637824.0, ans=0.0 2023-06-20 20:45:05,416 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 3.268e+02 3.887e+02 5.109e+02 8.215e+02, threshold=7.774e+02, percent-clipped=33.0 2023-06-20 20:45:56,116 INFO [train.py:996] (3/4) Epoch 4, batch 14850, loss[loss=0.2442, simple_loss=0.2989, pruned_loss=0.0947, over 14631.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3229, pruned_loss=0.08925, over 4266324.07 frames. ], batch size: 60, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:46:04,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=638004.0, ans=12.0 2023-06-20 20:46:51,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=638064.0, ans=0.125 2023-06-20 20:47:18,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=638184.0, ans=0.125 2023-06-20 20:47:23,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=638184.0, ans=0.125 2023-06-20 20:47:24,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=638184.0, ans=0.1 2023-06-20 20:47:27,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=638184.0, ans=0.125 2023-06-20 20:47:52,986 INFO [train.py:996] (3/4) Epoch 4, batch 14900, loss[loss=0.245, simple_loss=0.3187, pruned_loss=0.08562, over 21729.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3288, pruned_loss=0.09221, over 4265527.95 frames. ], batch size: 298, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:48:39,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=638364.0, ans=0.125 2023-06-20 20:49:02,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.913e+02 3.593e+02 4.421e+02 7.605e+02, threshold=7.185e+02, percent-clipped=0.0 2023-06-20 20:49:11,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=638484.0, ans=10.0 2023-06-20 20:49:26,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=638544.0, ans=0.1 2023-06-20 20:49:49,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=638544.0, ans=0.125 2023-06-20 20:49:51,620 INFO [train.py:996] (3/4) Epoch 4, batch 14950, loss[loss=0.2336, simple_loss=0.3175, pruned_loss=0.07486, over 21889.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3283, pruned_loss=0.09081, over 4268411.84 frames. 
], batch size: 317, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:50:15,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-20 20:50:16,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=638664.0, ans=0.125 2023-06-20 20:50:44,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=638724.0, ans=0.1 2023-06-20 20:51:29,224 INFO [train.py:996] (3/4) Epoch 4, batch 15000, loss[loss=0.2574, simple_loss=0.3183, pruned_loss=0.09825, over 21316.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3297, pruned_loss=0.0925, over 4272651.17 frames. ], batch size: 143, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:51:29,225 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 20:52:12,278 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.9259, 2.4917, 3.9690, 2.9604], device='cuda:3') 2023-06-20 20:52:21,547 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2644, simple_loss=0.3595, pruned_loss=0.08463, over 1796401.00 frames. 2023-06-20 20:52:21,548 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-20 20:52:41,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=638964.0, ans=0.0 2023-06-20 20:52:45,037 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-20 20:52:56,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=639024.0, ans=0.0 2023-06-20 20:52:57,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=639024.0, ans=0.0 2023-06-20 20:53:05,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.416e+02 2.838e+02 3.359e+02 5.526e+02, threshold=5.676e+02, percent-clipped=0.0 2023-06-20 20:54:04,780 INFO [train.py:996] (3/4) Epoch 4, batch 15050, loss[loss=0.2159, simple_loss=0.2839, pruned_loss=0.0739, over 21218.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3295, pruned_loss=0.0933, over 4271163.45 frames. ], batch size: 159, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 20:54:29,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=639264.0, ans=0.5 2023-06-20 20:55:01,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=639384.0, ans=0.0 2023-06-20 20:55:13,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.44 vs. limit=10.0 2023-06-20 20:55:16,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-20 20:55:39,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=639444.0, ans=0.2 2023-06-20 20:55:41,718 INFO [train.py:996] (3/4) Epoch 4, batch 15100, loss[loss=0.2814, simple_loss=0.3533, pruned_loss=0.1048, over 21782.00 frames. 
], tot_loss[loss=0.2599, simple_loss=0.3327, pruned_loss=0.09352, over 4271870.24 frames. ], batch size: 124, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 20:55:52,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=639504.0, ans=0.125 2023-06-20 20:56:01,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=639564.0, ans=0.125 2023-06-20 20:56:11,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=639564.0, ans=0.125 2023-06-20 20:56:23,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=639624.0, ans=0.125 2023-06-20 20:56:31,191 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.752e+02 3.160e+02 3.731e+02 5.799e+02, threshold=6.320e+02, percent-clipped=1.0 2023-06-20 20:56:44,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=639684.0, ans=0.125 2023-06-20 20:57:03,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=639744.0, ans=0.125 2023-06-20 20:57:17,848 INFO [train.py:996] (3/4) Epoch 4, batch 15150, loss[loss=0.2314, simple_loss=0.2866, pruned_loss=0.08811, over 21356.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3286, pruned_loss=0.09383, over 4268595.72 frames. ], batch size: 177, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 20:58:08,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=639924.0, ans=0.125 2023-06-20 20:58:14,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-20 20:58:53,763 INFO [train.py:996] (3/4) Epoch 4, batch 15200, loss[loss=0.212, simple_loss=0.2835, pruned_loss=0.07021, over 21271.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3207, pruned_loss=0.08934, over 4273013.43 frames. ], batch size: 176, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 20:59:22,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=640164.0, ans=0.1 2023-06-20 20:59:33,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=640224.0, ans=0.1 2023-06-20 20:59:43,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.484e+02 2.960e+02 3.412e+02 6.984e+02, threshold=5.920e+02, percent-clipped=2.0 2023-06-20 20:59:45,510 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:00:11,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=640284.0, ans=0.1 2023-06-20 21:00:29,812 INFO [train.py:996] (3/4) Epoch 4, batch 15250, loss[loss=0.2275, simple_loss=0.2801, pruned_loss=0.08745, over 21313.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3162, pruned_loss=0.08759, over 4256355.86 frames. 
], batch size: 549, lr: 7.97e-03, grad_scale: 32.0
2023-06-20 21:01:00,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=640464.0, ans=0.125
2023-06-20 21:01:03,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=640464.0, ans=0.125
2023-06-20 21:01:07,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=640524.0, ans=0.2
2023-06-20 21:02:06,989 INFO [train.py:996] (3/4) Epoch 4, batch 15300, loss[loss=0.2859, simple_loss=0.3583, pruned_loss=0.1067, over 21811.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3172, pruned_loss=0.09062, over 4260468.51 frames. ], batch size: 124, lr: 7.97e-03, grad_scale: 32.0
2023-06-20 21:02:51,695 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.89 vs. limit=22.5
2023-06-20 21:03:13,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.730e+02 3.158e+02 3.743e+02 8.189e+02, threshold=6.315e+02, percent-clipped=2.0
2023-06-20 21:03:42,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=640944.0, ans=0.0
2023-06-20 21:03:48,995 INFO [train.py:996] (3/4) Epoch 4, batch 15350, loss[loss=0.3035, simple_loss=0.3636, pruned_loss=0.1217, over 21758.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.322, pruned_loss=0.09335, over 4268070.57 frames. ], batch size: 441, lr: 7.96e-03, grad_scale: 32.0
2023-06-20 21:05:19,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=641184.0, ans=0.1
2023-06-20 21:05:38,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=641304.0, ans=0.0
2023-06-20 21:05:39,759 INFO [train.py:996] (3/4) Epoch 4, batch 15400, loss[loss=0.2413, simple_loss=0.3191, pruned_loss=0.08179, over 21866.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.322, pruned_loss=0.09146, over 4279927.79 frames. ], batch size: 124, lr: 7.96e-03, grad_scale: 32.0
2023-06-20 21:06:23,590 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.400e+02 2.718e+02 3.167e+02 5.553e+02, threshold=5.437e+02, percent-clipped=0.0
2023-06-20 21:06:47,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=641484.0, ans=0.0
2023-06-20 21:07:01,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0
2023-06-20 21:07:15,588 INFO [train.py:996] (3/4) Epoch 4, batch 15450, loss[loss=0.2318, simple_loss=0.3072, pruned_loss=0.07827, over 21817.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3198, pruned_loss=0.09028, over 4286757.35 frames. ], batch size: 282, lr: 7.96e-03, grad_scale: 32.0
2023-06-20 21:07:57,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0
2023-06-20 21:08:57,922 INFO [train.py:996] (3/4) Epoch 4, batch 15500, loss[loss=0.249, simple_loss=0.3185, pruned_loss=0.08977, over 21843.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3229, pruned_loss=0.08936, over 4272833.15 frames. ], batch size: 102, lr: 7.96e-03, grad_scale: 32.0
2023-06-20 21:09:03,478 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=15.0
2023-06-20 21:09:10,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=641904.0, ans=0.0
2023-06-20 21:09:11,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=641964.0, ans=0.125
2023-06-20 21:09:48,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.680e+02 2.414e+02 2.758e+02 3.182e+02 6.018e+02, threshold=5.516e+02, percent-clipped=3.0
2023-06-20 21:09:52,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=642024.0, ans=15.0
2023-06-20 21:10:24,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=642084.0, ans=0.125
2023-06-20 21:10:33,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=642144.0, ans=0.0
2023-06-20 21:10:33,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=642144.0, ans=0.125
2023-06-20 21:10:49,953 INFO [train.py:996] (3/4) Epoch 4, batch 15550, loss[loss=0.2626, simple_loss=0.3397, pruned_loss=0.0927, over 21632.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3212, pruned_loss=0.08647, over 4273902.86 frames. ], batch size: 414, lr: 7.96e-03, grad_scale: 32.0
2023-06-20 21:11:05,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0
2023-06-20 21:11:15,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=642264.0, ans=0.0
2023-06-20 21:11:45,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0
2023-06-20 21:11:48,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=642324.0, ans=0.07
2023-06-20 21:11:50,987 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 21:12:25,923 INFO [train.py:996] (3/4) Epoch 4, batch 15600, loss[loss=0.2143, simple_loss=0.2854, pruned_loss=0.07161, over 21172.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3149, pruned_loss=0.08483, over 4268964.24 frames. ], batch size: 143, lr: 7.95e-03, grad_scale: 32.0
2023-06-20 21:12:36,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=642504.0, ans=0.125
2023-06-20 21:13:21,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=642564.0, ans=0.035
2023-06-20 21:13:30,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=642624.0, ans=0.2
2023-06-20 21:13:32,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.378e+02 2.705e+02 3.069e+02 6.852e+02, threshold=5.411e+02, percent-clipped=1.0
2023-06-20 21:14:13,487 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0
2023-06-20 21:14:17,075 INFO [train.py:996] (3/4) Epoch 4, batch 15650, loss[loss=0.2309, simple_loss=0.2934, pruned_loss=0.08424, over 21714.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3131, pruned_loss=0.08385, over 4271451.41 frames. ], batch size: 316, lr: 7.95e-03, grad_scale: 32.0
2023-06-20 21:14:24,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=642804.0, ans=0.125
2023-06-20 21:15:09,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=642924.0, ans=0.125
2023-06-20 21:15:47,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=642984.0, ans=0.125
2023-06-20 21:15:50,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=642984.0, ans=0.125
2023-06-20 21:16:05,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=643044.0, ans=0.1
2023-06-20 21:16:19,618 INFO [train.py:996] (3/4) Epoch 4, batch 15700, loss[loss=0.2438, simple_loss=0.3152, pruned_loss=0.08619, over 21508.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3102, pruned_loss=0.08346, over 4267710.09 frames. ], batch size: 389, lr: 7.95e-03, grad_scale: 32.0
2023-06-20 21:16:47,140 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.61 vs. limit=22.5
2023-06-20 21:17:16,186 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.295e+02 2.525e+02 2.882e+02 4.287e+02, threshold=5.050e+02, percent-clipped=0.0
2023-06-20 21:17:17,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0
2023-06-20 21:17:28,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=643284.0, ans=0.125
2023-06-20 21:17:47,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=643344.0, ans=0.125
2023-06-20 21:17:58,121 INFO [train.py:996] (3/4) Epoch 4, batch 15750, loss[loss=0.2108, simple_loss=0.2861, pruned_loss=0.06769, over 21400.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3055, pruned_loss=0.0829, over 4266489.95 frames. ], batch size: 194, lr: 7.95e-03, grad_scale: 32.0
2023-06-20 21:18:51,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0
2023-06-20 21:19:42,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=643644.0, ans=15.0
2023-06-20 21:19:44,101 INFO [train.py:996] (3/4) Epoch 4, batch 15800, loss[loss=0.2776, simple_loss=0.3356, pruned_loss=0.1097, over 20097.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3007, pruned_loss=0.0827, over 4250382.89 frames. ], batch size: 703, lr: 7.95e-03, grad_scale: 16.0
2023-06-20 21:19:44,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=643704.0, ans=0.5
2023-06-20 21:20:20,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0
2023-06-20 21:20:30,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=643764.0, ans=0.0
2023-06-20 21:20:37,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=643764.0, ans=0.5
2023-06-20 21:20:44,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=643824.0, ans=0.0
2023-06-20 21:20:51,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.525e+02 2.910e+02 3.653e+02 6.469e+02, threshold=5.821e+02, percent-clipped=2.0
2023-06-20 21:21:28,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.60 vs. limit=12.0
2023-06-20 21:21:32,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=643944.0, ans=0.0
2023-06-20 21:21:34,812 INFO [train.py:996] (3/4) Epoch 4, batch 15850, loss[loss=0.2025, simple_loss=0.2618, pruned_loss=0.0716, over 21444.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3021, pruned_loss=0.08472, over 4250796.51 frames. ], batch size: 212, lr: 7.95e-03, grad_scale: 16.0
2023-06-20 21:21:38,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=644004.0, ans=15.0
2023-06-20 21:21:40,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=15.0
2023-06-20 21:22:25,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0
2023-06-20 21:23:12,749 INFO [train.py:996] (3/4) Epoch 4, batch 15900, loss[loss=0.2362, simple_loss=0.3172, pruned_loss=0.07758, over 21688.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3002, pruned_loss=0.08467, over 4240654.18 frames. ], batch size: 332, lr: 7.94e-03, grad_scale: 16.0
2023-06-20 21:23:14,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=644304.0, ans=0.125
2023-06-20 21:23:32,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=644364.0, ans=0.125
2023-06-20 21:23:47,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=644364.0, ans=0.125
2023-06-20 21:24:14,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0
2023-06-20 21:24:14,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.520e+02 3.020e+02 3.568e+02 5.069e+02, threshold=6.040e+02, percent-clipped=0.0
2023-06-20 21:24:18,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=644484.0, ans=0.035
2023-06-20 21:24:23,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.47 vs. limit=6.0
2023-06-20 21:24:28,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=644484.0, ans=0.0
2023-06-20 21:24:54,176 INFO [train.py:996] (3/4) Epoch 4, batch 15950, loss[loss=0.2217, simple_loss=0.3036, pruned_loss=0.06995, over 21673.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3007, pruned_loss=0.08213, over 4242663.73 frames. ], batch size: 389, lr: 7.94e-03, grad_scale: 16.0
2023-06-20 21:26:33,946 INFO [train.py:996] (3/4) Epoch 4, batch 16000, loss[loss=0.2021, simple_loss=0.295, pruned_loss=0.05461, over 21583.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3028, pruned_loss=0.07949, over 4248651.46 frames. ], batch size: 230, lr: 7.94e-03, grad_scale: 32.0
2023-06-20 21:26:34,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=644904.0, ans=0.1
2023-06-20 21:26:59,649 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0
2023-06-20 21:27:32,069 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.333e+02 2.770e+02 3.330e+02 6.195e+02, threshold=5.540e+02, percent-clipped=2.0
2023-06-20 21:28:08,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=645144.0, ans=0.125
2023-06-20 21:28:20,586 INFO [train.py:996] (3/4) Epoch 4, batch 16050, loss[loss=0.2432, simple_loss=0.3289, pruned_loss=0.07881, over 21329.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3048, pruned_loss=0.07744, over 4259993.05 frames. ], batch size: 194, lr: 7.94e-03, grad_scale: 32.0
2023-06-20 21:28:34,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=645204.0, ans=0.07
2023-06-20 21:28:54,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=645264.0, ans=15.0
2023-06-20 21:29:27,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=22.5
2023-06-20 21:29:32,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=645384.0, ans=0.125
2023-06-20 21:29:54,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=645444.0, ans=0.1
2023-06-20 21:30:05,004 INFO [train.py:996] (3/4) Epoch 4, batch 16100, loss[loss=0.2247, simple_loss=0.293, pruned_loss=0.07815, over 21929.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3088, pruned_loss=0.07923, over 4267852.49 frames. ], batch size: 316, lr: 7.94e-03, grad_scale: 32.0
2023-06-20 21:30:22,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=645564.0, ans=0.125
2023-06-20 21:30:35,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=645564.0, ans=0.1
2023-06-20 21:30:50,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=645624.0, ans=0.1
2023-06-20 21:30:50,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5
2023-06-20 21:31:00,660 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.499e+02 3.072e+02 4.062e+02 6.589e+02, threshold=6.145e+02, percent-clipped=6.0
2023-06-20 21:31:07,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=645684.0, ans=0.035
2023-06-20 21:31:07,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=645684.0, ans=0.0
2023-06-20 21:31:13,763 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 21:31:39,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.56 vs. limit=22.5
2023-06-20 21:31:41,516 INFO [train.py:996] (3/4) Epoch 4, batch 16150, loss[loss=0.2153, simple_loss=0.2831, pruned_loss=0.07376, over 21668.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3114, pruned_loss=0.08236, over 4279010.79 frames. ], batch size: 263, lr: 7.93e-03, grad_scale: 32.0
2023-06-20 21:31:41,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=645804.0, ans=0.0
2023-06-20 21:31:49,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=645804.0, ans=0.0
2023-06-20 21:31:49,958 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0
2023-06-20 21:32:00,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=645864.0, ans=0.125
2023-06-20 21:32:02,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0
2023-06-20 21:32:46,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.83 vs. limit=10.0
2023-06-20 21:32:49,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5
2023-06-20 21:33:04,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=646044.0, ans=0.125
2023-06-20 21:33:17,672 INFO [train.py:996] (3/4) Epoch 4, batch 16200, loss[loss=0.2877, simple_loss=0.3487, pruned_loss=0.1134, over 21653.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.316, pruned_loss=0.08334, over 4279448.05 frames. ], batch size: 263, lr: 7.93e-03, grad_scale: 32.0
2023-06-20 21:33:40,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=646164.0, ans=0.125
2023-06-20 21:34:14,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.796e+02 3.136e+02 3.765e+02 7.945e+02, threshold=6.271e+02, percent-clipped=2.0
2023-06-20 21:34:53,554 INFO [train.py:996] (3/4) Epoch 4, batch 16250, loss[loss=0.2147, simple_loss=0.2916, pruned_loss=0.06893, over 21805.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3152, pruned_loss=0.08336, over 4276409.21 frames. ], batch size: 372, lr: 7.93e-03, grad_scale: 32.0
2023-06-20 21:34:54,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=646404.0, ans=0.125
2023-06-20 21:34:55,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=646404.0, ans=0.0
2023-06-20 21:35:02,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0
2023-06-20 21:36:08,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=646584.0, ans=0.125
2023-06-20 21:36:30,014 INFO [train.py:996] (3/4) Epoch 4, batch 16300, loss[loss=0.2674, simple_loss=0.3336, pruned_loss=0.1006, over 21361.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3102, pruned_loss=0.08016, over 4278059.94 frames. ], batch size: 507, lr: 7.93e-03, grad_scale: 32.0
2023-06-20 21:37:23,036 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.55 vs. limit=15.0
2023-06-20 21:37:28,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 2.180e+02 2.582e+02 2.940e+02 4.968e+02, threshold=5.164e+02, percent-clipped=0.0
2023-06-20 21:38:09,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0
2023-06-20 21:38:10,103 INFO [train.py:996] (3/4) Epoch 4, batch 16350, loss[loss=0.2434, simple_loss=0.3256, pruned_loss=0.08062, over 19923.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3082, pruned_loss=0.0806, over 4268292.21 frames. ], batch size: 703, lr: 7.93e-03, grad_scale: 32.0
2023-06-20 21:39:28,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=647184.0, ans=0.05
2023-06-20 21:39:55,228 INFO [train.py:996] (3/4) Epoch 4, batch 16400, loss[loss=0.2361, simple_loss=0.3002, pruned_loss=0.08598, over 21848.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3118, pruned_loss=0.08268, over 4270746.70 frames. ], batch size: 107, lr: 7.92e-03, grad_scale: 32.0
2023-06-20 21:40:19,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=10.0
2023-06-20 21:40:23,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=647364.0, ans=0.2
2023-06-20 21:40:52,416 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.622e+02 2.917e+02 3.537e+02 6.383e+02, threshold=5.834e+02, percent-clipped=4.0
2023-06-20 21:41:20,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=647544.0, ans=0.125
2023-06-20 21:41:32,874 INFO [train.py:996] (3/4) Epoch 4, batch 16450, loss[loss=0.2386, simple_loss=0.3043, pruned_loss=0.08648, over 21250.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3129, pruned_loss=0.08432, over 4271353.76 frames. ], batch size: 143, lr: 7.92e-03, grad_scale: 32.0
2023-06-20 21:41:48,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=647604.0, ans=0.1
2023-06-20 21:41:58,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.56 vs. limit=15.0
2023-06-20 21:42:08,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=647664.0, ans=0.125
2023-06-20 21:42:18,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0
2023-06-20 21:42:45,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=647784.0, ans=0.125
2023-06-20 21:42:51,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=647844.0, ans=0.0
2023-06-20 21:42:54,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=647844.0, ans=0.2
2023-06-20 21:42:54,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=647844.0, ans=0.125
2023-06-20 21:43:17,620 INFO [train.py:996] (3/4) Epoch 4, batch 16500, loss[loss=0.2501, simple_loss=0.3317, pruned_loss=0.08425, over 21661.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3115, pruned_loss=0.0847, over 4279093.49 frames. ], batch size: 441, lr: 7.92e-03, grad_scale: 32.0
2023-06-20 21:43:18,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=647904.0, ans=0.0
2023-06-20 21:43:49,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=647964.0, ans=0.0
2023-06-20 21:44:00,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=647964.0, ans=0.0
2023-06-20 21:44:24,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0
2023-06-20 21:44:24,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.642e+02 3.003e+02 3.781e+02 6.239e+02, threshold=6.006e+02, percent-clipped=1.0
2023-06-20 21:44:59,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=648144.0, ans=0.0
2023-06-20 21:45:26,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=648144.0, ans=0.125
2023-06-20 21:45:32,252 INFO [train.py:996] (3/4) Epoch 4, batch 16550, loss[loss=0.2502, simple_loss=0.3321, pruned_loss=0.08415, over 21699.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3112, pruned_loss=0.08174, over 4273396.50 frames. ], batch size: 351, lr: 7.92e-03, grad_scale: 32.0
2023-06-20 21:45:57,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=648264.0, ans=0.125
2023-06-20 21:46:00,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=648264.0, ans=0.125
2023-06-20 21:47:16,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=648444.0, ans=0.125
2023-06-20 21:47:24,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0
2023-06-20 21:47:29,699 INFO [train.py:996] (3/4) Epoch 4, batch 16600, loss[loss=0.2717, simple_loss=0.3601, pruned_loss=0.0917, over 21383.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3199, pruned_loss=0.08558, over 4273430.00 frames. ], batch size: 131, lr: 7.92e-03, grad_scale: 32.0
2023-06-20 21:47:49,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=648564.0, ans=0.07
2023-06-20 21:47:51,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0
2023-06-20 21:47:51,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=648564.0, ans=15.0
2023-06-20 21:47:55,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.05 vs. limit=12.0
2023-06-20 21:48:22,115 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.551e+02 2.918e+02 3.524e+02 5.305e+02, threshold=5.835e+02, percent-clipped=0.0
2023-06-20 21:48:50,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=648684.0, ans=0.05
2023-06-20 21:48:50,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=648684.0, ans=0.0
2023-06-20 21:48:52,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=648744.0, ans=0.125
2023-06-20 21:48:54,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=648744.0, ans=0.0
2023-06-20 21:49:13,948 INFO [train.py:996] (3/4) Epoch 4, batch 16650, loss[loss=0.264, simple_loss=0.3356, pruned_loss=0.09621, over 21707.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3314, pruned_loss=0.08937, over 4272806.13 frames. ], batch size: 298, lr: 7.92e-03, grad_scale: 32.0
2023-06-20 21:49:28,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=648864.0, ans=0.125
2023-06-20 21:49:45,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0
2023-06-20 21:49:48,373 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0
2023-06-20 21:49:50,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=648924.0, ans=0.0
2023-06-20 21:50:24,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=648984.0, ans=0.1
2023-06-20 21:50:55,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.59 vs. limit=15.0
2023-06-20 21:50:55,789 INFO [train.py:996] (3/4) Epoch 4, batch 16700, loss[loss=0.2608, simple_loss=0.3562, pruned_loss=0.08271, over 21200.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.33, pruned_loss=0.08917, over 4272586.50 frames. ], batch size: 549, lr: 7.91e-03, grad_scale: 32.0
2023-06-20 21:52:03,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.731e+02 3.045e+02 3.440e+02 5.036e+02, threshold=6.090e+02, percent-clipped=0.0
2023-06-20 21:52:03,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=649224.0, ans=0.1
2023-06-20 21:52:55,499 INFO [train.py:996] (3/4) Epoch 4, batch 16750, loss[loss=0.2829, simple_loss=0.3707, pruned_loss=0.09758, over 21895.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3336, pruned_loss=0.09191, over 4261543.15 frames. ], batch size: 372, lr: 7.91e-03, grad_scale: 32.0
2023-06-20 21:52:56,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=649404.0, ans=0.0
2023-06-20 21:53:12,051 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 21:53:53,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=649524.0, ans=0.2
2023-06-20 21:53:57,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=649584.0, ans=0.2
2023-06-20 21:54:15,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=649584.0, ans=0.125
2023-06-20 21:54:44,327 INFO [train.py:996] (3/4) Epoch 4, batch 16800, loss[loss=0.2434, simple_loss=0.3041, pruned_loss=0.09136, over 21442.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3389, pruned_loss=0.09289, over 4261671.86 frames. ], batch size: 211, lr: 7.91e-03, grad_scale: 32.0
2023-06-20 21:55:06,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=649764.0, ans=0.125
2023-06-20 21:55:23,473 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 21:55:44,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.879e+02 3.327e+02 4.049e+02 9.640e+02, threshold=6.654e+02, percent-clipped=7.0
2023-06-20 21:55:48,483 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 21:56:54,184 INFO [train.py:996] (3/4) Epoch 4, batch 16850, loss[loss=0.2512, simple_loss=0.3142, pruned_loss=0.09413, over 21643.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3352, pruned_loss=0.09291, over 4270681.93 frames. ], batch size: 263, lr: 7.91e-03, grad_scale: 32.0
2023-06-20 21:58:22,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=650244.0, ans=0.09899494936611666
2023-06-20 21:58:30,571 INFO [train.py:996] (3/4) Epoch 4, batch 16900, loss[loss=0.2081, simple_loss=0.2762, pruned_loss=0.07001, over 20780.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3281, pruned_loss=0.09112, over 4279838.26 frames. ], batch size: 608, lr: 7.91e-03, grad_scale: 32.0
2023-06-20 21:58:55,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=15.0
2023-06-20 21:59:16,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 2.471e+02 2.836e+02 3.403e+02 5.773e+02, threshold=5.671e+02, percent-clipped=0.0
2023-06-20 21:59:19,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=650484.0, ans=0.125
2023-06-20 22:00:07,490 INFO [train.py:996] (3/4) Epoch 4, batch 16950, loss[loss=0.243, simple_loss=0.312, pruned_loss=0.08701, over 21737.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3202, pruned_loss=0.08963, over 4281370.73 frames. ], batch size: 389, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:00:19,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=650604.0, ans=0.1
2023-06-20 22:00:22,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=650664.0, ans=0.2
2023-06-20 22:00:34,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=650664.0, ans=0.125
2023-06-20 22:01:30,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=650784.0, ans=0.0
2023-06-20 22:01:30,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=650784.0, ans=0.0
2023-06-20 22:01:54,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=650844.0, ans=0.95
2023-06-20 22:01:59,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=650844.0, ans=0.0
2023-06-20 22:02:01,824 INFO [train.py:996] (3/4) Epoch 4, batch 17000, loss[loss=0.2507, simple_loss=0.3148, pruned_loss=0.09332, over 21891.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3157, pruned_loss=0.0892, over 4288246.25 frames. ], batch size: 124, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:02:14,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=650904.0, ans=0.2
2023-06-20 22:02:46,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=651024.0, ans=15.0
2023-06-20 22:02:47,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.716e+02 3.273e+02 3.995e+02 5.622e+02, threshold=6.546e+02, percent-clipped=0.0
2023-06-20 22:02:53,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=651084.0, ans=0.125
2023-06-20 22:03:30,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=651144.0, ans=0.2
2023-06-20 22:03:31,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=651144.0, ans=0.2
2023-06-20 22:03:34,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=651144.0, ans=0.125
2023-06-20 22:03:38,868 INFO [train.py:996] (3/4) Epoch 4, batch 17050, loss[loss=0.298, simple_loss=0.3778, pruned_loss=0.1091, over 21784.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3225, pruned_loss=0.09118, over 4291163.52 frames. ], batch size: 414, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:03:42,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=651204.0, ans=0.125
2023-06-20 22:04:24,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=651324.0, ans=0.05
2023-06-20 22:04:24,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=651324.0, ans=0.125
2023-06-20 22:05:13,755 INFO [train.py:996] (3/4) Epoch 4, batch 17100, loss[loss=0.2405, simple_loss=0.3022, pruned_loss=0.08936, over 21477.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3222, pruned_loss=0.09139, over 4289031.39 frames. ], batch size: 211, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:05:35,238 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5
2023-06-20 22:05:57,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=22.5
2023-06-20 22:05:59,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.611e+02 3.092e+02 3.570e+02 5.618e+02, threshold=6.184e+02, percent-clipped=0.0
2023-06-20 22:06:42,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=651744.0, ans=0.125
2023-06-20 22:06:51,001 INFO [train.py:996] (3/4) Epoch 4, batch 17150, loss[loss=0.2149, simple_loss=0.2646, pruned_loss=0.08264, over 21249.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3179, pruned_loss=0.09062, over 4289216.09 frames. ], batch size: 608, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:07:22,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0
2023-06-20 22:07:27,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=651864.0, ans=0.125
2023-06-20 22:07:49,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=651924.0, ans=0.2
2023-06-20 22:08:35,504 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0
2023-06-20 22:08:48,887 INFO [train.py:996] (3/4) Epoch 4, batch 17200, loss[loss=0.2556, simple_loss=0.3308, pruned_loss=0.09021, over 21508.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3161, pruned_loss=0.0897, over 4286818.39 frames. ], batch size: 112, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:09:44,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0
2023-06-20 22:09:48,592 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.497e+02 2.913e+02 3.458e+02 6.747e+02, threshold=5.827e+02, percent-clipped=3.0
2023-06-20 22:09:50,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=652284.0, ans=0.1
2023-06-20 22:09:52,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=652284.0, ans=0.125
2023-06-20 22:10:29,123 INFO [train.py:996] (3/4) Epoch 4, batch 17250, loss[loss=0.2802, simple_loss=0.3607, pruned_loss=0.09985, over 21752.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3199, pruned_loss=0.0912, over 4293344.07 frames. ], batch size: 298, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:10:34,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5
2023-06-20 22:11:18,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=652524.0, ans=0.04949747468305833
2023-06-20 22:11:34,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=652584.0, ans=0.0
2023-06-20 22:11:56,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=652644.0, ans=0.0
2023-06-20 22:12:04,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=652644.0, ans=0.0
2023-06-20 22:12:08,324 INFO [train.py:996] (3/4) Epoch 4, batch 17300, loss[loss=0.2612, simple_loss=0.3284, pruned_loss=0.09703, over 21492.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3293, pruned_loss=0.09521, over 4287060.56 frames. ], batch size: 194, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:12:57,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=652824.0, ans=0.0
2023-06-20 22:13:00,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=652824.0, ans=0.5
2023-06-20 22:13:02,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=652824.0, ans=0.125
2023-06-20 22:13:02,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=652824.0, ans=0.125
2023-06-20 22:13:06,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=652824.0, ans=0.0
2023-06-20 22:13:13,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.676e+02 3.122e+02 3.587e+02 4.716e+02, threshold=6.244e+02, percent-clipped=0.0
2023-06-20 22:13:22,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=652884.0, ans=0.125
2023-06-20 22:13:41,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=652944.0, ans=0.025
2023-06-20 22:13:52,571 INFO [train.py:996] (3/4) Epoch 4, batch 17350, loss[loss=0.2448, simple_loss=0.3229, pruned_loss=0.08334, over 19867.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3308, pruned_loss=0.09487, over 4279162.17 frames. ], batch size: 702, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:14:04,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=653004.0, ans=0.0
2023-06-20 22:14:30,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=653064.0, ans=0.025
2023-06-20 22:14:54,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=653124.0, ans=0.125
2023-06-20 22:14:54,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=653124.0, ans=0.125
2023-06-20 22:15:05,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.74 vs. limit=22.5
2023-06-20 22:15:43,270 INFO [train.py:996] (3/4) Epoch 4, batch 17400, loss[loss=0.2137, simple_loss=0.289, pruned_loss=0.06915, over 21640.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3269, pruned_loss=0.09072, over 4281518.02 frames. ], batch size: 247, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:15:48,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=653304.0, ans=0.1
2023-06-20 22:16:49,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.607e+02 3.173e+02 3.817e+02 7.670e+02, threshold=6.346e+02, percent-clipped=2.0
2023-06-20 22:17:25,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0
2023-06-20 22:17:40,496 INFO [train.py:996] (3/4) Epoch 4, batch 17450, loss[loss=0.2236, simple_loss=0.3164, pruned_loss=0.06538, over 21698.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3229, pruned_loss=0.08806, over 4284288.46 frames. ], batch size: 414, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:17:48,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=653604.0, ans=0.125
2023-06-20 22:17:52,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=653604.0, ans=0.125
2023-06-20 22:18:00,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=653664.0, ans=0.2
2023-06-20 22:18:02,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=653664.0, ans=0.125
2023-06-20 22:18:12,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=653664.0, ans=0.0
2023-06-20 22:18:45,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0
2023-06-20 22:18:58,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=653844.0, ans=0.1
2023-06-20 22:19:08,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0
2023-06-20 22:19:11,787 INFO [train.py:996] (3/4) Epoch 4, batch 17500, loss[loss=0.2351, simple_loss=0.2985, pruned_loss=0.08586, over 21777.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3173, pruned_loss=0.08514, over 4278951.87 frames. ], batch size: 247, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:19:21,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=653904.0, ans=0.125
2023-06-20 22:19:31,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=653964.0, ans=0.1
2023-06-20 22:19:36,545 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0
2023-06-20 22:19:43,306 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 22:19:45,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=15.0
2023-06-20 22:20:00,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.227e+02 2.545e+02 2.907e+02 4.795e+02, threshold=5.089e+02, percent-clipped=0.0
2023-06-20 22:20:10,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=654084.0, ans=0.125
2023-06-20 22:20:38,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=654144.0, ans=0.125
2023-06-20 22:20:41,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=654144.0, ans=0.1
2023-06-20 22:20:46,772 INFO [train.py:996] (3/4) Epoch 4, batch 17550, loss[loss=0.212, simple_loss=0.3018, pruned_loss=0.06114, over 21797.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3171, pruned_loss=0.084, over 4283663.50 frames. ], batch size: 316, lr: 7.88e-03, grad_scale: 16.0
2023-06-20 22:21:21,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=654324.0, ans=0.125
2023-06-20 22:21:32,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=15.0
2023-06-20 22:21:35,306 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 22:21:48,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=654384.0, ans=0.125
2023-06-20 22:22:22,679 INFO [train.py:996] (3/4) Epoch 4, batch 17600, loss[loss=0.2557, simple_loss=0.3246, pruned_loss=0.09339, over 21828.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3191, pruned_loss=0.08395, over 4281328.09 frames. ], batch size: 282, lr: 7.88e-03, grad_scale: 32.0
2023-06-20 22:22:55,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.17 vs. limit=12.0
2023-06-20 22:23:12,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.529e+02 3.022e+02 3.829e+02 6.673e+02, threshold=6.045e+02, percent-clipped=10.0
2023-06-20 22:23:19,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=654684.0, ans=0.0
2023-06-20 22:23:59,704 INFO [train.py:996] (3/4) Epoch 4, batch 17650, loss[loss=0.1693, simple_loss=0.2209, pruned_loss=0.0589, over 21233.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3177, pruned_loss=0.0846, over 4275974.44 frames. ], batch size: 159, lr: 7.88e-03, grad_scale: 32.0
2023-06-20 22:24:34,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=654924.0, ans=0.125
2023-06-20 22:24:50,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=654924.0, ans=0.2
2023-06-20 22:24:51,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=654924.0, ans=0.04949747468305833
2023-06-20 22:25:36,825 INFO [train.py:996] (3/4) Epoch 4, batch 17700, loss[loss=0.2532, simple_loss=0.3397, pruned_loss=0.08337, over 21945.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3121, pruned_loss=0.08161, over 4274947.65 frames. ], batch size: 317, lr: 7.88e-03, grad_scale: 32.0
2023-06-20 22:25:37,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=655104.0, ans=0.125
2023-06-20 22:25:49,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=655104.0, ans=0.125
2023-06-20 22:26:12,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=655164.0, ans=0.0
2023-06-20 22:26:42,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.561e+02 3.035e+02 3.633e+02 7.633e+02, threshold=6.070e+02, percent-clipped=4.0
2023-06-20 22:27:25,851 INFO [train.py:996] (3/4) Epoch 4, batch 17750, loss[loss=0.2516, simple_loss=0.3347, pruned_loss=0.08425, over 20648.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3205, pruned_loss=0.08621, over 4273135.39 frames. ], batch size: 607, lr: 7.88e-03, grad_scale: 32.0
2023-06-20 22:29:23,616 INFO [train.py:996] (3/4) Epoch 4, batch 17800, loss[loss=0.2265, simple_loss=0.302, pruned_loss=0.07552, over 21824.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3188, pruned_loss=0.08493, over 4271116.19 frames. ], batch size: 282, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:30:00,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=655764.0, ans=0.2
2023-06-20 22:30:30,537 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.510e+02 2.835e+02 3.285e+02 4.580e+02, threshold=5.670e+02, percent-clipped=0.0
2023-06-20 22:30:47,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=655944.0, ans=0.0
2023-06-20 22:31:14,507 INFO [train.py:996] (3/4) Epoch 4, batch 17850, loss[loss=0.2496, simple_loss=0.3266, pruned_loss=0.08627, over 21473.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3184, pruned_loss=0.08483, over 4275488.29 frames. ], batch size: 131, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:31:18,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=656004.0, ans=0.125
2023-06-20 22:31:49,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0
2023-06-20 22:33:12,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=656244.0, ans=0.125
2023-06-20 22:33:25,139 INFO [train.py:996] (3/4) Epoch 4, batch 17900, loss[loss=0.2385, simple_loss=0.3174, pruned_loss=0.07983, over 21290.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3234, pruned_loss=0.08717, over 4273974.84 frames. ], batch size: 159, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:33:42,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=656304.0, ans=0.125
2023-06-20 22:33:52,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=656364.0, ans=0.0
2023-06-20 22:34:19,836 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.504e+02 2.807e+02 3.225e+02 4.472e+02, threshold=5.613e+02, percent-clipped=0.0
2023-06-20 22:34:52,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=656484.0, ans=0.2
2023-06-20 22:35:02,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=656544.0, ans=0.125
2023-06-20 22:35:20,760 INFO [train.py:996] (3/4) Epoch 4, batch 17950, loss[loss=0.2484, simple_loss=0.339, pruned_loss=0.07886, over 21572.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3234, pruned_loss=0.08382, over 4275201.96 frames. ], batch size: 441, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:35:38,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=656664.0, ans=0.0
2023-06-20 22:35:53,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=656664.0, ans=0.125
2023-06-20 22:36:13,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=656784.0, ans=0.0
2023-06-20 22:36:38,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=656844.0, ans=0.125
2023-06-20 22:36:58,689 INFO [train.py:996] (3/4) Epoch 4, batch 18000, loss[loss=0.2327, simple_loss=0.2922, pruned_loss=0.08658, over 16091.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3154, pruned_loss=0.08197, over 4275947.40 frames. ], batch size: 67, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:36:58,692 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-20 22:37:57,972 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2692, simple_loss=0.3694, pruned_loss=0.08448, over 1796401.00 frames.
2023-06-20 22:37:57,974 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-20 22:38:04,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=15.0
2023-06-20 22:38:28,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=656964.0, ans=0.0
2023-06-20 22:38:30,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=656964.0, ans=0.125
2023-06-20 22:38:39,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=22.5
2023-06-20 22:38:43,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=657024.0, ans=0.125
2023-06-20 22:38:52,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=657024.0, ans=0.0
2023-06-20 22:38:52,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0
2023-06-20 22:38:52,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0
2023-06-20 22:38:53,060 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 2.214e+02 2.674e+02 3.135e+02 4.981e+02, threshold=5.348e+02, percent-clipped=0.0
2023-06-20 22:39:13,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=657084.0, ans=0.0
2023-06-20 22:39:36,678 INFO [train.py:996] (3/4) Epoch 4, batch 18050, loss[loss=0.2314, simple_loss=0.2932, pruned_loss=0.08479, over 21422.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.31, pruned_loss=0.08124, over 4274848.27 frames. ], batch size: 211, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:40:04,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=657264.0, ans=0.125
2023-06-20 22:40:12,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657324.0, ans=0.1
2023-06-20 22:40:36,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657324.0, ans=0.1
2023-06-20 22:40:44,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=657384.0, ans=0.125
2023-06-20 22:40:49,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657384.0, ans=0.1
2023-06-20 22:41:08,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0
2023-06-20 22:41:20,648 INFO [train.py:996] (3/4) Epoch 4, batch 18100, loss[loss=0.2542, simple_loss=0.3486, pruned_loss=0.07987, over 21621.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3145, pruned_loss=0.08318, over 4274386.49 frames. ], batch size: 414, lr: 7.86e-03, grad_scale: 32.0
2023-06-20 22:42:08,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=657624.0, ans=0.125
2023-06-20 22:42:21,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.529e+02 2.874e+02 3.352e+02 6.003e+02, threshold=5.748e+02, percent-clipped=2.0
2023-06-20 22:42:34,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=657684.0, ans=0.125
2023-06-20 22:42:58,560 INFO [train.py:996] (3/4) Epoch 4, batch 18150, loss[loss=0.2205, simple_loss=0.2924, pruned_loss=0.07436, over 21713.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3169, pruned_loss=0.08329, over 4272490.56 frames. ], batch size: 282, lr: 7.86e-03, grad_scale: 32.0
2023-06-20 22:43:18,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=657864.0, ans=0.125
2023-06-20 22:44:04,599 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=2.510e-03
2023-06-20 22:44:08,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=657984.0, ans=0.125
2023-06-20 22:44:22,891 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 22:44:36,738 INFO [train.py:996] (3/4) Epoch 4, batch 18200, loss[loss=0.2221, simple_loss=0.2855, pruned_loss=0.07934, over 21901.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3116, pruned_loss=0.08334, over 4274909.45 frames. ], batch size: 107, lr: 7.86e-03, grad_scale: 32.0
2023-06-20 22:44:57,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=658164.0, ans=10.0
2023-06-20 22:45:32,365 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 2.336e+02 2.746e+02 3.220e+02 5.960e+02, threshold=5.492e+02, percent-clipped=1.0
2023-06-20 22:46:08,034 INFO [train.py:996] (3/4) Epoch 4, batch 18250, loss[loss=0.277, simple_loss=0.377, pruned_loss=0.08856, over 19960.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3044, pruned_loss=0.081, over 4263178.78 frames. ], batch size: 702, lr: 7.86e-03, grad_scale: 16.0
2023-06-20 22:46:08,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=658404.0, ans=0.125
2023-06-20 22:46:41,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=658464.0, ans=0.0
2023-06-20 22:46:50,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=658524.0, ans=0.0
2023-06-20 22:46:59,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=658584.0, ans=0.125
2023-06-20 22:47:32,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=658644.0, ans=0.125
2023-06-20 22:47:45,376 INFO [train.py:996] (3/4) Epoch 4, batch 18300, loss[loss=0.2364, simple_loss=0.3059, pruned_loss=0.08347, over 21800.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3033, pruned_loss=0.08072, over 4260638.31 frames. ], batch size: 112, lr: 7.86e-03, grad_scale: 16.0
2023-06-20 22:47:49,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=12.0
2023-06-20 22:47:56,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=658704.0, ans=0.0
2023-06-20 22:48:36,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.526e+02 2.841e+02 3.430e+02 6.700e+02, threshold=5.681e+02, percent-clipped=2.0
2023-06-20 22:48:55,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=658884.0, ans=0.0
2023-06-20 22:48:55,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=658884.0, ans=0.0
2023-06-20 22:49:17,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=658944.0, ans=0.0
2023-06-20 22:49:22,773 INFO [train.py:996] (3/4) Epoch 4, batch 18350, loss[loss=0.2293, simple_loss=0.2919, pruned_loss=0.08332, over 21269.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3093, pruned_loss=0.08078, over 4268095.92 frames. ], batch size: 144, lr: 7.85e-03, grad_scale: 16.0
2023-06-20 22:50:42,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=659244.0, ans=0.2
2023-06-20 22:50:45,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=659244.0, ans=0.04949747468305833
2023-06-20 22:50:53,554 INFO [train.py:996] (3/4) Epoch 4, batch 18400, loss[loss=0.189, simple_loss=0.2731, pruned_loss=0.05243, over 21541.00 frames. ], tot_loss[loss=0.232, simple_loss=0.305, pruned_loss=0.07945, over 4270339.96 frames. ], batch size: 230, lr: 7.85e-03, grad_scale: 32.0
2023-06-20 22:52:08,247 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.306e+02 2.639e+02 3.073e+02 4.656e+02, threshold=5.278e+02, percent-clipped=0.0
2023-06-20 22:52:21,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=659484.0, ans=0.0
2023-06-20 22:52:37,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=659544.0, ans=0.2
2023-06-20 22:52:43,447 INFO [train.py:996] (3/4) Epoch 4, batch 18450, loss[loss=0.1984, simple_loss=0.2725, pruned_loss=0.06218, over 21690.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3009, pruned_loss=0.07583, over 4266843.22 frames. ], batch size: 298, lr: 7.85e-03, grad_scale: 16.0
2023-06-20 22:53:05,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=659664.0, ans=0.0
2023-06-20 22:53:13,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=659664.0, ans=0.0
2023-06-20 22:53:25,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.03 vs. limit=15.0
2023-06-20 22:54:03,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.19 vs. limit=22.5
2023-06-20 22:54:17,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=659844.0, ans=0.125
2023-06-20 22:54:19,822 INFO [train.py:996] (3/4) Epoch 4, batch 18500, loss[loss=0.218, simple_loss=0.2741, pruned_loss=0.08099, over 21763.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2967, pruned_loss=0.07491, over 4258014.35 frames. ], batch size: 118, lr: 7.85e-03, grad_scale: 16.0
2023-06-20 22:54:50,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. limit=6.0
2023-06-20 22:55:22,936 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 2.377e+02 2.788e+02 3.386e+02 5.871e+02, threshold=5.576e+02, percent-clipped=1.0
2023-06-20 22:55:47,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=660144.0, ans=0.0
2023-06-20 22:55:56,914 INFO [train.py:996] (3/4) Epoch 4, batch 18550, loss[loss=0.211, simple_loss=0.28, pruned_loss=0.07105, over 21191.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2968, pruned_loss=0.07448, over 4253634.29 frames.
], batch size: 176, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 22:56:03,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=660204.0, ans=0.2 2023-06-20 22:56:32,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=660264.0, ans=0.1 2023-06-20 22:56:40,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=660324.0, ans=0.125 2023-06-20 22:57:20,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=660384.0, ans=0.2 2023-06-20 22:57:25,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-20 22:57:28,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=660444.0, ans=0.125 2023-06-20 22:57:45,991 INFO [train.py:996] (3/4) Epoch 4, batch 18600, loss[loss=0.2723, simple_loss=0.3337, pruned_loss=0.1055, over 21292.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2951, pruned_loss=0.07523, over 4259350.95 frames. ], batch size: 471, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 22:57:46,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=660504.0, ans=0.2 2023-06-20 22:57:59,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.09 vs. limit=15.0 2023-06-20 22:57:59,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.46 vs. limit=12.0 2023-06-20 22:58:10,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=660564.0, ans=0.0 2023-06-20 22:58:24,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=660624.0, ans=0.1 2023-06-20 22:58:42,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.410e+02 2.851e+02 3.256e+02 5.084e+02, threshold=5.701e+02, percent-clipped=0.0 2023-06-20 22:58:51,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=660684.0, ans=0.1 2023-06-20 22:58:54,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=660684.0, ans=10.0 2023-06-20 22:59:07,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=660744.0, ans=0.125 2023-06-20 22:59:15,908 INFO [train.py:996] (3/4) Epoch 4, batch 18650, loss[loss=0.2187, simple_loss=0.2796, pruned_loss=0.07885, over 20233.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2946, pruned_loss=0.0761, over 4260586.76 frames. 
], batch size: 703, lr: 7.84e-03, grad_scale: 16.0 2023-06-20 22:59:41,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=660864.0, ans=0.125 2023-06-20 23:00:15,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=660924.0, ans=0.125 2023-06-20 23:00:43,192 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:00:43,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=660984.0, ans=0.0 2023-06-20 23:01:02,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=661104.0, ans=0.0 2023-06-20 23:01:02,988 INFO [train.py:996] (3/4) Epoch 4, batch 18700, loss[loss=0.2148, simple_loss=0.2627, pruned_loss=0.08345, over 20308.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2926, pruned_loss=0.07778, over 4267133.01 frames. ], batch size: 703, lr: 7.84e-03, grad_scale: 16.0 2023-06-20 23:01:32,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=661164.0, ans=0.2 2023-06-20 23:01:33,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=661164.0, ans=0.2 2023-06-20 23:02:02,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=661224.0, ans=0.0 2023-06-20 23:02:25,371 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.377e+02 2.645e+02 3.014e+02 5.356e+02, threshold=5.290e+02, percent-clipped=0.0 2023-06-20 23:02:45,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.90 vs. limit=10.0 2023-06-20 23:03:02,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=661404.0, ans=0.125 2023-06-20 23:03:03,478 INFO [train.py:996] (3/4) Epoch 4, batch 18750, loss[loss=0.2667, simple_loss=0.3377, pruned_loss=0.09786, over 21762.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2953, pruned_loss=0.08033, over 4261593.80 frames. ], batch size: 298, lr: 7.84e-03, grad_scale: 16.0 2023-06-20 23:04:39,198 INFO [train.py:996] (3/4) Epoch 4, batch 18800, loss[loss=0.2281, simple_loss=0.3128, pruned_loss=0.07165, over 21640.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.301, pruned_loss=0.08093, over 4260783.51 frames. ], batch size: 414, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 23:04:39,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=661704.0, ans=0.125 2023-06-20 23:05:36,863 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5 2023-06-20 23:05:45,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 2.344e+02 2.808e+02 3.404e+02 5.687e+02, threshold=5.616e+02, percent-clipped=4.0 2023-06-20 23:06:13,792 INFO [train.py:996] (3/4) Epoch 4, batch 18850, loss[loss=0.2062, simple_loss=0.2605, pruned_loss=0.07596, over 21314.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2975, pruned_loss=0.07633, over 4247441.90 frames. 
], batch size: 144, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 23:07:21,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=662184.0, ans=0.125 2023-06-20 23:07:41,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-20 23:07:51,149 INFO [train.py:996] (3/4) Epoch 4, batch 18900, loss[loss=0.2147, simple_loss=0.2553, pruned_loss=0.087, over 20312.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2933, pruned_loss=0.07598, over 4255114.15 frames. ], batch size: 703, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 23:07:54,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=662304.0, ans=0.0 2023-06-20 23:08:14,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-20 23:08:48,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 2.224e+02 2.639e+02 3.131e+02 5.534e+02, threshold=5.278e+02, percent-clipped=0.0 2023-06-20 23:08:51,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-20 23:09:12,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=662544.0, ans=0.0 2023-06-20 23:09:28,923 INFO [train.py:996] (3/4) Epoch 4, batch 18950, loss[loss=0.1892, simple_loss=0.2657, pruned_loss=0.05633, over 20807.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2953, pruned_loss=0.07827, over 4264115.85 frames. ], batch size: 608, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 23:09:29,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=662604.0, ans=0.1 2023-06-20 23:09:51,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=662664.0, ans=0.125 2023-06-20 23:10:14,021 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-20 23:10:53,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=662784.0, ans=6.0 2023-06-20 23:11:19,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.42 vs. limit=15.0 2023-06-20 23:11:22,820 INFO [train.py:996] (3/4) Epoch 4, batch 19000, loss[loss=0.262, simple_loss=0.3442, pruned_loss=0.08991, over 21718.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3058, pruned_loss=0.08099, over 4267530.98 frames. ], batch size: 351, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 23:11:39,882 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:11:47,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. 
limit=15.0 2023-06-20 23:12:11,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=663024.0, ans=0.125 2023-06-20 23:12:25,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.761e+02 3.068e+02 3.870e+02 6.410e+02, threshold=6.137e+02, percent-clipped=5.0 2023-06-20 23:12:40,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=663144.0, ans=0.125 2023-06-20 23:12:47,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=663144.0, ans=0.05 2023-06-20 23:12:59,009 INFO [train.py:996] (3/4) Epoch 4, batch 19050, loss[loss=0.2505, simple_loss=0.3211, pruned_loss=0.08998, over 21884.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3106, pruned_loss=0.0852, over 4275365.25 frames. ], batch size: 118, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 23:13:28,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=663324.0, ans=0.1 2023-06-20 23:14:11,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=663384.0, ans=0.07 2023-06-20 23:14:35,609 INFO [train.py:996] (3/4) Epoch 4, batch 19100, loss[loss=0.227, simple_loss=0.2823, pruned_loss=0.08587, over 21143.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3093, pruned_loss=0.08561, over 4269562.37 frames. ], batch size: 143, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 23:14:44,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=663504.0, ans=0.0 2023-06-20 23:15:24,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-20 23:15:40,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=663624.0, ans=0.125 2023-06-20 23:15:46,366 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.597e+02 2.867e+02 3.510e+02 4.906e+02, threshold=5.733e+02, percent-clipped=0.0 2023-06-20 23:16:19,234 INFO [train.py:996] (3/4) Epoch 4, batch 19150, loss[loss=0.3143, simple_loss=0.4112, pruned_loss=0.1087, over 21157.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3137, pruned_loss=0.08719, over 4267105.72 frames. 
], batch size: 548, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 23:16:39,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=663804.0, ans=0.0 2023-06-20 23:16:40,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=663804.0, ans=0.125 2023-06-20 23:17:13,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=663864.0, ans=0.0 2023-06-20 23:17:23,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=663924.0, ans=0.125 2023-06-20 23:17:47,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=663984.0, ans=0.125 2023-06-20 23:17:58,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=664044.0, ans=0.0 2023-06-20 23:18:00,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=664044.0, ans=0.125 2023-06-20 23:18:09,658 INFO [train.py:996] (3/4) Epoch 4, batch 19200, loss[loss=0.2485, simple_loss=0.3397, pruned_loss=0.07859, over 21231.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3234, pruned_loss=0.0886, over 4257879.18 frames. ], batch size: 143, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:18:13,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=664104.0, ans=0.0 2023-06-20 23:18:14,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=664104.0, ans=0.125 2023-06-20 23:18:46,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=664164.0, ans=0.5 2023-06-20 23:18:47,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=664164.0, ans=0.2 2023-06-20 23:18:49,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.95 vs. limit=22.5 2023-06-20 23:19:19,304 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 2.481e+02 2.914e+02 3.556e+02 7.085e+02, threshold=5.828e+02, percent-clipped=2.0 2023-06-20 23:19:21,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=664284.0, ans=0.2 2023-06-20 23:19:22,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=664284.0, ans=0.2 2023-06-20 23:19:23,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-20 23:19:54,528 INFO [train.py:996] (3/4) Epoch 4, batch 19250, loss[loss=0.2128, simple_loss=0.2944, pruned_loss=0.0656, over 21700.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3217, pruned_loss=0.08243, over 4255424.50 frames. 
], batch size: 230, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:21:05,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=664584.0, ans=0.04949747468305833 2023-06-20 23:21:42,203 INFO [train.py:996] (3/4) Epoch 4, batch 19300, loss[loss=0.23, simple_loss=0.3122, pruned_loss=0.07385, over 21513.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3191, pruned_loss=0.08215, over 4260976.40 frames. ], batch size: 471, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:22:10,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.49 vs. limit=22.5 2023-06-20 23:22:49,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.366e+02 2.667e+02 3.289e+02 5.476e+02, threshold=5.333e+02, percent-clipped=0.0 2023-06-20 23:23:22,584 INFO [train.py:996] (3/4) Epoch 4, batch 19350, loss[loss=0.2545, simple_loss=0.3382, pruned_loss=0.08536, over 21549.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3155, pruned_loss=0.07945, over 4269451.45 frames. ], batch size: 473, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:23:48,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-20 23:24:19,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=665184.0, ans=0.0 2023-06-20 23:24:20,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-20 23:24:41,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=665244.0, ans=0.1 2023-06-20 23:24:46,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=665304.0, ans=0.125 2023-06-20 23:24:47,894 INFO [train.py:996] (3/4) Epoch 4, batch 19400, loss[loss=0.2513, simple_loss=0.3187, pruned_loss=0.092, over 21873.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3108, pruned_loss=0.07737, over 4273970.66 frames. ], batch size: 414, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:24:56,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=665304.0, ans=0.0 2023-06-20 23:25:07,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. 
limit=15.0 2023-06-20 23:25:15,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=665304.0, ans=0.125 2023-06-20 23:25:22,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=665364.0, ans=0.125 2023-06-20 23:25:24,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=665364.0, ans=0.1 2023-06-20 23:25:28,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=665364.0, ans=0.0 2023-06-20 23:25:57,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 2.319e+02 2.845e+02 3.666e+02 6.009e+02, threshold=5.690e+02, percent-clipped=2.0 2023-06-20 23:26:29,720 INFO [train.py:996] (3/4) Epoch 4, batch 19450, loss[loss=0.2203, simple_loss=0.2698, pruned_loss=0.08546, over 21322.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3091, pruned_loss=0.07947, over 4282806.43 frames. ], batch size: 548, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:26:47,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=665604.0, ans=0.0 2023-06-20 23:26:48,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=665664.0, ans=0.125 2023-06-20 23:26:53,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=665664.0, ans=0.1 2023-06-20 23:28:18,732 INFO [train.py:996] (3/4) Epoch 4, batch 19500, loss[loss=0.2237, simple_loss=0.2971, pruned_loss=0.07515, over 21663.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3054, pruned_loss=0.08071, over 4276957.39 frames. ], batch size: 332, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 23:28:36,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-20 23:29:07,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=666024.0, ans=0.125 2023-06-20 23:29:12,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=666024.0, ans=0.125 2023-06-20 23:29:20,540 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.739e+02 3.335e+02 3.880e+02 8.921e+02, threshold=6.671e+02, percent-clipped=3.0 2023-06-20 23:29:28,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=666084.0, ans=0.1 2023-06-20 23:29:56,763 INFO [train.py:996] (3/4) Epoch 4, batch 19550, loss[loss=0.1937, simple_loss=0.2932, pruned_loss=0.04712, over 21700.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3008, pruned_loss=0.07901, over 4278621.56 frames. 
], batch size: 298, lr: 7.81e-03, grad_scale: 16.0 2023-06-20 23:30:31,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=666324.0, ans=0.0 2023-06-20 23:30:33,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=666324.0, ans=0.0 2023-06-20 23:30:57,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=666384.0, ans=0.1 2023-06-20 23:31:38,468 INFO [train.py:996] (3/4) Epoch 4, batch 19600, loss[loss=0.1591, simple_loss=0.227, pruned_loss=0.04555, over 17614.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3023, pruned_loss=0.07953, over 4278269.62 frames. ], batch size: 60, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 23:31:59,276 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:32:03,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=666564.0, ans=0.0 2023-06-20 23:32:24,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=666624.0, ans=0.125 2023-06-20 23:32:32,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.820e+02 2.560e+02 2.903e+02 3.359e+02 5.140e+02, threshold=5.805e+02, percent-clipped=0.0 2023-06-20 23:32:49,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-20 23:32:53,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=666744.0, ans=0.1 2023-06-20 23:33:02,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-20 23:33:14,399 INFO [train.py:996] (3/4) Epoch 4, batch 19650, loss[loss=0.2703, simple_loss=0.3302, pruned_loss=0.1052, over 21660.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3074, pruned_loss=0.08373, over 4277073.64 frames. ], batch size: 389, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 23:33:21,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=666804.0, ans=0.0 2023-06-20 23:33:27,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=666804.0, ans=0.015 2023-06-20 23:33:38,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-20 23:34:39,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=667044.0, ans=0.125 2023-06-20 23:34:47,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=667044.0, ans=0.2 2023-06-20 23:34:52,421 INFO [train.py:996] (3/4) Epoch 4, batch 19700, loss[loss=0.2123, simple_loss=0.2862, pruned_loss=0.06921, over 21587.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3121, pruned_loss=0.08561, over 4274811.03 frames. 
], batch size: 230, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 23:35:11,460 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:35:48,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=667224.0, ans=0.125 2023-06-20 23:36:17,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=667284.0, ans=0.125 2023-06-20 23:36:21,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.712e+02 3.222e+02 3.780e+02 5.761e+02, threshold=6.445e+02, percent-clipped=0.0 2023-06-20 23:36:30,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=667284.0, ans=0.125 2023-06-20 23:36:46,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=667344.0, ans=0.1 2023-06-20 23:36:52,634 INFO [train.py:996] (3/4) Epoch 4, batch 19750, loss[loss=0.3587, simple_loss=0.4141, pruned_loss=0.1516, over 21546.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3178, pruned_loss=0.08598, over 4260284.32 frames. ], batch size: 507, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 23:37:05,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=667404.0, ans=0.2 2023-06-20 23:38:39,850 INFO [train.py:996] (3/4) Epoch 4, batch 19800, loss[loss=0.2215, simple_loss=0.2947, pruned_loss=0.07412, over 21828.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3177, pruned_loss=0.08631, over 4261766.03 frames. ], batch size: 316, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:38:44,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=667704.0, ans=0.125 2023-06-20 23:39:02,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-20 23:39:18,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=15.0 2023-06-20 23:39:40,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=667824.0, ans=0.0 2023-06-20 23:39:57,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=667884.0, ans=0.125 2023-06-20 23:40:03,067 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.339e+02 2.763e+02 3.291e+02 5.461e+02, threshold=5.526e+02, percent-clipped=0.0 2023-06-20 23:40:30,185 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:40:34,090 INFO [train.py:996] (3/4) Epoch 4, batch 19850, loss[loss=0.2475, simple_loss=0.3091, pruned_loss=0.09297, over 19878.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3113, pruned_loss=0.08194, over 4251725.45 frames. 
], batch size: 702, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:40:37,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=668004.0, ans=0.025 2023-06-20 23:41:05,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=668064.0, ans=0.125 2023-06-20 23:41:06,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=22.5 2023-06-20 23:41:07,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-20 23:42:07,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-20 23:42:11,833 INFO [train.py:996] (3/4) Epoch 4, batch 19900, loss[loss=0.2024, simple_loss=0.2698, pruned_loss=0.06745, over 21244.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3121, pruned_loss=0.07947, over 4248325.49 frames. ], batch size: 176, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:42:28,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=668304.0, ans=0.125 2023-06-20 23:42:37,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=668364.0, ans=0.0 2023-06-20 23:42:47,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=668364.0, ans=0.125 2023-06-20 23:42:55,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=668424.0, ans=0.125 2023-06-20 23:42:55,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=668424.0, ans=0.2 2023-06-20 23:43:18,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.481e+02 3.021e+02 3.710e+02 5.505e+02, threshold=6.042e+02, percent-clipped=0.0 2023-06-20 23:43:23,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=668484.0, ans=0.0 2023-06-20 23:43:36,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=668544.0, ans=0.0 2023-06-20 23:43:49,812 INFO [train.py:996] (3/4) Epoch 4, batch 19950, loss[loss=0.2365, simple_loss=0.2939, pruned_loss=0.08955, over 21803.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3059, pruned_loss=0.07926, over 4255522.57 frames. ], batch size: 102, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:44:48,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=668784.0, ans=0.1 2023-06-20 23:45:38,930 INFO [train.py:996] (3/4) Epoch 4, batch 20000, loss[loss=0.2545, simple_loss=0.3129, pruned_loss=0.098, over 21410.00 frames. ], tot_loss[loss=0.234, simple_loss=0.308, pruned_loss=0.07998, over 4262238.41 frames. 
], batch size: 143, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:45:40,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=668904.0, ans=0.0 2023-06-20 23:45:43,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=668904.0, ans=0.04949747468305833 2023-06-20 23:45:44,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-06-20 23:45:52,123 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:46:27,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=669024.0, ans=0.0 2023-06-20 23:46:52,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.484e+02 2.763e+02 3.251e+02 5.110e+02, threshold=5.527e+02, percent-clipped=0.0 2023-06-20 23:47:10,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=669144.0, ans=0.0 2023-06-20 23:47:17,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=669144.0, ans=0.125 2023-06-20 23:47:22,969 INFO [train.py:996] (3/4) Epoch 4, batch 20050, loss[loss=0.2315, simple_loss=0.3039, pruned_loss=0.07954, over 21803.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3091, pruned_loss=0.08228, over 4262539.39 frames. ], batch size: 298, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:48:01,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=669264.0, ans=0.1 2023-06-20 23:48:53,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-20 23:49:06,456 INFO [train.py:996] (3/4) Epoch 4, batch 20100, loss[loss=0.2385, simple_loss=0.3121, pruned_loss=0.08246, over 21906.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3119, pruned_loss=0.08477, over 4269685.89 frames. ], batch size: 316, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:49:17,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=669504.0, ans=0.125 2023-06-20 23:49:28,938 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:49:46,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=669624.0, ans=0.125 2023-06-20 23:49:55,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=669624.0, ans=0.125 2023-06-20 23:50:03,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. 
limit=6.0 2023-06-20 23:50:05,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=669624.0, ans=0.2 2023-06-20 23:50:12,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.516e+02 2.838e+02 3.321e+02 5.094e+02, threshold=5.677e+02, percent-clipped=0.0 2023-06-20 23:50:44,841 INFO [train.py:996] (3/4) Epoch 4, batch 20150, loss[loss=0.2925, simple_loss=0.3604, pruned_loss=0.1123, over 21478.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3218, pruned_loss=0.0882, over 4269242.84 frames. ], batch size: 131, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:51:35,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=669864.0, ans=0.125 2023-06-20 23:51:38,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=669864.0, ans=0.2 2023-06-20 23:52:20,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=669984.0, ans=0.125 2023-06-20 23:53:08,049 INFO [train.py:996] (3/4) Epoch 4, batch 20200, loss[loss=0.2606, simple_loss=0.3548, pruned_loss=0.08321, over 21676.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3288, pruned_loss=0.09178, over 4272357.60 frames. ], batch size: 247, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:54:20,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.584e+02 3.030e+02 3.747e+02 5.271e+02, threshold=6.060e+02, percent-clipped=0.0 2023-06-20 23:54:20,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=670284.0, ans=0.0 2023-06-20 23:54:58,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=670404.0, ans=0.0 2023-06-20 23:55:04,485 INFO [train.py:996] (3/4) Epoch 4, batch 20250, loss[loss=0.2347, simple_loss=0.3213, pruned_loss=0.07407, over 21662.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.328, pruned_loss=0.09035, over 4270750.86 frames. ], batch size: 389, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:55:13,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-20 23:55:50,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.26 vs. limit=15.0 2023-06-20 23:55:55,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=670524.0, ans=0.09899494936611666 2023-06-20 23:56:29,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=670584.0, ans=0.1 2023-06-20 23:56:36,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=670644.0, ans=0.125 2023-06-20 23:56:58,675 INFO [train.py:996] (3/4) Epoch 4, batch 20300, loss[loss=0.1914, simple_loss=0.2468, pruned_loss=0.06806, over 16176.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3247, pruned_loss=0.08696, over 4264352.22 frames. ], batch size: 61, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:57:22,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. 
limit=15.0 2023-06-20 23:57:53,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.338e+02 2.634e+02 2.984e+02 5.802e+02, threshold=5.268e+02, percent-clipped=0.0 2023-06-20 23:58:07,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=670884.0, ans=0.125 2023-06-20 23:58:08,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=670944.0, ans=0.125 2023-06-20 23:58:12,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=670944.0, ans=0.0 2023-06-20 23:58:29,866 INFO [train.py:996] (3/4) Epoch 4, batch 20350, loss[loss=0.2745, simple_loss=0.3301, pruned_loss=0.1095, over 21317.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.324, pruned_loss=0.08708, over 4261243.79 frames. ], batch size: 159, lr: 7.78e-03, grad_scale: 32.0 2023-06-20 23:58:49,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=671064.0, ans=0.0 2023-06-20 23:59:08,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=671124.0, ans=0.125 2023-06-20 23:59:14,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.07 vs. limit=12.0 2023-06-20 23:59:31,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=671184.0, ans=0.0 2023-06-20 23:59:47,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=671244.0, ans=0.2 2023-06-21 00:00:05,656 INFO [train.py:996] (3/4) Epoch 4, batch 20400, loss[loss=0.3695, simple_loss=0.4083, pruned_loss=0.1654, over 21408.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3271, pruned_loss=0.09039, over 4263223.49 frames. ], batch size: 508, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:00:21,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-06-21 00:00:29,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671364.0, ans=0.1 2023-06-21 00:00:54,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=671424.0, ans=0.1 2023-06-21 00:00:56,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=671424.0, ans=0.0 2023-06-21 00:01:03,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=671484.0, ans=0.0 2023-06-21 00:01:05,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.821e+02 3.170e+02 3.737e+02 5.615e+02, threshold=6.339e+02, percent-clipped=2.0 2023-06-21 00:01:47,139 INFO [train.py:996] (3/4) Epoch 4, batch 20450, loss[loss=0.2133, simple_loss=0.2792, pruned_loss=0.07366, over 16305.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3291, pruned_loss=0.09331, over 4252202.91 frames. 
], batch size: 61, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:01:47,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=671604.0, ans=0.125 2023-06-21 00:01:49,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671604.0, ans=0.1 2023-06-21 00:01:53,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=671604.0, ans=0.0 2023-06-21 00:02:06,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=671664.0, ans=0.2 2023-06-21 00:02:19,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=671724.0, ans=0.125 2023-06-21 00:02:36,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671784.0, ans=0.1 2023-06-21 00:02:40,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=671784.0, ans=0.1 2023-06-21 00:03:16,180 INFO [train.py:996] (3/4) Epoch 4, batch 20500, loss[loss=0.2548, simple_loss=0.3073, pruned_loss=0.1012, over 21316.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3244, pruned_loss=0.0931, over 4259682.86 frames. ], batch size: 159, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:03:23,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=671904.0, ans=0.5 2023-06-21 00:03:50,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=672024.0, ans=0.0 2023-06-21 00:03:53,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=672024.0, ans=0.125 2023-06-21 00:04:14,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.579e+02 2.936e+02 3.494e+02 5.643e+02, threshold=5.872e+02, percent-clipped=0.0 2023-06-21 00:04:48,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0 2023-06-21 00:04:56,150 INFO [train.py:996] (3/4) Epoch 4, batch 20550, loss[loss=0.2088, simple_loss=0.2729, pruned_loss=0.07236, over 21769.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3161, pruned_loss=0.09059, over 4258703.67 frames. ], batch size: 351, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:04:58,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=672204.0, ans=0.125 2023-06-21 00:05:02,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=672204.0, ans=0.0 2023-06-21 00:05:25,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0 2023-06-21 00:06:41,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=672504.0, ans=0.125 2023-06-21 00:06:42,188 INFO [train.py:996] (3/4) Epoch 4, batch 20600, loss[loss=0.263, simple_loss=0.3248, pruned_loss=0.1006, over 21745.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3172, pruned_loss=0.08768, over 4259287.73 frames. 
], batch size: 441, lr: 7.78e-03, grad_scale: 32.0
2023-06-21 00:07:09,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=672564.0, ans=0.0
2023-06-21 00:07:42,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.517e+02 2.771e+02 3.210e+02 6.753e+02, threshold=5.541e+02, percent-clipped=2.0
2023-06-21 00:07:51,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=672744.0, ans=0.125
2023-06-21 00:08:03,159 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 00:08:18,045 INFO [train.py:996] (3/4) Epoch 4, batch 20650, loss[loss=0.2073, simple_loss=0.2645, pruned_loss=0.07503, over 21153.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.314, pruned_loss=0.08854, over 4262701.53 frames. ], batch size: 159, lr: 7.77e-03, grad_scale: 32.0
2023-06-21 00:08:36,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=672864.0, ans=0.125
2023-06-21 00:08:38,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0
2023-06-21 00:08:46,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0
2023-06-21 00:08:55,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=672924.0, ans=0.1
2023-06-21 00:09:51,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=673044.0, ans=0.2
2023-06-21 00:09:55,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=673104.0, ans=0.025
2023-06-21 00:09:56,824 INFO [train.py:996] (3/4) Epoch 4, batch 20700, loss[loss=0.2242, simple_loss=0.2985, pruned_loss=0.07499, over 21573.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3051, pruned_loss=0.08437, over 4266841.46 frames. ], batch size: 441, lr: 7.77e-03, grad_scale: 32.0
2023-06-21 00:10:05,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=673104.0, ans=0.0
2023-06-21 00:10:55,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=673284.0, ans=0.125
2023-06-21 00:11:07,545 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.300e+02 2.582e+02 3.075e+02 5.238e+02, threshold=5.163e+02, percent-clipped=0.0
2023-06-21 00:11:37,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=673344.0, ans=0.125
2023-06-21 00:11:40,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=673344.0, ans=0.125
2023-06-21 00:11:44,526 INFO [train.py:996] (3/4) Epoch 4, batch 20750, loss[loss=0.1834, simple_loss=0.2466, pruned_loss=0.06007, over 21787.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3096, pruned_loss=0.08357, over 4268977.93 frames. ], batch size: 118, lr: 7.77e-03, grad_scale: 32.0
2023-06-21 00:12:38,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0
2023-06-21 00:13:02,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=673584.0, ans=0.125
2023-06-21 00:13:04,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5
2023-06-21 00:13:14,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=673644.0, ans=0.5
2023-06-21 00:13:21,223 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0
2023-06-21 00:13:27,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=673704.0, ans=0.0
2023-06-21 00:13:28,254 INFO [train.py:996] (3/4) Epoch 4, batch 20800, loss[loss=0.2125, simple_loss=0.2752, pruned_loss=0.07492, over 21619.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3128, pruned_loss=0.08492, over 4271006.09 frames. ], batch size: 282, lr: 7.77e-03, grad_scale: 32.0
2023-06-21 00:13:44,725 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 00:14:09,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=673824.0, ans=0.125
2023-06-21 00:14:39,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 2.643e+02 3.014e+02 3.715e+02 5.359e+02, threshold=6.029e+02, percent-clipped=3.0
2023-06-21 00:15:04,248 INFO [train.py:996] (3/4) Epoch 4, batch 20850, loss[loss=0.2307, simple_loss=0.2834, pruned_loss=0.08897, over 21187.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3036, pruned_loss=0.08233, over 4260465.74 frames. ], batch size: 607, lr: 7.77e-03, grad_scale: 32.0
2023-06-21 00:15:14,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=674004.0, ans=0.2
2023-06-21 00:15:27,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0
2023-06-21 00:16:54,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=674244.0, ans=0.0
2023-06-21 00:16:58,059 INFO [train.py:996] (3/4) Epoch 4, batch 20900, loss[loss=0.2215, simple_loss=0.2969, pruned_loss=0.07307, over 21569.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3054, pruned_loss=0.0838, over 4271474.34 frames. ], batch size: 230, lr: 7.77e-03, grad_scale: 32.0
2023-06-21 00:17:01,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0
2023-06-21 00:17:52,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=674484.0, ans=0.0
2023-06-21 00:17:58,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.368e+02 2.710e+02 3.422e+02 5.410e+02, threshold=5.420e+02, percent-clipped=0.0
2023-06-21 00:18:19,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=674544.0, ans=0.125
2023-06-21 00:18:33,408 INFO [train.py:996] (3/4) Epoch 4, batch 20950, loss[loss=0.2547, simple_loss=0.3187, pruned_loss=0.09533, over 21510.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3017, pruned_loss=0.0801, over 4279226.56 frames. ], batch size: 471, lr: 7.76e-03, grad_scale: 32.0
2023-06-21 00:19:02,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0
2023-06-21 00:19:09,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.60 vs. limit=15.0
2023-06-21 00:19:09,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=674724.0, ans=0.0
2023-06-21 00:19:11,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=674724.0, ans=0.125
2023-06-21 00:19:11,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.24 vs. limit=6.0
2023-06-21 00:19:14,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0
2023-06-21 00:19:22,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=674724.0, ans=0.125
2023-06-21 00:19:28,587 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0
2023-06-21 00:19:30,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=15.0
2023-06-21 00:19:36,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5
2023-06-21 00:19:39,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=674784.0, ans=0.1
2023-06-21 00:20:07,636 INFO [train.py:996] (3/4) Epoch 4, batch 21000, loss[loss=0.2217, simple_loss=0.2921, pruned_loss=0.07561, over 21966.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2997, pruned_loss=0.07999, over 4267810.42 frames. ], batch size: 316, lr: 7.76e-03, grad_scale: 32.0
2023-06-21 00:20:07,637 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 00:20:59,671 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2681, simple_loss=0.367, pruned_loss=0.0846, over 1796401.00 frames.
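The entries immediately above and below mark a periodic validation pass: training pauses at batch 21000, the validation loss is computed over the fixed 1796401-frame dev set, and the "Maximum memory allocated so far" line just below reports the peak CUDA memory observed to that point. A minimal sketch of how such a peak-memory line can be produced with standard PyTorch APIs (the helper name report_peak_memory and the logging setup are illustrative assumptions, not code from this recipe):

import logging

import torch

def report_peak_memory(device: torch.device) -> None:
    # Peak bytes ever allocated on this device since program start
    # (or since the last torch.cuda.reset_peak_memory_stats() call).
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    logging.info(f"Maximum memory allocated so far is {peak_mb}MB")

Because the counter only grows within a run, the same 23918MB figure reappears at the Epoch 4, batch 24000 validation pass near the end of this excerpt.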
2023-06-21 00:20:59,672 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-21 00:21:43,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=675024.0, ans=0.0
2023-06-21 00:21:44,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=675024.0, ans=0.0
2023-06-21 00:22:00,274 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.300e+02 2.574e+02 2.981e+02 4.103e+02, threshold=5.148e+02, percent-clipped=0.0
2023-06-21 00:22:18,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=675084.0, ans=0.0
2023-06-21 00:22:35,781 INFO [train.py:996] (3/4) Epoch 4, batch 21050, loss[loss=0.2383, simple_loss=0.2937, pruned_loss=0.09138, over 21576.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2975, pruned_loss=0.0804, over 4252636.09 frames. ], batch size: 414, lr: 7.76e-03, grad_scale: 32.0
2023-06-21 00:22:41,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0
2023-06-21 00:23:18,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=675324.0, ans=0.0
2023-06-21 00:23:22,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=675324.0, ans=0.2
2023-06-21 00:23:28,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=675324.0, ans=0.2
2023-06-21 00:23:48,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=675384.0, ans=0.125
2023-06-21 00:24:08,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.24 vs. limit=15.0
2023-06-21 00:24:14,090 INFO [train.py:996] (3/4) Epoch 4, batch 21100, loss[loss=0.1957, simple_loss=0.2631, pruned_loss=0.06411, over 21386.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2947, pruned_loss=0.08, over 4251911.67 frames. ], batch size: 194, lr: 7.76e-03, grad_scale: 32.0
2023-06-21 00:24:39,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0
2023-06-21 00:25:04,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=675624.0, ans=0.0
2023-06-21 00:25:19,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=15.0
2023-06-21 00:25:25,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=675684.0, ans=0.0
2023-06-21 00:25:28,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.453e+02 2.744e+02 3.185e+02 4.554e+02, threshold=5.489e+02, percent-clipped=0.0
2023-06-21 00:25:49,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=22.5
2023-06-21 00:25:59,867 INFO [train.py:996] (3/4) Epoch 4, batch 21150, loss[loss=0.2234, simple_loss=0.2837, pruned_loss=0.08154, over 15427.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2911, pruned_loss=0.08027, over 4233398.29 frames. ], batch size: 60, lr: 7.76e-03, grad_scale: 32.0
2023-06-21 00:26:46,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=675924.0, ans=0.1
2023-06-21 00:27:03,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.01 vs. limit=6.0
2023-06-21 00:27:08,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=675984.0, ans=0.125
2023-06-21 00:27:19,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=675984.0, ans=15.0
2023-06-21 00:27:40,828 INFO [train.py:996] (3/4) Epoch 4, batch 21200, loss[loss=0.1912, simple_loss=0.2687, pruned_loss=0.05683, over 21748.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2872, pruned_loss=0.0796, over 4235261.24 frames. ], batch size: 351, lr: 7.76e-03, grad_scale: 32.0
2023-06-21 00:27:45,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=676104.0, ans=0.125
2023-06-21 00:27:47,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=676104.0, ans=0.125
2023-06-21 00:28:13,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.34 vs. limit=15.0
2023-06-21 00:28:41,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.500e+02 2.845e+02 3.451e+02 6.035e+02, threshold=5.690e+02, percent-clipped=1.0
2023-06-21 00:29:19,083 INFO [train.py:996] (3/4) Epoch 4, batch 21250, loss[loss=0.2073, simple_loss=0.2822, pruned_loss=0.06615, over 21602.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2877, pruned_loss=0.07972, over 4238087.75 frames. ], batch size: 263, lr: 7.75e-03, grad_scale: 32.0
2023-06-21 00:29:25,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=676404.0, ans=0.05
2023-06-21 00:29:43,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0
2023-06-21 00:29:45,882 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 00:30:55,552 INFO [train.py:996] (3/4) Epoch 4, batch 21300, loss[loss=0.2424, simple_loss=0.3152, pruned_loss=0.08484, over 21592.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2933, pruned_loss=0.08163, over 4251267.56 frames. ], batch size: 230, lr: 7.75e-03, grad_scale: 32.0
2023-06-21 00:31:02,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=676704.0, ans=0.0
2023-06-21 00:31:24,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=676764.0, ans=0.0
2023-06-21 00:32:01,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.763e+02 3.058e+02 3.413e+02 5.762e+02, threshold=6.115e+02, percent-clipped=1.0
2023-06-21 00:32:22,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=676944.0, ans=0.125
2023-06-21 00:32:28,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=676944.0, ans=0.1
2023-06-21 00:32:32,306 INFO [train.py:996] (3/4) Epoch 4, batch 21350, loss[loss=0.2917, simple_loss=0.3585, pruned_loss=0.1125, over 21468.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.2978, pruned_loss=0.08195, over 4265560.19 frames. ], batch size: 507, lr: 7.75e-03, grad_scale: 16.0
2023-06-21 00:32:35,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=677004.0, ans=0.5
2023-06-21 00:32:40,572 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0
2023-06-21 00:33:01,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=677064.0, ans=0.0
2023-06-21 00:33:34,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=677184.0, ans=0.125
2023-06-21 00:33:34,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=677184.0, ans=0.125
2023-06-21 00:33:40,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=677184.0, ans=6.0
2023-06-21 00:33:50,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=677184.0, ans=0.125
2023-06-21 00:34:07,675 INFO [train.py:996] (3/4) Epoch 4, batch 21400, loss[loss=0.2886, simple_loss=0.3578, pruned_loss=0.1097, over 21807.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3035, pruned_loss=0.08359, over 4273700.80 frames. ], batch size: 441, lr: 7.75e-03, grad_scale: 16.0
2023-06-21 00:34:15,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=677304.0, ans=0.0
2023-06-21 00:35:34,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.398e+02 2.726e+02 3.426e+02 6.163e+02, threshold=5.451e+02, percent-clipped=1.0
2023-06-21 00:35:49,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=677544.0, ans=0.125
2023-06-21 00:36:03,774 INFO [train.py:996] (3/4) Epoch 4, batch 21450, loss[loss=0.2747, simple_loss=0.3287, pruned_loss=0.1103, over 21726.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3057, pruned_loss=0.08416, over 4276272.03 frames. ], batch size: 473, lr: 7.75e-03, grad_scale: 16.0
2023-06-21 00:37:10,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=677784.0, ans=0.125
2023-06-21 00:37:38,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=677844.0, ans=0.125
2023-06-21 00:37:46,119 INFO [train.py:996] (3/4) Epoch 4, batch 21500, loss[loss=0.2305, simple_loss=0.2997, pruned_loss=0.08064, over 20905.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3037, pruned_loss=0.08479, over 4280534.69 frames. ], batch size: 607, lr: 7.74e-03, grad_scale: 16.0
2023-06-21 00:37:57,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=677904.0, ans=0.95
2023-06-21 00:38:42,681 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 00:38:45,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=678024.0, ans=0.2
2023-06-21 00:38:48,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=678024.0, ans=0.125
2023-06-21 00:39:06,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=678084.0, ans=0.125
2023-06-21 00:39:06,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=678084.0, ans=0.1
2023-06-21 00:39:07,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=678084.0, ans=0.2
2023-06-21 00:39:13,313 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 2.885e+02 3.448e+02 4.361e+02 7.505e+02, threshold=6.896e+02, percent-clipped=8.0
2023-06-21 00:39:21,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=678084.0, ans=0.1
2023-06-21 00:39:43,084 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0
2023-06-21 00:39:43,564 INFO [train.py:996] (3/4) Epoch 4, batch 21550, loss[loss=0.1799, simple_loss=0.2544, pruned_loss=0.05272, over 21606.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2967, pruned_loss=0.08162, over 4276480.51 frames. ], batch size: 391, lr: 7.74e-03, grad_scale: 16.0
2023-06-21 00:39:56,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0
2023-06-21 00:40:10,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=678264.0, ans=0.1
2023-06-21 00:40:25,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=678324.0, ans=0.125
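The ScheduledFloat entries throughout this log track hyperparameters of the regularization modules (skip rates, balancer probabilities, bypass scales, dropout) that are annealed as a function of batch_count, with ans= giving the value in effect at that point; by this stage of training most skip rates read 0.0. A minimal sketch of the underlying idea, piecewise-linear interpolation between (batch_count, value) breakpoints; the class name and the example breakpoints below are illustrative assumptions, not the scaling.py implementation:

class PiecewiseLinearSchedule:
    # A float-valued hyperparameter defined by (batch_count, value)
    # breakpoints, linearly interpolated in between and clamped outside.

    def __init__(self, *points):
        self.points = sorted(points)

    def __call__(self, batch_count):
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                frac = (batch_count - x0) / (x1 - x0)
                return y0 + frac * (y1 - y0)

# Illustrative breakpoints: a skip rate decaying from 0.5 to 0.0 over the
# first 4000 batches would print ans=0.0 at the batch_counts seen above.
conv_skip_rate = PiecewiseLinearSchedule((0.0, 0.5), (4000.0, 0.0))
assert conv_skip_rate(678084.0) == 0.0

Logging the current ans= values alongside the loss makes it possible to correlate training behavior with where each schedule stands.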
2023-06-21 00:40:40,157 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.99 vs. limit=10.0
2023-06-21 00:41:01,966 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 00:41:20,985 INFO [train.py:996] (3/4) Epoch 4, batch 21600, loss[loss=0.2302, simple_loss=0.2801, pruned_loss=0.09011, over 21547.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2931, pruned_loss=0.08045, over 4278705.43 frames. ], batch size: 442, lr: 7.74e-03, grad_scale: 32.0
2023-06-21 00:41:38,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.85 vs. limit=10.0
2023-06-21 00:42:36,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0
2023-06-21 00:42:43,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=678684.0, ans=0.95
2023-06-21 00:42:47,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=678684.0, ans=0.1
2023-06-21 00:42:48,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.277e+02 2.634e+02 3.164e+02 4.314e+02, threshold=5.268e+02, percent-clipped=0.0
2023-06-21 00:42:55,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=678684.0, ans=0.09899494936611666
2023-06-21 00:43:05,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=678744.0, ans=0.125
2023-06-21 00:43:05,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=678744.0, ans=0.125
2023-06-21 00:43:12,686 INFO [train.py:996] (3/4) Epoch 4, batch 21650, loss[loss=0.2733, simple_loss=0.3602, pruned_loss=0.09321, over 21627.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2975, pruned_loss=0.07861, over 4281306.23 frames. ], batch size: 441, lr: 7.74e-03, grad_scale: 32.0
2023-06-21 00:44:15,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=678924.0, ans=0.2
2023-06-21 00:44:40,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=679044.0, ans=0.125
2023-06-21 00:45:01,846 INFO [train.py:996] (3/4) Epoch 4, batch 21700, loss[loss=0.2139, simple_loss=0.2727, pruned_loss=0.07758, over 21682.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2985, pruned_loss=0.07686, over 4276396.18 frames. ], batch size: 282, lr: 7.74e-03, grad_scale: 32.0
2023-06-21 00:45:48,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5
2023-06-21 00:46:24,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 2.270e+02 2.667e+02 3.310e+02 7.431e+02, threshold=5.334e+02, percent-clipped=8.0
2023-06-21 00:46:26,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=679284.0, ans=0.125
2023-06-21 00:46:46,221 INFO [train.py:996] (3/4) Epoch 4, batch 21750, loss[loss=0.2134, simple_loss=0.2741, pruned_loss=0.0763, over 21802.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2947, pruned_loss=0.07827, over 4274531.16 frames. ], batch size: 317, lr: 7.74e-03, grad_scale: 16.0
2023-06-21 00:47:04,995 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 00:48:29,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0
2023-06-21 00:48:44,938 INFO [train.py:996] (3/4) Epoch 4, batch 21800, loss[loss=0.2126, simple_loss=0.2803, pruned_loss=0.07248, over 21692.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2921, pruned_loss=0.07817, over 4274964.33 frames. ], batch size: 282, lr: 7.73e-03, grad_scale: 16.0
2023-06-21 00:49:33,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=679764.0, ans=0.0
2023-06-21 00:49:53,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.19 vs. limit=15.0
2023-06-21 00:49:57,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=679884.0, ans=0.0
2023-06-21 00:50:03,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=679884.0, ans=0.1
2023-06-21 00:50:04,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.568e+02 2.919e+02 3.829e+02 7.614e+02, threshold=5.838e+02, percent-clipped=7.0
2023-06-21 00:50:12,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=679944.0, ans=0.125
2023-06-21 00:50:41,345 INFO [train.py:996] (3/4) Epoch 4, batch 21850, loss[loss=0.2307, simple_loss=0.3036, pruned_loss=0.07894, over 21257.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2975, pruned_loss=0.07878, over 4277092.62 frames. ], batch size: 176, lr: 7.73e-03, grad_scale: 16.0
2023-06-21 00:50:54,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=680004.0, ans=0.0
2023-06-21 00:50:59,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=680004.0, ans=0.2
2023-06-21 00:51:29,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=680064.0, ans=0.125
2023-06-21 00:52:01,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=680184.0, ans=0.0
2023-06-21 00:52:44,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0
2023-06-21 00:52:47,291 INFO [train.py:996] (3/4) Epoch 4, batch 21900, loss[loss=0.2513, simple_loss=0.2889, pruned_loss=0.1068, over 21462.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2984, pruned_loss=0.07969, over 4279922.56 frames. ], batch size: 508, lr: 7.73e-03, grad_scale: 16.0
2023-06-21 00:52:47,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=680304.0, ans=0.1
2023-06-21 00:53:23,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0
2023-06-21 00:53:29,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=680424.0, ans=0.125
2023-06-21 00:54:01,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.548e+02 2.891e+02 3.393e+02 5.559e+02, threshold=5.783e+02, percent-clipped=0.0
2023-06-21 00:54:03,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=680484.0, ans=0.025
2023-06-21 00:54:23,720 INFO [train.py:996] (3/4) Epoch 4, batch 21950, loss[loss=0.1992, simple_loss=0.2821, pruned_loss=0.05821, over 21512.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2934, pruned_loss=0.07877, over 4278430.80 frames. ], batch size: 441, lr: 7.73e-03, grad_scale: 16.0
2023-06-21 00:55:12,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=680664.0, ans=0.0
2023-06-21 00:55:39,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0
2023-06-21 00:55:48,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=680784.0, ans=0.125
2023-06-21 00:56:06,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=680844.0, ans=0.0
2023-06-21 00:56:13,982 INFO [train.py:996] (3/4) Epoch 4, batch 22000, loss[loss=0.3079, simple_loss=0.3482, pruned_loss=0.1338, over 21352.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2871, pruned_loss=0.07601, over 4280704.67 frames. ], batch size: 507, lr: 7.73e-03, grad_scale: 32.0
2023-06-21 00:56:44,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=680964.0, ans=0.125
2023-06-21 00:56:58,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=680964.0, ans=0.05
2023-06-21 00:57:45,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=681084.0, ans=0.1
2023-06-21 00:57:46,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 2.067e+02 2.288e+02 2.758e+02 6.986e+02, threshold=4.576e+02, percent-clipped=2.0
2023-06-21 00:58:35,044 INFO [train.py:996] (3/4) Epoch 4, batch 22050, loss[loss=0.3485, simple_loss=0.4147, pruned_loss=0.1411, over 21428.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2904, pruned_loss=0.07701, over 4276289.65 frames. ], batch size: 471, lr: 7.73e-03, grad_scale: 32.0
2023-06-21 01:00:00,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=681444.0, ans=0.2
2023-06-21 01:00:00,324 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 01:00:01,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=681444.0, ans=0.125
2023-06-21 01:00:15,412 INFO [train.py:996] (3/4) Epoch 4, batch 22100, loss[loss=0.2686, simple_loss=0.3291, pruned_loss=0.1041, over 21834.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3046, pruned_loss=0.0828, over 4272935.43 frames. ], batch size: 351, lr: 7.72e-03, grad_scale: 32.0
2023-06-21 01:01:07,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=681624.0, ans=0.125
2023-06-21 01:01:22,379 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.888e+02 3.349e+02 4.124e+02 5.904e+02, threshold=6.697e+02, percent-clipped=15.0
2023-06-21 01:02:02,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=681744.0, ans=0.125
2023-06-21 01:02:13,255 INFO [train.py:996] (3/4) Epoch 4, batch 22150, loss[loss=0.2296, simple_loss=0.3042, pruned_loss=0.0775, over 21889.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3077, pruned_loss=0.08489, over 4277503.04 frames. ], batch size: 316, lr: 7.72e-03, grad_scale: 32.0
2023-06-21 01:02:43,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=681864.0, ans=0.0
2023-06-21 01:03:15,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=681984.0, ans=0.125
2023-06-21 01:03:20,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0
2023-06-21 01:03:31,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=681984.0, ans=0.2
2023-06-21 01:03:53,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=682044.0, ans=12.0
2023-06-21 01:04:07,377 INFO [train.py:996] (3/4) Epoch 4, batch 22200, loss[loss=0.2632, simple_loss=0.3309, pruned_loss=0.0977, over 21780.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3096, pruned_loss=0.08676, over 4286929.79 frames. ], batch size: 441, lr: 7.72e-03, grad_scale: 16.0
2023-06-21 01:04:12,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=682104.0, ans=0.0
2023-06-21 01:04:25,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=682104.0, ans=0.125
2023-06-21 01:04:49,394 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 01:05:17,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.532e+02 2.842e+02 3.263e+02 4.792e+02, threshold=5.684e+02, percent-clipped=0.0
2023-06-21 01:06:03,318 INFO [train.py:996] (3/4) Epoch 4, batch 22250, loss[loss=0.3302, simple_loss=0.3835, pruned_loss=0.1384, over 21453.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.317, pruned_loss=0.08787, over 4281563.29 frames. ], batch size: 471, lr: 7.72e-03, grad_scale: 16.0
2023-06-21 01:06:15,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=682404.0, ans=0.125
2023-06-21 01:06:26,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0
2023-06-21 01:07:04,800 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 01:07:15,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=682584.0, ans=0.2
2023-06-21 01:07:29,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=682584.0, ans=0.0
2023-06-21 01:07:44,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=682644.0, ans=0.1
2023-06-21 01:07:47,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=682644.0, ans=0.125
2023-06-21 01:07:52,745 INFO [train.py:996] (3/4) Epoch 4, batch 22300, loss[loss=0.2247, simple_loss=0.2855, pruned_loss=0.08201, over 21341.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3191, pruned_loss=0.09039, over 4286503.94 frames. ], batch size: 159, lr: 7.72e-03, grad_scale: 16.0
2023-06-21 01:08:08,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=682704.0, ans=0.1
2023-06-21 01:08:33,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=682764.0, ans=0.0
2023-06-21 01:08:51,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=682824.0, ans=0.125
2023-06-21 01:08:59,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=682824.0, ans=0.0
2023-06-21 01:08:59,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. limit=6.0
2023-06-21 01:09:12,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=682884.0, ans=0.125
2023-06-21 01:09:13,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=682884.0, ans=0.125
2023-06-21 01:09:16,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.961e+02 3.295e+02 3.929e+02 6.165e+02, threshold=6.589e+02, percent-clipped=1.0
2023-06-21 01:09:41,558 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0
2023-06-21 01:09:57,926 INFO [train.py:996] (3/4) Epoch 4, batch 22350, loss[loss=0.2418, simple_loss=0.3067, pruned_loss=0.08848, over 21875.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.317, pruned_loss=0.09042, over 4289700.10 frames. ], batch size: 371, lr: 7.72e-03, grad_scale: 16.0
2023-06-21 01:10:57,591 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 01:10:57,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=683124.0, ans=0.125
2023-06-21 01:11:22,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=683244.0, ans=0.0
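The Whitening lines in this log (e.g. metric=6.06 vs. limit=15.0 a few entries above) compare a per-module covariance statistic against a limit: the whiten modules in scaling.py push the covariance of a group of channels toward a multiple of the identity, and an entry is printed when the measured metric is notably large relative to the limit. A hedged sketch of one way such a metric can be computed, measuring the uniformity of covariance eigenvalues through traces; this is a simplified illustration under stated assumptions, not necessarily scaling.py's exact formula:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    # x: (num_frames, num_channels). Split channels into groups and measure
    # how far each group's covariance is from a scaled identity matrix.
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x.transpose(0, 1)  # (num_groups, num_frames, channels_per_group)
    cov = torch.matmul(x.transpose(1, 2), x) / num_frames
    c = cov.shape[-1]
    trace_cov = cov.diagonal(dim1=-2, dim2=-1).sum(-1)
    trace_cov_sq = (cov * cov).sum(dim=(-2, -1))  # = trace(cov @ cov)
    # Equals 1.0 when cov is proportional to the identity; grows as the
    # eigenvalue distribution of cov becomes less uniform (Cauchy-Schwarz).
    return (trace_cov_sq * c / trace_cov.pow(2)).mean().item()

Some limits are themselves scheduled rather than fixed, which is why names like whiten.whitening_limit also appear among the ScheduledFloat entries (ans=12.0 and ans=15.0 earlier in this log).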
2023-06-21 01:11:25,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.0
2023-06-21 01:11:44,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=683244.0, ans=0.2
2023-06-21 01:11:54,222 INFO [train.py:996] (3/4) Epoch 4, batch 22400, loss[loss=0.2319, simple_loss=0.2989, pruned_loss=0.0825, over 21890.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3136, pruned_loss=0.0872, over 4286703.37 frames. ], batch size: 107, lr: 7.71e-03, grad_scale: 32.0
2023-06-21 01:12:58,951 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 2.385e+02 2.684e+02 3.038e+02 4.758e+02, threshold=5.368e+02, percent-clipped=0.0
2023-06-21 01:13:33,330 INFO [train.py:996] (3/4) Epoch 4, batch 22450, loss[loss=0.2281, simple_loss=0.2959, pruned_loss=0.0802, over 21825.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3066, pruned_loss=0.08567, over 4292678.83 frames. ], batch size: 107, lr: 7.71e-03, grad_scale: 32.0
2023-06-21 01:14:20,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=683724.0, ans=0.07
2023-06-21 01:14:21,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=683724.0, ans=0.2
2023-06-21 01:14:46,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=683784.0, ans=0.5
2023-06-21 01:15:40,179 INFO [train.py:996] (3/4) Epoch 4, batch 22500, loss[loss=0.2036, simple_loss=0.263, pruned_loss=0.07209, over 20716.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3018, pruned_loss=0.08475, over 4279331.71 frames. ], batch size: 607, lr: 7.71e-03, grad_scale: 32.0
2023-06-21 01:15:40,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=683904.0, ans=0.125
2023-06-21 01:15:55,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=683964.0, ans=0.0
2023-06-21 01:16:53,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.672e+02 3.079e+02 3.545e+02 7.228e+02, threshold=6.157e+02, percent-clipped=7.0
2023-06-21 01:17:06,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=684144.0, ans=0.1
2023-06-21 01:17:38,812 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0
2023-06-21 01:17:43,733 INFO [train.py:996] (3/4) Epoch 4, batch 22550, loss[loss=0.2445, simple_loss=0.3072, pruned_loss=0.09094, over 21553.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3053, pruned_loss=0.08482, over 4280681.18 frames. ], batch size: 548, lr: 7.71e-03, grad_scale: 32.0
2023-06-21 01:18:17,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=684324.0, ans=0.0
2023-06-21 01:19:15,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=684444.0, ans=0.2
2023-06-21 01:19:34,961 INFO [train.py:996] (3/4) Epoch 4, batch 22600, loss[loss=0.1929, simple_loss=0.2603, pruned_loss=0.06271, over 21613.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3085, pruned_loss=0.08567, over 4286545.25 frames. ], batch size: 230, lr: 7.71e-03, grad_scale: 32.0
2023-06-21 01:19:54,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=684504.0, ans=0.125
2023-06-21 01:19:58,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0
2023-06-21 01:20:24,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=23.33 vs. limit=15.0
2023-06-21 01:20:55,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.560e+02 2.970e+02 3.484e+02 5.965e+02, threshold=5.939e+02, percent-clipped=0.0
2023-06-21 01:20:56,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0
2023-06-21 01:21:05,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=684744.0, ans=0.125
2023-06-21 01:21:31,528 INFO [train.py:996] (3/4) Epoch 4, batch 22650, loss[loss=0.2116, simple_loss=0.2712, pruned_loss=0.076, over 21823.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3064, pruned_loss=0.08504, over 4285188.18 frames. ], batch size: 107, lr: 7.71e-03, grad_scale: 32.0
2023-06-21 01:23:00,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=685044.0, ans=0.5
2023-06-21 01:23:06,099 INFO [train.py:996] (3/4) Epoch 4, batch 22700, loss[loss=0.2124, simple_loss=0.2582, pruned_loss=0.0833, over 20650.00 frames. ], tot_loss[loss=0.234, simple_loss=0.2996, pruned_loss=0.08415, over 4275073.00 frames. ], batch size: 607, lr: 7.70e-03, grad_scale: 32.0
2023-06-21 01:23:09,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=685104.0, ans=0.1
2023-06-21 01:24:16,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.521e+02 2.946e+02 3.485e+02 5.279e+02, threshold=5.893e+02, percent-clipped=0.0
2023-06-21 01:24:42,717 INFO [train.py:996] (3/4) Epoch 4, batch 22750, loss[loss=0.3011, simple_loss=0.3589, pruned_loss=0.1216, over 21759.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3031, pruned_loss=0.08727, over 4273152.31 frames. ], batch size: 441, lr: 7.70e-03, grad_scale: 32.0
2023-06-21 01:24:44,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=685404.0, ans=0.0
2023-06-21 01:25:40,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=685524.0, ans=0.125
2023-06-21 01:26:48,309 INFO [train.py:996] (3/4) Epoch 4, batch 22800, loss[loss=0.2239, simple_loss=0.2887, pruned_loss=0.07954, over 21660.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3069, pruned_loss=0.08964, over 4283181.01 frames. ], batch size: 263, lr: 7.70e-03, grad_scale: 32.0
2023-06-21 01:26:50,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=685704.0, ans=0.125
2023-06-21 01:27:34,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=685824.0, ans=0.125
2023-06-21 01:28:02,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.664e+02 3.134e+02 3.732e+02 6.258e+02, threshold=6.268e+02, percent-clipped=2.0
2023-06-21 01:28:27,921 INFO [train.py:996] (3/4) Epoch 4, batch 22850, loss[loss=0.2198, simple_loss=0.2759, pruned_loss=0.08183, over 21640.00 frames. ], tot_loss[loss=0.24, simple_loss=0.304, pruned_loss=0.08801, over 4284953.31 frames. ], batch size: 247, lr: 7.70e-03, grad_scale: 32.0
2023-06-21 01:29:01,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=686064.0, ans=0.125
2023-06-21 01:29:14,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=686124.0, ans=0.125
2023-06-21 01:30:17,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=686244.0, ans=0.2
2023-06-21 01:30:19,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=686244.0, ans=0.0
2023-06-21 01:30:32,288 INFO [train.py:996] (3/4) Epoch 4, batch 22900, loss[loss=0.2703, simple_loss=0.3741, pruned_loss=0.0833, over 21624.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3059, pruned_loss=0.08693, over 4285283.34 frames. ], batch size: 441, lr: 7.70e-03, grad_scale: 32.0
2023-06-21 01:30:41,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=686304.0, ans=0.04949747468305833
2023-06-21 01:30:54,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=686304.0, ans=0.2
2023-06-21 01:30:56,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0
2023-06-21 01:30:57,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=686364.0, ans=0.0
2023-06-21 01:31:20,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=686364.0, ans=0.125
2023-06-21 01:32:28,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.538e+02 2.990e+02 3.640e+02 6.063e+02, threshold=5.980e+02, percent-clipped=0.0
2023-06-21 01:32:30,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=686484.0, ans=0.125
2023-06-21 01:32:53,720 INFO [train.py:996] (3/4) Epoch 4, batch 22950, loss[loss=0.2, simple_loss=0.268, pruned_loss=0.06599, over 16472.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3203, pruned_loss=0.08574, over 4277657.61 frames. ], batch size: 60, lr: 7.70e-03, grad_scale: 32.0
2023-06-21 01:33:38,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0
2023-06-21 01:33:40,689 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 01:34:56,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=686904.0, ans=0.1
2023-06-21 01:35:05,435 INFO [train.py:996] (3/4) Epoch 4, batch 23000, loss[loss=0.2477, simple_loss=0.3055, pruned_loss=0.09497, over 21578.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3198, pruned_loss=0.08361, over 4283696.17 frames. ], batch size: 548, lr: 7.69e-03, grad_scale: 16.0
2023-06-21 01:35:38,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=686964.0, ans=0.0
2023-06-21 01:36:32,527 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 2.398e+02 2.749e+02 3.269e+02 6.833e+02, threshold=5.498e+02, percent-clipped=2.0
2023-06-21 01:36:35,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=687144.0, ans=0.0
2023-06-21 01:36:50,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=687144.0, ans=0.125
2023-06-21 01:37:10,591 INFO [train.py:996] (3/4) Epoch 4, batch 23050, loss[loss=0.2374, simple_loss=0.3006, pruned_loss=0.08716, over 21385.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3198, pruned_loss=0.08572, over 4281487.59 frames. ], batch size: 176, lr: 7.69e-03, grad_scale: 16.0
2023-06-21 01:37:11,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=687204.0, ans=0.125
2023-06-21 01:37:17,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=687204.0, ans=0.07
2023-06-21 01:37:48,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=687264.0, ans=0.0
2023-06-21 01:39:15,338 INFO [train.py:996] (3/4) Epoch 4, batch 23100, loss[loss=0.2434, simple_loss=0.294, pruned_loss=0.09639, over 21557.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3148, pruned_loss=0.08588, over 4274398.50 frames. ], batch size: 391, lr: 7.69e-03, grad_scale: 16.0
2023-06-21 01:39:45,649 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0
2023-06-21 01:39:58,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0
2023-06-21 01:40:00,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=687564.0, ans=0.0
2023-06-21 01:40:03,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=687564.0, ans=0.125
2023-06-21 01:40:09,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=687624.0, ans=0.125
2023-06-21 01:40:15,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=687624.0, ans=0.04949747468305833
2023-06-21 01:40:16,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=687624.0, ans=0.09899494936611666
2023-06-21 01:40:20,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=687684.0, ans=0.2
2023-06-21 01:40:36,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.504e+02 2.792e+02 3.257e+02 4.712e+02, threshold=5.583e+02, percent-clipped=0.0
2023-06-21 01:41:10,867 INFO [train.py:996] (3/4) Epoch 4, batch 23150, loss[loss=0.2665, simple_loss=0.3169, pruned_loss=0.1081, over 21768.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3081, pruned_loss=0.08481, over 4275715.72 frames. ], batch size: 441, lr: 7.69e-03, grad_scale: 16.0
2023-06-21 01:41:34,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=687804.0, ans=0.1
2023-06-21 01:41:59,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=687864.0, ans=0.0
2023-06-21 01:43:02,806 INFO [train.py:996] (3/4) Epoch 4, batch 23200, loss[loss=0.2288, simple_loss=0.308, pruned_loss=0.07479, over 21823.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.306, pruned_loss=0.08514, over 4281601.50 frames. ], batch size: 124, lr: 7.69e-03, grad_scale: 32.0
2023-06-21 01:44:18,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=688284.0, ans=0.1
2023-06-21 01:44:22,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=688284.0, ans=0.1
2023-06-21 01:44:27,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.558e+02 3.014e+02 3.330e+02 5.283e+02, threshold=6.028e+02, percent-clipped=0.0
2023-06-21 01:45:06,671 INFO [train.py:996] (3/4) Epoch 4, batch 23250, loss[loss=0.321, simple_loss=0.4129, pruned_loss=0.1145, over 19766.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3066, pruned_loss=0.08643, over 4282735.11 frames. ], batch size: 704, lr: 7.69e-03, grad_scale: 32.0
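The optim.py entries such as the one two lines above summarize the distribution of recent gradient norms as five quantiles (minimum, 25%, median, 75%, maximum), together with the clipping threshold in force and the percentage of recent batches whose gradients were clipped. Throughout this log the threshold equals Clipping_scale times the logged median (e.g. 2.0 x 3.014e+02 = 6.028e+02 above), so the clipping level adapts to the observed gradient scale rather than being fixed. A minimal sketch of such quantile-based adaptive clipping; the class below is a simplified, assumed stand-in for the logic in icefall's optimizer, not its actual implementation:

import torch

class AdaptiveGradClipper:
    # Clip gradients to clipping_scale x running median of recent grad norms.

    def __init__(self, clipping_scale=2.0, window=1000):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms = []  # total grad norms of recent batches
        self.num_clipped = 0
        self.num_batches = 0

    def __call__(self, parameters):
        grads = [p.grad for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([torch.norm(g) for g in grads]))
        self.norms = (self.norms + [norm.item()])[-self.window:]
        threshold = self.clipping_scale * float(
            torch.median(torch.tensor(self.norms))
        )
        self.num_batches += 1
        if norm > threshold:
            self.num_clipped += 1
            for g in grads:
                g.mul_(threshold / norm)

    def summary(self):
        q = torch.quantile(
            torch.tensor(self.norms),
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]),
        )
        pct = 100.0 * self.num_clipped / max(1, self.num_batches)
        return (
            f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
            + " ".join(f"{v:.3e}" for v in q.tolist())
            + f", threshold={self.clipping_scale * q[2].item():.3e}"
            + f", percent-clipped={pct:.1f}"
        )

Spikes in the max quantile with a nonzero percent-clipped (as in the percent-clipped=15.0 entry earlier) are exactly the batches this mechanism is meant to tame.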
2023-06-21 01:46:05,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.37 vs. limit=6.0
2023-06-21 01:46:14,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=688524.0, ans=0.125
2023-06-21 01:46:28,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=688584.0, ans=0.125
2023-06-21 01:47:16,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=688704.0, ans=0.125
2023-06-21 01:47:17,193 INFO [train.py:996] (3/4) Epoch 4, batch 23300, loss[loss=0.2413, simple_loss=0.3301, pruned_loss=0.07619, over 21425.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3154, pruned_loss=0.08831, over 4284296.85 frames. ], batch size: 211, lr: 7.68e-03, grad_scale: 32.0
2023-06-21 01:48:06,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=688824.0, ans=0.125
2023-06-21 01:48:19,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0
2023-06-21 01:48:50,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.610e+02 2.927e+02 3.361e+02 4.958e+02, threshold=5.855e+02, percent-clipped=0.0
2023-06-21 01:49:18,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=688944.0, ans=15.0
2023-06-21 01:49:30,645 INFO [train.py:996] (3/4) Epoch 4, batch 23350, loss[loss=0.1973, simple_loss=0.2788, pruned_loss=0.0579, over 21702.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3197, pruned_loss=0.08748, over 4289339.67 frames. ], batch size: 298, lr: 7.68e-03, grad_scale: 32.0
2023-06-21 01:49:55,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=689064.0, ans=0.125
2023-06-21 01:50:20,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=689124.0, ans=0.1
2023-06-21 01:51:21,121 INFO [train.py:996] (3/4) Epoch 4, batch 23400, loss[loss=0.2384, simple_loss=0.3076, pruned_loss=0.08457, over 21824.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3127, pruned_loss=0.08398, over 4284275.12 frames. ], batch size: 124, lr: 7.68e-03, grad_scale: 32.0
2023-06-21 01:51:44,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=689364.0, ans=0.125
2023-06-21 01:52:54,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=689484.0, ans=0.125
2023-06-21 01:52:58,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.326e+02 2.655e+02 3.143e+02 5.119e+02, threshold=5.310e+02, percent-clipped=0.0
2023-06-21 01:53:33,735 INFO [train.py:996] (3/4) Epoch 4, batch 23450, loss[loss=0.2222, simple_loss=0.2606, pruned_loss=0.09189, over 20280.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3142, pruned_loss=0.08684, over 4290812.39 frames. ], batch size: 702, lr: 7.68e-03, grad_scale: 32.0
2023-06-21 01:53:58,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=689664.0, ans=0.125
2023-06-21 01:54:07,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=689724.0, ans=0.1
2023-06-21 01:55:27,871 INFO [train.py:996] (3/4) Epoch 4, batch 23500, loss[loss=0.2264, simple_loss=0.293, pruned_loss=0.07986, over 21855.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3143, pruned_loss=0.08858, over 4296053.13 frames. ], batch size: 298, lr: 7.68e-03, grad_scale: 32.0
2023-06-21 01:55:32,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=689904.0, ans=0.2
2023-06-21 01:55:42,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=689964.0, ans=0.025
2023-06-21 01:55:42,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=689964.0, ans=0.0
2023-06-21 01:55:43,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=689964.0, ans=0.125
2023-06-21 01:55:49,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=689964.0, ans=0.2
2023-06-21 01:55:53,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0
2023-06-21 01:56:07,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=690024.0, ans=0.125
2023-06-21 01:56:28,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.673e+02 3.044e+02 3.448e+02 4.722e+02, threshold=6.088e+02, percent-clipped=0.0
2023-06-21 01:57:04,421 INFO [train.py:996] (3/4) Epoch 4, batch 23550, loss[loss=0.2426, simple_loss=0.2878, pruned_loss=0.09864, over 21671.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3079, pruned_loss=0.08772, over 4289485.90 frames. ], batch size: 416, lr: 7.68e-03, grad_scale: 32.0
2023-06-21 01:58:38,931 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 01:58:47,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=690444.0, ans=0.125
2023-06-21 01:59:09,017 INFO [train.py:996] (3/4) Epoch 4, batch 23600, loss[loss=0.2558, simple_loss=0.3228, pruned_loss=0.09443, over 21819.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3101, pruned_loss=0.08824, over 4283717.87 frames. ], batch size: 247, lr: 7.67e-03, grad_scale: 32.0
2023-06-21 02:00:52,713 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.585e+02 3.175e+02 3.942e+02 6.182e+02, threshold=6.349e+02, percent-clipped=1.0
2023-06-21 02:00:53,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=690684.0, ans=0.125
2023-06-21 02:01:04,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0
2023-06-21 02:01:07,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=690744.0, ans=0.0
2023-06-21 02:01:13,087 INFO [train.py:996] (3/4) Epoch 4, batch 23650, loss[loss=0.2588, simple_loss=0.3312, pruned_loss=0.09321, over 21205.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3093, pruned_loss=0.08616, over 4282697.42 frames. ], batch size: 143, lr: 7.67e-03, grad_scale: 32.0
2023-06-21 02:01:54,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=690924.0, ans=15.0
2023-06-21 02:02:32,155 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 02:02:51,052 INFO [train.py:996] (3/4) Epoch 4, batch 23700, loss[loss=0.2713, simple_loss=0.3381, pruned_loss=0.1022, over 21798.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3128, pruned_loss=0.08606, over 4277070.26 frames. ], batch size: 124, lr: 7.67e-03, grad_scale: 32.0
2023-06-21 02:03:58,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=691284.0, ans=0.025
2023-06-21 02:04:09,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.529e+02 2.918e+02 3.500e+02 6.134e+02, threshold=5.836e+02, percent-clipped=0.0
2023-06-21 02:04:16,124 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0
2023-06-21 02:04:36,865 INFO [train.py:996] (3/4) Epoch 4, batch 23750, loss[loss=0.2166, simple_loss=0.3074, pruned_loss=0.06287, over 21805.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3139, pruned_loss=0.08553, over 4275173.88 frames. ], batch size: 282, lr: 7.67e-03, grad_scale: 16.0
2023-06-21 02:04:55,513 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0
2023-06-21 02:05:00,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=691404.0, ans=0.125
2023-06-21 02:06:39,762 INFO [train.py:996] (3/4) Epoch 4, batch 23800, loss[loss=0.2462, simple_loss=0.3382, pruned_loss=0.07711, over 21638.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3116, pruned_loss=0.08322, over 4273770.76 frames. ], batch size: 263, lr: 7.67e-03, grad_scale: 16.0
2023-06-21 02:06:44,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=691704.0, ans=0.1
2023-06-21 02:06:44,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=691704.0, ans=0.1
2023-06-21 02:07:27,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0
2023-06-21 02:07:36,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=691764.0, ans=0.1
2023-06-21 02:07:38,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=691764.0, ans=0.09899494936611666
2023-06-21 02:07:45,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=691824.0, ans=0.0
2023-06-21 02:07:47,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=691824.0, ans=0.125
2023-06-21 02:08:17,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.570e+02 3.095e+02 3.507e+02 5.751e+02, threshold=6.189e+02, percent-clipped=0.0
2023-06-21 02:09:00,380 INFO [train.py:996] (3/4) Epoch 4, batch 23850, loss[loss=0.2831, simple_loss=0.3984, pruned_loss=0.0839, over 19782.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3204, pruned_loss=0.08504, over 4274723.82 frames. ], batch size: 702, lr: 7.67e-03, grad_scale: 16.0
2023-06-21 02:09:24,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0
2023-06-21 02:09:28,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5
2023-06-21 02:09:28,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=692064.0, ans=0.125
2023-06-21 02:09:43,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=692124.0, ans=0.125
2023-06-21 02:10:04,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=692184.0, ans=0.2
2023-06-21 02:10:22,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=692244.0, ans=0.125
2023-06-21 02:10:42,684 INFO [train.py:996] (3/4) Epoch 4, batch 23900, loss[loss=0.2271, simple_loss=0.3003, pruned_loss=0.07696, over 21595.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3267, pruned_loss=0.08684, over 4283640.04 frames. ], batch size: 263, lr: 7.66e-03, grad_scale: 16.0
2023-06-21 02:11:18,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=692424.0, ans=0.125
2023-06-21 02:11:33,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=692424.0, ans=0.125
2023-06-21 02:12:06,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.492e+02 2.802e+02 3.240e+02 5.431e+02, threshold=5.603e+02, percent-clipped=0.0
2023-06-21 02:12:35,446 INFO [train.py:996] (3/4) Epoch 4, batch 23950, loss[loss=0.2633, simple_loss=0.3182, pruned_loss=0.1042, over 21901.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3208, pruned_loss=0.08698, over 4283139.45 frames. ], batch size: 372, lr: 7.66e-03, grad_scale: 16.0
2023-06-21 02:13:06,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=692664.0, ans=0.125
2023-06-21 02:13:07,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.26 vs. limit=15.0
2023-06-21 02:14:34,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=692844.0, ans=0.2
2023-06-21 02:14:43,736 INFO [train.py:996] (3/4) Epoch 4, batch 24000, loss[loss=0.2609, simple_loss=0.3307, pruned_loss=0.09558, over 21713.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.321, pruned_loss=0.0893, over 4281462.15 frames. ], batch size: 298, lr: 7.66e-03, grad_scale: 32.0
2023-06-21 02:14:43,736 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 02:15:26,315 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7699, 2.3950, 2.5653, 2.9053, 2.3275, 2.4213, 2.8647, 2.8652], device='cuda:3')
2023-06-21 02:15:40,418 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.268, simple_loss=0.3653, pruned_loss=0.08536, over 1796401.00 frames.
2023-06-21 02:15:40,419 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-21 02:15:50,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0
2023-06-21 02:16:08,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0
2023-06-21 02:16:24,854 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0
2023-06-21 02:16:25,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=693024.0, ans=0.125
2023-06-21 02:16:52,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=693084.0, ans=0.125
2023-06-21 02:16:56,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.654e+02 3.041e+02 3.636e+02 5.941e+02, threshold=6.083e+02, percent-clipped=2.0
2023-06-21 02:17:11,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0
2023-06-21 02:17:17,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=693144.0, ans=0.0
2023-06-21 02:17:18,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0
2023-06-21 02:17:26,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0
2023-06-21 02:17:26,611 INFO [train.py:996] (3/4) Epoch 4, batch 24050, loss[loss=0.2317, simple_loss=0.3194, pruned_loss=0.07197, over 21686.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3236, pruned_loss=0.09022, over 4281409.56 frames.
], batch size: 389, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:17:33,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=693204.0, ans=0.125 2023-06-21 02:18:47,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=693384.0, ans=0.0 2023-06-21 02:19:19,532 INFO [train.py:996] (3/4) Epoch 4, batch 24100, loss[loss=0.2624, simple_loss=0.332, pruned_loss=0.09641, over 21300.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3228, pruned_loss=0.08829, over 4264496.18 frames. ], batch size: 159, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:19:50,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=693564.0, ans=0.125 2023-06-21 02:20:21,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=693564.0, ans=0.125 2023-06-21 02:21:09,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.445e+02 2.792e+02 3.280e+02 5.618e+02, threshold=5.584e+02, percent-clipped=0.0 2023-06-21 02:21:11,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=693744.0, ans=0.2 2023-06-21 02:21:15,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=693744.0, ans=0.125 2023-06-21 02:21:27,701 INFO [train.py:996] (3/4) Epoch 4, batch 24150, loss[loss=0.2404, simple_loss=0.3123, pruned_loss=0.08422, over 21483.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3233, pruned_loss=0.09045, over 4276680.17 frames. ], batch size: 131, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:21:38,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=693804.0, ans=0.1 2023-06-21 02:21:54,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=693864.0, ans=0.125 2023-06-21 02:23:35,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=694044.0, ans=0.2 2023-06-21 02:23:39,047 INFO [train.py:996] (3/4) Epoch 4, batch 24200, loss[loss=0.2385, simple_loss=0.3243, pruned_loss=0.0764, over 21780.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3246, pruned_loss=0.09168, over 4270227.04 frames. ], batch size: 282, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:23:45,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=694104.0, ans=0.125 2023-06-21 02:23:54,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=694164.0, ans=0.05 2023-06-21 02:25:14,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.339e+02 2.793e+02 3.596e+02 6.031e+02, threshold=5.587e+02, percent-clipped=1.0 2023-06-21 02:25:38,068 INFO [train.py:996] (3/4) Epoch 4, batch 24250, loss[loss=0.2209, simple_loss=0.3309, pruned_loss=0.05542, over 21196.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3206, pruned_loss=0.08532, over 4275527.27 frames. 
], batch size: 548, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:26:07,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=694464.0, ans=0.0 2023-06-21 02:26:55,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=694524.0, ans=0.1 2023-06-21 02:27:00,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=694584.0, ans=0.04949747468305833 2023-06-21 02:28:01,386 INFO [train.py:996] (3/4) Epoch 4, batch 24300, loss[loss=0.1628, simple_loss=0.2454, pruned_loss=0.0401, over 21773.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3118, pruned_loss=0.07819, over 4282019.17 frames. ], batch size: 282, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:29:01,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=694884.0, ans=0.125 2023-06-21 02:29:02,629 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:29:19,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 2.168e+02 3.072e+02 4.276e+02 8.509e+02, threshold=6.143e+02, percent-clipped=10.0 2023-06-21 02:29:34,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-21 02:29:56,145 INFO [train.py:996] (3/4) Epoch 4, batch 24350, loss[loss=0.2221, simple_loss=0.2942, pruned_loss=0.07497, over 21635.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3105, pruned_loss=0.07968, over 4286860.57 frames. ], batch size: 230, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:30:51,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=695124.0, ans=0.0 2023-06-21 02:31:28,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=695184.0, ans=0.0 2023-06-21 02:31:57,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=695244.0, ans=0.015 2023-06-21 02:32:00,364 INFO [train.py:996] (3/4) Epoch 4, batch 24400, loss[loss=0.2478, simple_loss=0.3219, pruned_loss=0.08686, over 21718.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3158, pruned_loss=0.08326, over 4289933.00 frames. ], batch size: 124, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:32:30,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=695364.0, ans=0.125 2023-06-21 02:32:48,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=695424.0, ans=0.125 2023-06-21 02:33:26,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.702e+02 3.028e+02 3.520e+02 5.844e+02, threshold=6.057e+02, percent-clipped=0.0 2023-06-21 02:33:32,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.40 vs. 
limit=12.0 2023-06-21 02:33:33,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=695544.0, ans=0.0 2023-06-21 02:33:59,841 INFO [train.py:996] (3/4) Epoch 4, batch 24450, loss[loss=0.2486, simple_loss=0.3312, pruned_loss=0.08303, over 21609.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3192, pruned_loss=0.08449, over 4285993.70 frames. ], batch size: 263, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:35:21,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-21 02:35:34,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=695844.0, ans=0.125 2023-06-21 02:35:50,228 INFO [train.py:996] (3/4) Epoch 4, batch 24500, loss[loss=0.238, simple_loss=0.2999, pruned_loss=0.0881, over 21553.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3186, pruned_loss=0.08378, over 4288362.15 frames. ], batch size: 211, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:36:05,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=695904.0, ans=15.0 2023-06-21 02:36:20,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-21 02:37:18,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.487e+02 2.699e+02 3.128e+02 4.356e+02, threshold=5.399e+02, percent-clipped=0.0 2023-06-21 02:38:03,314 INFO [train.py:996] (3/4) Epoch 4, batch 24550, loss[loss=0.2848, simple_loss=0.3566, pruned_loss=0.1065, over 21527.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.322, pruned_loss=0.08689, over 4293880.35 frames. ], batch size: 131, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:38:25,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=696264.0, ans=0.09899494936611666 2023-06-21 02:38:35,205 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.26 vs. limit=10.0 2023-06-21 02:38:50,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=696324.0, ans=0.125 2023-06-21 02:39:39,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=15.0 2023-06-21 02:39:57,893 INFO [train.py:996] (3/4) Epoch 4, batch 24600, loss[loss=0.2227, simple_loss=0.276, pruned_loss=0.0847, over 21216.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.317, pruned_loss=0.08684, over 4285565.68 frames. 
], batch size: 143, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:41:03,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=696684.0, ans=0.125 2023-06-21 02:41:15,580 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.636e+02 3.077e+02 3.640e+02 5.149e+02, threshold=6.153e+02, percent-clipped=0.0 2023-06-21 02:41:15,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=696744.0, ans=0.035 2023-06-21 02:41:25,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=696744.0, ans=0.0 2023-06-21 02:41:42,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=696744.0, ans=0.1 2023-06-21 02:41:49,318 INFO [train.py:996] (3/4) Epoch 4, batch 24650, loss[loss=0.2129, simple_loss=0.2849, pruned_loss=0.07043, over 21418.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3097, pruned_loss=0.08517, over 4286052.02 frames. ], batch size: 131, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:41:50,339 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-21 02:41:51,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=696804.0, ans=0.125 2023-06-21 02:42:08,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=696804.0, ans=0.125 2023-06-21 02:42:14,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-21 02:42:17,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=696864.0, ans=0.125 2023-06-21 02:42:18,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=696864.0, ans=0.125 2023-06-21 02:43:08,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=696984.0, ans=0.0 2023-06-21 02:43:39,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=697044.0, ans=0.1 2023-06-21 02:43:42,984 INFO [train.py:996] (3/4) Epoch 4, batch 24700, loss[loss=0.2176, simple_loss=0.2817, pruned_loss=0.07674, over 21798.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3069, pruned_loss=0.08377, over 4271737.26 frames. ], batch size: 107, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:44:49,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=697224.0, ans=0.0 2023-06-21 02:45:01,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-21 02:45:05,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.595e+02 3.081e+02 3.646e+02 7.591e+02, threshold=6.163e+02, percent-clipped=1.0 2023-06-21 02:45:37,772 INFO [train.py:996] (3/4) Epoch 4, batch 24750, loss[loss=0.1942, simple_loss=0.2526, pruned_loss=0.06785, over 21500.00 frames. 
], tot_loss[loss=0.2316, simple_loss=0.3003, pruned_loss=0.08146, over 4266402.31 frames. ], batch size: 195, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:46:04,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=697464.0, ans=0.125 2023-06-21 02:46:13,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=697464.0, ans=0.0 2023-06-21 02:47:48,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=697644.0, ans=0.0 2023-06-21 02:47:52,375 INFO [train.py:996] (3/4) Epoch 4, batch 24800, loss[loss=0.2253, simple_loss=0.2831, pruned_loss=0.08373, over 21744.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2954, pruned_loss=0.0812, over 4274900.33 frames. ], batch size: 247, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:48:04,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=697704.0, ans=0.0 2023-06-21 02:48:08,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=697704.0, ans=15.0 2023-06-21 02:48:19,054 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-21 02:48:30,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=697824.0, ans=0.2 2023-06-21 02:48:31,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=697824.0, ans=0.0 2023-06-21 02:48:52,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=697884.0, ans=0.0 2023-06-21 02:49:02,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.444e+02 2.860e+02 3.241e+02 4.627e+02, threshold=5.721e+02, percent-clipped=0.0 2023-06-21 02:49:29,202 INFO [train.py:996] (3/4) Epoch 4, batch 24850, loss[loss=0.2273, simple_loss=0.3048, pruned_loss=0.0749, over 21845.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.297, pruned_loss=0.08267, over 4279089.00 frames. ], batch size: 332, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:49:55,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=698064.0, ans=0.0 2023-06-21 02:49:56,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-21 02:50:27,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=698124.0, ans=0.125 2023-06-21 02:51:29,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=15.0 2023-06-21 02:51:32,383 INFO [train.py:996] (3/4) Epoch 4, batch 24900, loss[loss=0.2669, simple_loss=0.3315, pruned_loss=0.1011, over 21295.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.2994, pruned_loss=0.08303, over 4276898.12 frames. 
], batch size: 143, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:51:41,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=698304.0, ans=0.2 2023-06-21 02:51:58,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=698364.0, ans=0.125 2023-06-21 02:52:04,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=698364.0, ans=0.0 2023-06-21 02:52:46,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-21 02:53:10,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.670e+02 3.202e+02 3.948e+02 7.205e+02, threshold=6.404e+02, percent-clipped=5.0 2023-06-21 02:53:30,892 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=15.0 2023-06-21 02:53:38,988 INFO [train.py:996] (3/4) Epoch 4, batch 24950, loss[loss=0.2899, simple_loss=0.3629, pruned_loss=0.1085, over 21449.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3078, pruned_loss=0.08768, over 4282008.15 frames. ], batch size: 131, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:53:42,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=698604.0, ans=0.0 2023-06-21 02:54:14,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=698664.0, ans=0.2 2023-06-21 02:54:33,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=698724.0, ans=0.2 2023-06-21 02:55:31,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=698904.0, ans=0.0 2023-06-21 02:55:32,504 INFO [train.py:996] (3/4) Epoch 4, batch 25000, loss[loss=0.2385, simple_loss=0.301, pruned_loss=0.088, over 21085.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3142, pruned_loss=0.08946, over 4282253.10 frames. ], batch size: 143, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:56:09,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-21 02:56:19,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=699024.0, ans=0.1 2023-06-21 02:56:29,742 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:56:44,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=699084.0, ans=0.07 2023-06-21 02:56:46,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-21 02:57:02,195 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.552e+02 2.956e+02 3.917e+02 9.098e+02, threshold=5.911e+02, percent-clipped=2.0 2023-06-21 02:57:19,773 INFO [train.py:996] (3/4) Epoch 4, batch 25050, loss[loss=0.2324, simple_loss=0.285, pruned_loss=0.08992, over 21442.00 frames. 
], tot_loss[loss=0.2423, simple_loss=0.3081, pruned_loss=0.08824, over 4285546.82 frames. ], batch size: 441, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:58:21,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-21 02:58:42,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=699384.0, ans=0.2 2023-06-21 02:58:54,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=699444.0, ans=0.0 2023-06-21 02:59:25,196 INFO [train.py:996] (3/4) Epoch 4, batch 25100, loss[loss=0.2193, simple_loss=0.3028, pruned_loss=0.06797, over 21802.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.304, pruned_loss=0.08686, over 4280249.21 frames. ], batch size: 371, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 02:59:28,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=699504.0, ans=0.1 2023-06-21 02:59:54,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=699564.0, ans=15.0 2023-06-21 03:00:12,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-21 03:00:40,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=699684.0, ans=0.125 2023-06-21 03:00:40,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=699684.0, ans=0.2 2023-06-21 03:00:46,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.444e+02 2.736e+02 3.224e+02 6.054e+02, threshold=5.473e+02, percent-clipped=1.0 2023-06-21 03:01:01,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=699744.0, ans=0.125 2023-06-21 03:01:05,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=699744.0, ans=0.125 2023-06-21 03:01:08,950 INFO [train.py:996] (3/4) Epoch 4, batch 25150, loss[loss=0.2385, simple_loss=0.3427, pruned_loss=0.06716, over 20791.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3077, pruned_loss=0.0846, over 4272594.51 frames. ], batch size: 608, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:01:53,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=699924.0, ans=0.1 2023-06-21 03:02:32,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=700044.0, ans=0.125 2023-06-21 03:02:45,532 INFO [train.py:996] (3/4) Epoch 4, batch 25200, loss[loss=0.2113, simple_loss=0.2922, pruned_loss=0.06521, over 21235.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3062, pruned_loss=0.08213, over 4274473.27 frames. ], batch size: 176, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:03:03,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. 
limit=6.0 2023-06-21 03:03:26,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-21 03:04:01,010 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 2.161e+02 2.477e+02 2.831e+02 4.094e+02, threshold=4.954e+02, percent-clipped=0.0 2023-06-21 03:04:22,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=700404.0, ans=0.125 2023-06-21 03:04:23,325 INFO [train.py:996] (3/4) Epoch 4, batch 25250, loss[loss=0.2007, simple_loss=0.2612, pruned_loss=0.07005, over 21212.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3041, pruned_loss=0.08117, over 4264216.55 frames. ], batch size: 548, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:04:29,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=700404.0, ans=0.125 2023-06-21 03:05:41,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=700644.0, ans=0.125 2023-06-21 03:06:01,003 INFO [train.py:996] (3/4) Epoch 4, batch 25300, loss[loss=0.2328, simple_loss=0.2962, pruned_loss=0.08467, over 21746.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3014, pruned_loss=0.08066, over 4255478.00 frames. ], batch size: 351, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:06:01,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=700704.0, ans=0.2 2023-06-21 03:06:07,330 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:06:11,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=700704.0, ans=0.1 2023-06-21 03:06:41,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-21 03:07:10,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=700824.0, ans=0.2 2023-06-21 03:07:33,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=700884.0, ans=0.1 2023-06-21 03:07:35,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.516e+02 3.011e+02 3.707e+02 6.335e+02, threshold=6.023e+02, percent-clipped=10.0 2023-06-21 03:07:36,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=700944.0, ans=0.1 2023-06-21 03:07:36,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=700944.0, ans=0.0 2023-06-21 03:07:41,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.54 vs. 
limit=22.5 2023-06-21 03:07:42,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=700944.0, ans=0.05 2023-06-21 03:07:48,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=700944.0, ans=22.5 2023-06-21 03:07:55,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=700944.0, ans=0.2 2023-06-21 03:07:55,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=700944.0, ans=0.0 2023-06-21 03:08:04,599 INFO [train.py:996] (3/4) Epoch 4, batch 25350, loss[loss=0.2082, simple_loss=0.2786, pruned_loss=0.06896, over 21174.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3044, pruned_loss=0.08075, over 4252376.51 frames. ], batch size: 548, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:08:20,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=701004.0, ans=0.0 2023-06-21 03:08:32,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=701064.0, ans=0.1 2023-06-21 03:09:29,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=701184.0, ans=0.125 2023-06-21 03:09:46,645 INFO [train.py:996] (3/4) Epoch 4, batch 25400, loss[loss=0.1943, simple_loss=0.2575, pruned_loss=0.06558, over 21456.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3005, pruned_loss=0.07957, over 4251684.87 frames. ], batch size: 212, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:10:33,404 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:10:57,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-06-21 03:11:13,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.377e+02 2.648e+02 3.077e+02 4.908e+02, threshold=5.297e+02, percent-clipped=0.0 2023-06-21 03:11:15,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=701544.0, ans=0.125 2023-06-21 03:11:30,291 INFO [train.py:996] (3/4) Epoch 4, batch 25450, loss[loss=0.2461, simple_loss=0.3306, pruned_loss=0.08084, over 21478.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2996, pruned_loss=0.08015, over 4246225.44 frames. ], batch size: 194, lr: 7.61e-03, grad_scale: 32.0 2023-06-21 03:12:11,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=701664.0, ans=0.125 2023-06-21 03:12:43,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=701784.0, ans=0.2 2023-06-21 03:12:49,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=701784.0, ans=0.0 2023-06-21 03:12:57,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.76 vs. 
limit=15.0 2023-06-21 03:13:23,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=701844.0, ans=0.0 2023-06-21 03:13:31,392 INFO [train.py:996] (3/4) Epoch 4, batch 25500, loss[loss=0.3346, simple_loss=0.4014, pruned_loss=0.1339, over 21431.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.301, pruned_loss=0.07759, over 4235419.49 frames. ], batch size: 507, lr: 7.61e-03, grad_scale: 32.0 2023-06-21 03:13:46,245 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-21 03:14:26,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=702024.0, ans=0.125 2023-06-21 03:15:04,299 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.421e+02 2.742e+02 3.267e+02 5.395e+02, threshold=5.484e+02, percent-clipped=1.0 2023-06-21 03:15:27,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=702144.0, ans=0.1 2023-06-21 03:15:27,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=22.5 2023-06-21 03:15:28,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=702144.0, ans=0.125 2023-06-21 03:15:41,943 INFO [train.py:996] (3/4) Epoch 4, batch 25550, loss[loss=0.2207, simple_loss=0.3225, pruned_loss=0.05945, over 19827.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3087, pruned_loss=0.07808, over 4243121.49 frames. ], batch size: 702, lr: 7.61e-03, grad_scale: 16.0 2023-06-21 03:16:10,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.17 vs. limit=10.0 2023-06-21 03:16:14,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=702264.0, ans=0.2 2023-06-21 03:16:35,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=702264.0, ans=0.125 2023-06-21 03:16:59,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=702384.0, ans=0.1 2023-06-21 03:17:28,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=702444.0, ans=0.125 2023-06-21 03:17:36,136 INFO [train.py:996] (3/4) Epoch 4, batch 25600, loss[loss=0.2589, simple_loss=0.3296, pruned_loss=0.09408, over 21325.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3122, pruned_loss=0.0785, over 4246560.86 frames. 
], batch size: 548, lr: 7.61e-03, grad_scale: 32.0 2023-06-21 03:18:25,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=702564.0, ans=0.0 2023-06-21 03:18:27,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=702564.0, ans=0.125 2023-06-21 03:18:27,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=702564.0, ans=0.04949747468305833 2023-06-21 03:19:09,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=702684.0, ans=0.0 2023-06-21 03:19:14,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=702684.0, ans=0.125 2023-06-21 03:19:16,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.481e+02 2.894e+02 3.647e+02 7.641e+02, threshold=5.788e+02, percent-clipped=5.0 2023-06-21 03:19:31,805 INFO [train.py:996] (3/4) Epoch 4, batch 25650, loss[loss=0.3035, simple_loss=0.4237, pruned_loss=0.0916, over 19735.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3133, pruned_loss=0.08142, over 4246075.44 frames. ], batch size: 702, lr: 7.61e-03, grad_scale: 32.0 2023-06-21 03:19:45,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=702804.0, ans=0.125 2023-06-21 03:20:57,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-21 03:20:57,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-21 03:21:22,355 INFO [train.py:996] (3/4) Epoch 4, batch 25700, loss[loss=0.2816, simple_loss=0.3423, pruned_loss=0.1105, over 21527.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.309, pruned_loss=0.08214, over 4248418.99 frames. ], batch size: 471, lr: 7.61e-03, grad_scale: 32.0 2023-06-21 03:22:07,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=703224.0, ans=0.0 2023-06-21 03:22:43,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.692e+02 3.025e+02 3.444e+02 5.063e+02, threshold=6.050e+02, percent-clipped=0.0 2023-06-21 03:22:50,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=703344.0, ans=0.0 2023-06-21 03:23:08,321 INFO [train.py:996] (3/4) Epoch 4, batch 25750, loss[loss=0.3791, simple_loss=0.4245, pruned_loss=0.1669, over 21413.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3141, pruned_loss=0.08576, over 4252039.05 frames. 
], batch size: 508, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:23:38,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=703404.0, ans=0.125 2023-06-21 03:23:59,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=703464.0, ans=0.0 2023-06-21 03:24:04,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=703464.0, ans=0.1 2023-06-21 03:24:48,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=703524.0, ans=0.07 2023-06-21 03:25:19,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=703644.0, ans=0.125 2023-06-21 03:25:44,703 INFO [train.py:996] (3/4) Epoch 4, batch 25800, loss[loss=0.336, simple_loss=0.3881, pruned_loss=0.1419, over 21436.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3259, pruned_loss=0.09058, over 4257351.39 frames. ], batch size: 471, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:25:45,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=703704.0, ans=0.2 2023-06-21 03:26:12,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=703704.0, ans=0.125 2023-06-21 03:26:20,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=703764.0, ans=0.2 2023-06-21 03:27:33,234 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.719e+02 3.110e+02 3.607e+02 5.723e+02, threshold=6.221e+02, percent-clipped=0.0 2023-06-21 03:27:53,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-21 03:27:53,795 INFO [train.py:996] (3/4) Epoch 4, batch 25850, loss[loss=0.2831, simple_loss=0.3362, pruned_loss=0.115, over 21672.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3279, pruned_loss=0.08988, over 4262687.78 frames. ], batch size: 473, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:29:43,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=704244.0, ans=0.125 2023-06-21 03:29:45,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=704244.0, ans=0.0 2023-06-21 03:29:52,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=704244.0, ans=0.0 2023-06-21 03:29:59,622 INFO [train.py:996] (3/4) Epoch 4, batch 25900, loss[loss=0.3656, simple_loss=0.4313, pruned_loss=0.15, over 21582.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3305, pruned_loss=0.0913, over 4274836.52 frames. 
], batch size: 471, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:30:04,487 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:30:19,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=704304.0, ans=0.2 2023-06-21 03:30:20,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=704364.0, ans=0.125 2023-06-21 03:31:30,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=704484.0, ans=0.1 2023-06-21 03:31:47,712 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:31:51,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.727e+02 3.001e+02 3.716e+02 5.320e+02, threshold=6.003e+02, percent-clipped=0.0 2023-06-21 03:31:59,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=704544.0, ans=0.125 2023-06-21 03:32:06,580 INFO [train.py:996] (3/4) Epoch 4, batch 25950, loss[loss=0.2528, simple_loss=0.3297, pruned_loss=0.08801, over 21949.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3356, pruned_loss=0.09441, over 4274809.33 frames. ], batch size: 317, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:32:26,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=704604.0, ans=0.125 2023-06-21 03:33:38,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=704784.0, ans=10.0 2023-06-21 03:33:49,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=704844.0, ans=0.125 2023-06-21 03:33:52,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=704844.0, ans=0.2 2023-06-21 03:34:13,436 INFO [train.py:996] (3/4) Epoch 4, batch 26000, loss[loss=0.3087, simple_loss=0.404, pruned_loss=0.1067, over 19738.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3364, pruned_loss=0.09349, over 4266487.66 frames. ], batch size: 703, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:35:45,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.487e+02 2.984e+02 3.738e+02 5.035e+02, threshold=5.968e+02, percent-clipped=0.0 2023-06-21 03:35:58,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=705144.0, ans=0.125 2023-06-21 03:36:20,347 INFO [train.py:996] (3/4) Epoch 4, batch 26050, loss[loss=0.2569, simple_loss=0.3108, pruned_loss=0.1015, over 21924.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3354, pruned_loss=0.09401, over 4269313.70 frames. ], batch size: 351, lr: 7.59e-03, grad_scale: 32.0 2023-06-21 03:36:25,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=705204.0, ans=0.0 2023-06-21 03:36:40,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. 
limit=15.0 2023-06-21 03:36:52,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=12.0 2023-06-21 03:37:16,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=705324.0, ans=0.0 2023-06-21 03:38:13,464 INFO [train.py:996] (3/4) Epoch 4, batch 26100, loss[loss=0.2516, simple_loss=0.3236, pruned_loss=0.08981, over 21485.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3307, pruned_loss=0.09378, over 4276140.01 frames. ], batch size: 131, lr: 7.59e-03, grad_scale: 32.0 2023-06-21 03:38:29,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=705504.0, ans=0.1 2023-06-21 03:39:21,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-21 03:39:45,699 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.666e+02 3.098e+02 3.751e+02 6.264e+02, threshold=6.196e+02, percent-clipped=2.0 2023-06-21 03:39:46,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=705744.0, ans=0.0 2023-06-21 03:39:52,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=705744.0, ans=0.125 2023-06-21 03:40:00,818 INFO [train.py:996] (3/4) Epoch 4, batch 26150, loss[loss=0.2818, simple_loss=0.3438, pruned_loss=0.1099, over 21816.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3275, pruned_loss=0.09417, over 4289155.02 frames. ], batch size: 441, lr: 7.59e-03, grad_scale: 32.0 2023-06-21 03:40:11,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=705804.0, ans=0.1 2023-06-21 03:40:12,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=705804.0, ans=0.125 2023-06-21 03:41:08,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=12.0 2023-06-21 03:41:34,596 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:41:53,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=706044.0, ans=0.1 2023-06-21 03:42:30,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=706104.0, ans=0.1 2023-06-21 03:42:31,087 INFO [train.py:996] (3/4) Epoch 4, batch 26200, loss[loss=0.305, simple_loss=0.3918, pruned_loss=0.1091, over 21675.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3274, pruned_loss=0.09272, over 4283912.82 frames. 
], batch size: 441, lr: 7.59e-03, grad_scale: 16.0
2023-06-21 03:42:59,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=706164.0, ans=0.125
2023-06-21 03:43:38,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=706284.0, ans=0.0
2023-06-21 03:44:04,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.476e+02 2.776e+02 3.346e+02 6.662e+02, threshold=5.552e+02, percent-clipped=1.0
2023-06-21 03:44:26,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0
2023-06-21 03:44:38,042 INFO [train.py:996] (3/4) Epoch 4, batch 26250, loss[loss=0.2416, simple_loss=0.3097, pruned_loss=0.08674, over 21856.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3308, pruned_loss=0.09091, over 4278480.27 frames. ], batch size: 282, lr: 7.59e-03, grad_scale: 16.0
2023-06-21 03:45:06,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0
2023-06-21 03:46:35,588 INFO [train.py:996] (3/4) Epoch 4, batch 26300, loss[loss=0.2852, simple_loss=0.3361, pruned_loss=0.1171, over 21753.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3261, pruned_loss=0.09043, over 4280602.93 frames. ], batch size: 508, lr: 7.59e-03, grad_scale: 16.0
2023-06-21 03:46:38,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0
2023-06-21 03:46:42,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.26 vs. limit=10.0
2023-06-21 03:48:39,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.559e+02 2.845e+02 3.128e+02 5.313e+02, threshold=5.690e+02, percent-clipped=0.0
2023-06-21 03:48:50,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=706944.0, ans=0.2
2023-06-21 03:48:53,022 INFO [train.py:996] (3/4) Epoch 4, batch 26350, loss[loss=0.2426, simple_loss=0.2949, pruned_loss=0.09512, over 20085.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3236, pruned_loss=0.09055, over 4280250.66 frames. ], batch size: 703, lr: 7.58e-03, grad_scale: 16.0
2023-06-21 03:49:03,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=707004.0, ans=0.125
2023-06-21 03:49:13,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=707064.0, ans=10.0
2023-06-21 03:50:31,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=707244.0, ans=0.1
2023-06-21 03:50:46,308 INFO [train.py:996] (3/4) Epoch 4, batch 26400, loss[loss=0.2112, simple_loss=0.2678, pruned_loss=0.07728, over 21580.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3194, pruned_loss=0.09144, over 4270416.01 frames. ], batch size: 263, lr: 7.58e-03, grad_scale: 32.0
2023-06-21 03:50:58,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=707304.0, ans=0.125
2023-06-21 03:51:12,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=707364.0, ans=0.125
2023-06-21 03:51:25,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=707424.0, ans=0.95
2023-06-21 03:51:35,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=707424.0, ans=0.125
2023-06-21 03:52:03,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=707484.0, ans=0.125
2023-06-21 03:52:11,727 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 2.879e+02 3.292e+02 3.769e+02 5.955e+02, threshold=6.584e+02, percent-clipped=1.0
2023-06-21 03:52:33,065 INFO [train.py:996] (3/4) Epoch 4, batch 26450, loss[loss=0.2144, simple_loss=0.2734, pruned_loss=0.07768, over 21832.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3215, pruned_loss=0.09119, over 4260969.34 frames. ], batch size: 102, lr: 7.58e-03, grad_scale: 32.0
2023-06-21 03:52:51,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.85 vs. limit=6.0
2023-06-21 03:53:10,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=707664.0, ans=0.125
2023-06-21 03:54:12,190 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 03:54:54,307 INFO [train.py:996] (3/4) Epoch 4, batch 26500, loss[loss=0.2573, simple_loss=0.3408, pruned_loss=0.08694, over 21785.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.322, pruned_loss=0.08968, over 4264776.63 frames. ], batch size: 332, lr: 7.58e-03, grad_scale: 16.0
2023-06-21 03:55:03,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=707904.0, ans=0.2
2023-06-21 03:55:23,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=707904.0, ans=0.2
2023-06-21 03:55:24,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=707904.0, ans=0.125
2023-06-21 03:55:46,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=707964.0, ans=0.125
2023-06-21 03:56:14,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=708024.0, ans=0.2
2023-06-21 03:56:27,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708084.0, ans=0.1
2023-06-21 03:56:37,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708084.0, ans=0.1
2023-06-21 03:56:53,844 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.479e+02 2.895e+02 3.453e+02 8.083e+02, threshold=5.789e+02, percent-clipped=2.0
2023-06-21 03:57:25,082 INFO [train.py:996] (3/4) Epoch 4, batch 26550, loss[loss=0.1726, simple_loss=0.2415, pruned_loss=0.05181, over 21252.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3193, pruned_loss=0.08706, over 4266277.29 frames. ], batch size: 176, lr: 7.58e-03, grad_scale: 16.0
2023-06-21 03:57:27,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=708204.0, ans=0.125
2023-06-21 03:57:27,648 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0
2023-06-21 03:58:27,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=708324.0, ans=0.125
2023-06-21 03:59:12,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=708384.0, ans=0.0
2023-06-21 03:59:43,289 INFO [train.py:996] (3/4) Epoch 4, batch 26600, loss[loss=0.2248, simple_loss=0.3037, pruned_loss=0.07295, over 21732.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3161, pruned_loss=0.08323, over 4266805.97 frames. ], batch size: 351, lr: 7.58e-03, grad_scale: 16.0
2023-06-21 03:59:52,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.54 vs. limit=22.5
2023-06-21 04:00:08,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0
2023-06-21 04:00:34,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=708624.0, ans=0.125
2023-06-21 04:00:52,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=708684.0, ans=0.2
2023-06-21 04:00:55,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708684.0, ans=0.1
2023-06-21 04:01:27,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.406e+02 2.882e+02 3.356e+02 5.632e+02, threshold=5.763e+02, percent-clipped=0.0
2023-06-21 04:01:31,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=708744.0, ans=0.125
2023-06-21 04:01:39,518 INFO [train.py:996] (3/4) Epoch 4, batch 26650, loss[loss=0.2041, simple_loss=0.2692, pruned_loss=0.06947, over 21605.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3092, pruned_loss=0.08159, over 4268851.86 frames. ], batch size: 263, lr: 7.57e-03, grad_scale: 16.0
2023-06-21 04:02:37,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=708924.0, ans=0.1
2023-06-21 04:03:00,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.62 vs. limit=10.0
2023-06-21 04:03:24,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=709044.0, ans=0.1
2023-06-21 04:03:32,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.24 vs. limit=15.0
2023-06-21 04:03:38,413 INFO [train.py:996] (3/4) Epoch 4, batch 26700, loss[loss=0.218, simple_loss=0.2817, pruned_loss=0.07712, over 21301.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3015, pruned_loss=0.07819, over 4273042.64 frames. ], batch size: 608, lr: 7.57e-03, grad_scale: 16.0
2023-06-21 04:03:48,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=709104.0, ans=0.2
2023-06-21 04:04:09,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=709164.0, ans=0.125
2023-06-21 04:04:16,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=709164.0, ans=0.2
2023-06-21 04:04:19,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=709224.0, ans=0.125
2023-06-21 04:04:49,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=709284.0, ans=0.125
2023-06-21 04:05:03,602 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 2.086e+02 2.352e+02 2.691e+02 3.815e+02, threshold=4.705e+02, percent-clipped=0.0
2023-06-21 04:05:05,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=709344.0, ans=0.125
2023-06-21 04:05:20,968 INFO [train.py:996] (3/4) Epoch 4, batch 26750, loss[loss=0.2459, simple_loss=0.3278, pruned_loss=0.08202, over 21914.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3008, pruned_loss=0.07654, over 4275806.33 frames. ], batch size: 372, lr: 7.57e-03, grad_scale: 16.0
2023-06-21 04:05:21,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=709404.0, ans=0.125
2023-06-21 04:06:07,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=709524.0, ans=0.0
2023-06-21 04:06:07,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=709524.0, ans=0.125
2023-06-21 04:06:41,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=709584.0, ans=0.125
2023-06-21 04:07:04,505 INFO [train.py:996] (3/4) Epoch 4, batch 26800, loss[loss=0.2781, simple_loss=0.3459, pruned_loss=0.1051, over 21693.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3086, pruned_loss=0.08095, over 4278496.33 frames. ], batch size: 351, lr: 7.57e-03, grad_scale: 32.0
2023-06-21 04:07:17,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=709704.0, ans=0.1
2023-06-21 04:07:35,555 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:07:36,204 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5
2023-06-21 04:07:40,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0
2023-06-21 04:08:02,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=709884.0, ans=0.2
2023-06-21 04:08:09,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=709884.0, ans=0.04949747468305833
2023-06-21 04:08:19,054 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.660e+02 3.035e+02 3.389e+02 6.268e+02, threshold=6.069e+02, percent-clipped=8.0
2023-06-21 04:08:33,197 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.72 vs. limit=15.0
2023-06-21 04:08:36,515 INFO [train.py:996] (3/4) Epoch 4, batch 26850, loss[loss=0.2169, simple_loss=0.2779, pruned_loss=0.078, over 21801.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3115, pruned_loss=0.08418, over 4277135.15 frames. ], batch size: 98, lr: 7.57e-03, grad_scale: 32.0
2023-06-21 04:08:44,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=710004.0, ans=0.2
2023-06-21 04:08:50,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=710064.0, ans=0.1
2023-06-21 04:09:02,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=710064.0, ans=0.125
2023-06-21 04:09:21,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=710124.0, ans=0.125
2023-06-21 04:09:39,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=710184.0, ans=0.125
2023-06-21 04:09:40,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=710184.0, ans=0.125
2023-06-21 04:10:11,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=710304.0, ans=0.125
2023-06-21 04:10:12,271 INFO [train.py:996] (3/4) Epoch 4, batch 26900, loss[loss=0.2224, simple_loss=0.2789, pruned_loss=0.08289, over 21804.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3028, pruned_loss=0.08285, over 4276780.16 frames. ], batch size: 352, lr: 7.57e-03, grad_scale: 32.0
2023-06-21 04:10:22,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=710304.0, ans=0.0
2023-06-21 04:11:04,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=710424.0, ans=0.5
2023-06-21 04:11:08,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=710484.0, ans=0.0
2023-06-21 04:11:10,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=710484.0, ans=0.0
2023-06-21 04:11:29,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.421e+02 2.683e+02 3.097e+02 4.785e+02, threshold=5.366e+02, percent-clipped=0.0
2023-06-21 04:11:47,564 INFO [train.py:996] (3/4) Epoch 4, batch 26950, loss[loss=0.2151, simple_loss=0.2795, pruned_loss=0.07536, over 21813.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3027, pruned_loss=0.08316, over 4272286.26 frames. ], batch size: 98, lr: 7.57e-03, grad_scale: 32.0
2023-06-21 04:12:23,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0
2023-06-21 04:12:58,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=710784.0, ans=0.0
2023-06-21 04:13:24,352 INFO [train.py:996] (3/4) Epoch 4, batch 27000, loss[loss=0.2419, simple_loss=0.3254, pruned_loss=0.07924, over 21395.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3039, pruned_loss=0.08143, over 4264399.27 frames. ], batch size: 471, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:13:24,352 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 04:14:23,370 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2574, simple_loss=0.3499, pruned_loss=0.08242, over 1796401.00 frames.
2023-06-21 04:14:23,372 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-21 04:14:37,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=710964.0, ans=0.0
2023-06-21 04:15:07,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=710964.0, ans=0.1
2023-06-21 04:15:32,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=711084.0, ans=0.2
2023-06-21 04:15:48,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 2.341e+02 2.653e+02 3.174e+02 5.780e+02, threshold=5.306e+02, percent-clipped=1.0
2023-06-21 04:15:59,802 INFO [train.py:996] (3/4) Epoch 4, batch 27050, loss[loss=0.2339, simple_loss=0.3116, pruned_loss=0.07807, over 21751.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3062, pruned_loss=0.07861, over 4267088.31 frames. ], batch size: 112, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:16:06,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=711204.0, ans=0.125
2023-06-21 04:16:43,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=711324.0, ans=0.1
2023-06-21 04:16:44,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0
2023-06-21 04:17:07,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=711384.0, ans=0.0
2023-06-21 04:17:36,459 INFO [train.py:996] (3/4) Epoch 4, batch 27100, loss[loss=0.2315, simple_loss=0.3257, pruned_loss=0.06864, over 21777.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3092, pruned_loss=0.08018, over 4273415.40 frames. ], batch size: 298, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:18:28,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=711564.0, ans=0.1
2023-06-21 04:18:50,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=22.5
2023-06-21 04:19:13,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.531e+02 3.029e+02 3.575e+02 6.566e+02, threshold=6.059e+02, percent-clipped=4.0
2023-06-21 04:19:21,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=711744.0, ans=0.125
2023-06-21 04:19:24,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=711804.0, ans=0.125
2023-06-21 04:19:25,874 INFO [train.py:996] (3/4) Epoch 4, batch 27150, loss[loss=0.3371, simple_loss=0.4196, pruned_loss=0.1273, over 21538.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3198, pruned_loss=0.08296, over 4269478.06 frames. ], batch size: 471, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:19:34,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=711804.0, ans=0.125
2023-06-21 04:19:51,143 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0
2023-06-21 04:19:51,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=711864.0, ans=0.1
2023-06-21 04:20:37,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=711984.0, ans=0.2
2023-06-21 04:20:44,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=711984.0, ans=0.125
2023-06-21 04:21:13,171 INFO [train.py:996] (3/4) Epoch 4, batch 27200, loss[loss=0.2595, simple_loss=0.3257, pruned_loss=0.09667, over 21300.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3277, pruned_loss=0.08596, over 4277544.56 frames. ], batch size: 159, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:21:15,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=712104.0, ans=0.125
2023-06-21 04:21:32,797 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:21:37,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=712164.0, ans=0.0
2023-06-21 04:21:54,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=712224.0, ans=0.125
2023-06-21 04:23:02,074 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.916e+02 3.596e+02 4.478e+02 6.197e+02, threshold=7.191e+02, percent-clipped=3.0
2023-06-21 04:23:04,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=712344.0, ans=0.125
2023-06-21 04:23:20,364 INFO [train.py:996] (3/4) Epoch 4, batch 27250, loss[loss=0.3622, simple_loss=0.3989, pruned_loss=0.1628, over 21431.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3324, pruned_loss=0.0913, over 4279284.29 frames. ], batch size: 510, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:23:41,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=712464.0, ans=0.125
2023-06-21 04:23:48,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=712464.0, ans=0.2
2023-06-21 04:23:59,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=712524.0, ans=0.0
2023-06-21 04:24:02,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=712524.0, ans=0.0
2023-06-21 04:24:07,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0
2023-06-21 04:24:17,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=712584.0, ans=0.125
2023-06-21 04:25:04,909 INFO [train.py:996] (3/4) Epoch 4, batch 27300, loss[loss=0.2522, simple_loss=0.3449, pruned_loss=0.07972, over 20749.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3343, pruned_loss=0.09225, over 4281398.59 frames. ], batch size: 607, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:25:57,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=712764.0, ans=0.125
2023-06-21 04:26:17,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=712824.0, ans=0.125
2023-06-21 04:26:37,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=712884.0, ans=0.125
2023-06-21 04:26:50,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.547e+02 2.887e+02 3.256e+02 5.962e+02, threshold=5.774e+02, percent-clipped=0.0
2023-06-21 04:26:55,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=712944.0, ans=0.125
2023-06-21 04:27:01,876 INFO [train.py:996] (3/4) Epoch 4, batch 27350, loss[loss=0.27, simple_loss=0.3418, pruned_loss=0.09904, over 21720.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3367, pruned_loss=0.09274, over 4284951.31 frames. ], batch size: 414, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:28:08,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=713124.0, ans=0.125
2023-06-21 04:28:08,489 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:28:38,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=713184.0, ans=0.125
2023-06-21 04:29:05,751 INFO [train.py:996] (3/4) Epoch 4, batch 27400, loss[loss=0.2402, simple_loss=0.3075, pruned_loss=0.08644, over 21738.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3307, pruned_loss=0.09169, over 4282590.55 frames. ], batch size: 112, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:29:17,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=713304.0, ans=0.1
2023-06-21 04:30:42,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.364e+02 2.653e+02 3.007e+02 3.565e+02, threshold=5.305e+02, percent-clipped=0.0
2023-06-21 04:30:49,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=713544.0, ans=0.1
2023-06-21 04:31:01,229 INFO [train.py:996] (3/4) Epoch 4, batch 27450, loss[loss=0.2219, simple_loss=0.313, pruned_loss=0.06544, over 21556.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3231, pruned_loss=0.08897, over 4284118.18 frames. ], batch size: 230, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:32:42,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=713844.0, ans=0.125
2023-06-21 04:32:46,519 INFO [train.py:996] (3/4) Epoch 4, batch 27500, loss[loss=0.2084, simple_loss=0.2798, pruned_loss=0.06851, over 21604.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3215, pruned_loss=0.08952, over 4292413.80 frames. ], batch size: 263, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:33:33,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=714024.0, ans=0.125
2023-06-21 04:34:15,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.580e+02 3.045e+02 3.674e+02 6.292e+02, threshold=6.090e+02, percent-clipped=1.0
2023-06-21 04:34:27,551 INFO [train.py:996] (3/4) Epoch 4, batch 27550, loss[loss=0.2193, simple_loss=0.2873, pruned_loss=0.07564, over 21741.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3182, pruned_loss=0.08656, over 4282719.99 frames. ], batch size: 351, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:34:30,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=22.5
2023-06-21 04:35:02,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=714264.0, ans=0.125
2023-06-21 04:36:00,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=714444.0, ans=0.125
2023-06-21 04:36:03,177 INFO [train.py:996] (3/4) Epoch 4, batch 27600, loss[loss=0.201, simple_loss=0.2629, pruned_loss=0.06959, over 21597.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3104, pruned_loss=0.0851, over 4269118.77 frames. ], batch size: 247, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:36:25,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=714564.0, ans=0.125
2023-06-21 04:37:11,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=714684.0, ans=0.125
2023-06-21 04:37:21,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0
2023-06-21 04:37:27,293 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.399e+02 2.649e+02 3.187e+02 5.092e+02, threshold=5.298e+02, percent-clipped=0.0
2023-06-21 04:37:38,967 INFO [train.py:996] (3/4) Epoch 4, batch 27650, loss[loss=0.2236, simple_loss=0.305, pruned_loss=0.0711, over 21471.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3047, pruned_loss=0.08461, over 4271037.91 frames. ], batch size: 211, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:38:24,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=714924.0, ans=0.1
2023-06-21 04:38:27,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=714924.0, ans=0.125
2023-06-21 04:39:31,719 INFO [train.py:996] (3/4) Epoch 4, batch 27700, loss[loss=0.2438, simple_loss=0.2944, pruned_loss=0.09666, over 20229.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3041, pruned_loss=0.08364, over 4265014.23 frames. ], batch size: 703, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:39:38,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=715104.0, ans=0.1
2023-06-21 04:39:53,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0
2023-06-21 04:40:06,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0
2023-06-21 04:40:35,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=715284.0, ans=0.0
2023-06-21 04:40:45,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=715284.0, ans=0.125
2023-06-21 04:40:56,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.480e+02 2.917e+02 3.384e+02 5.795e+02, threshold=5.834e+02, percent-clipped=2.0
2023-06-21 04:41:04,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=715344.0, ans=0.2
2023-06-21 04:41:07,995 INFO [train.py:996] (3/4) Epoch 4, batch 27750, loss[loss=0.2912, simple_loss=0.367, pruned_loss=0.1077, over 21498.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3083, pruned_loss=0.08382, over 4270487.17 frames. ], batch size: 508, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:42:29,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=715644.0, ans=0.2
2023-06-21 04:42:45,359 INFO [train.py:996] (3/4) Epoch 4, batch 27800, loss[loss=0.2615, simple_loss=0.3315, pruned_loss=0.09579, over 21844.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3064, pruned_loss=0.08342, over 4275400.70 frames. ], batch size: 124, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:44:21,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.548e+02 3.003e+02 3.795e+02 7.044e+02, threshold=6.005e+02, percent-clipped=2.0
2023-06-21 04:44:33,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0
2023-06-21 04:44:33,680 INFO [train.py:996] (3/4) Epoch 4, batch 27850, loss[loss=0.2225, simple_loss=0.2938, pruned_loss=0.07567, over 21610.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3062, pruned_loss=0.08481, over 4287514.88 frames. ], batch size: 263, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:45:44,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=716184.0, ans=0.1
2023-06-21 04:45:58,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=716244.0, ans=0.125
2023-06-21 04:46:03,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=716244.0, ans=0.1
2023-06-21 04:46:17,450 INFO [train.py:996] (3/4) Epoch 4, batch 27900, loss[loss=0.2225, simple_loss=0.3037, pruned_loss=0.07061, over 21243.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3145, pruned_loss=0.08539, over 4293563.32 frames. ], batch size: 176, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:46:36,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=716364.0, ans=0.0
2023-06-21 04:46:55,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=716364.0, ans=0.125
2023-06-21 04:46:55,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=716364.0, ans=0.025
2023-06-21 04:47:14,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=716424.0, ans=0.125
2023-06-21 04:47:17,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0
2023-06-21 04:47:26,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=716484.0, ans=0.035
2023-06-21 04:47:27,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0
2023-06-21 04:48:01,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.532e+02 3.006e+02 3.855e+02 8.106e+02, threshold=6.012e+02, percent-clipped=3.0
2023-06-21 04:48:19,856 INFO [train.py:996] (3/4) Epoch 4, batch 27950, loss[loss=0.3257, simple_loss=0.3896, pruned_loss=0.1309, over 21391.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.314, pruned_loss=0.08149, over 4289626.00 frames. ], batch size: 507, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:48:31,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=716604.0, ans=0.125
2023-06-21 04:49:30,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=15.0
2023-06-21 04:50:18,745 INFO [train.py:996] (3/4) Epoch 4, batch 28000, loss[loss=0.2473, simple_loss=0.333, pruned_loss=0.08085, over 21271.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3119, pruned_loss=0.07923, over 4290131.12 frames. ], batch size: 549, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:50:58,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=717024.0, ans=0.0
2023-06-21 04:51:00,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=717024.0, ans=0.0
2023-06-21 04:51:52,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=717084.0, ans=0.125
2023-06-21 04:51:59,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.391e+02 2.762e+02 3.354e+02 5.546e+02, threshold=5.523e+02, percent-clipped=0.0
2023-06-21 04:52:22,060 INFO [train.py:996] (3/4) Epoch 4, batch 28050, loss[loss=0.2281, simple_loss=0.3079, pruned_loss=0.07412, over 21822.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3098, pruned_loss=0.08047, over 4296737.03 frames. ], batch size: 332, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:52:33,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=717204.0, ans=0.2
2023-06-21 04:52:47,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=717264.0, ans=0.125
2023-06-21 04:52:55,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=717324.0, ans=0.2
2023-06-21 04:53:28,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0
2023-06-21 04:53:43,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=717384.0, ans=0.1
2023-06-21 04:54:21,883 INFO [train.py:996] (3/4) Epoch 4, batch 28100, loss[loss=0.2151, simple_loss=0.2716, pruned_loss=0.07926, over 21236.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.308, pruned_loss=0.08057, over 4278014.54 frames. ], batch size: 159, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:54:41,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=717564.0, ans=10.0
2023-06-21 04:55:04,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=22.5
2023-06-21 04:55:07,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=717564.0, ans=0.0
2023-06-21 04:55:11,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=717624.0, ans=0.2
2023-06-21 04:55:57,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=717684.0, ans=0.025
2023-06-21 04:56:10,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=717744.0, ans=0.035
2023-06-21 04:56:20,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.718e+02 3.349e+02 4.404e+02 9.791e+02, threshold=6.698e+02, percent-clipped=12.0
2023-06-21 04:56:39,055 INFO [train.py:996] (3/4) Epoch 4, batch 28150, loss[loss=0.2246, simple_loss=0.2867, pruned_loss=0.08128, over 21464.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3014, pruned_loss=0.08037, over 4274122.82 frames. ], batch size: 132, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:57:07,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=717804.0, ans=0.125
2023-06-21 04:58:09,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=717924.0, ans=0.0
2023-06-21 04:58:28,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=717984.0, ans=0.0
2023-06-21 04:58:29,569 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:58:33,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=717984.0, ans=0.125
2023-06-21 04:58:44,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=718044.0, ans=0.0
2023-06-21 04:58:44,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=718044.0, ans=0.125
2023-06-21 04:58:44,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=718044.0, ans=0.04949747468305833
2023-06-21 04:59:19,396 INFO [train.py:996] (3/4) Epoch 4, batch 28200, loss[loss=0.2808, simple_loss=0.3458, pruned_loss=0.1079, over 21567.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3006, pruned_loss=0.08233, over 4276640.79 frames. ], batch size: 389, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:59:31,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.16 vs. limit=6.0
2023-06-21 04:59:37,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=718104.0, ans=0.2
2023-06-21 04:59:57,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=718164.0, ans=0.125
2023-06-21 04:59:58,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=718164.0, ans=0.1
2023-06-21 05:00:09,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=718164.0, ans=0.1
2023-06-21 05:00:21,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=718224.0, ans=0.0
2023-06-21 05:00:51,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=718284.0, ans=0.125
2023-06-21 05:00:51,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=718284.0, ans=0.0
2023-06-21 05:01:01,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.643e+02 3.167e+02 3.877e+02 7.013e+02, threshold=6.334e+02, percent-clipped=1.0
2023-06-21 05:01:03,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=718344.0, ans=15.0
2023-06-21 05:01:27,688 INFO [train.py:996] (3/4) Epoch 4, batch 28250, loss[loss=0.2197, simple_loss=0.2846, pruned_loss=0.07743, over 22016.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3049, pruned_loss=0.08507, over 4269861.08 frames. ], batch size: 103, lr: 7.52e-03, grad_scale: 32.0
2023-06-21 05:02:15,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0
2023-06-21 05:04:18,248 INFO [train.py:996] (3/4) Epoch 4, batch 28300, loss[loss=0.2438, simple_loss=0.3352, pruned_loss=0.07621, over 21496.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3037, pruned_loss=0.08308, over 4271450.10 frames. ], batch size: 471, lr: 7.52e-03, grad_scale: 32.0
2023-06-21 05:04:26,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=718704.0, ans=0.0
2023-06-21 05:04:36,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=718764.0, ans=0.1
2023-06-21 05:04:39,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=718764.0, ans=0.0
2023-06-21 05:04:48,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=718764.0, ans=0.1
2023-06-21 05:05:08,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=718824.0, ans=0.02
2023-06-21 05:06:09,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 2.185e+02 2.470e+02 3.133e+02 5.042e+02, threshold=4.941e+02, percent-clipped=0.0
2023-06-21 05:06:39,497 INFO [train.py:996] (3/4) Epoch 4, batch 28350, loss[loss=0.2816, simple_loss=0.324, pruned_loss=0.1196, over 21343.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2997, pruned_loss=0.07756, over 4268339.98 frames. ], batch size: 507, lr: 7.52e-03, grad_scale: 32.0
2023-06-21 05:07:06,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=719004.0, ans=0.2
2023-06-21 05:08:03,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=719124.0, ans=0.0
2023-06-21 05:08:46,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=719184.0, ans=0.0
2023-06-21 05:08:55,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=719244.0, ans=0.0
2023-06-21 05:09:31,701 INFO [train.py:996] (3/4) Epoch 4, batch 28400, loss[loss=0.2637, simple_loss=0.3175, pruned_loss=0.1049, over 21604.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2957, pruned_loss=0.07735, over 4268447.69 frames. ], batch size: 441, lr: 7.52e-03, grad_scale: 32.0
2023-06-21 05:09:51,967 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 05:10:47,016 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.79 vs. limit=10.0
2023-06-21 05:11:17,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=719484.0, ans=0.0
2023-06-21 05:11:33,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=719544.0, ans=0.0
2023-06-21 05:11:42,785 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.715e+02 3.259e+02 3.998e+02 7.508e+02, threshold=6.518e+02, percent-clipped=8.0
2023-06-21 05:12:07,595 INFO [train.py:996] (3/4) Epoch 4, batch 28450, loss[loss=0.3248, simple_loss=0.3756, pruned_loss=0.137, over 21500.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3013, pruned_loss=0.08143, over 4263179.18 frames. ], batch size: 471, lr: 7.52e-03, grad_scale: 32.0
2023-06-21 05:12:23,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=719604.0, ans=0.1
2023-06-21 05:12:58,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=719664.0, ans=0.125
2023-06-21 05:13:13,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=719724.0, ans=0.125
2023-06-21 05:13:19,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=719724.0, ans=0.125
2023-06-21 05:14:31,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=719844.0, ans=0.2
2023-06-21 05:14:34,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0
2023-06-21 05:14:54,378 INFO [train.py:996] (3/4) Epoch 4, batch 28500, loss[loss=0.2428, simple_loss=0.3149, pruned_loss=0.08531, over 21774.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3051, pruned_loss=0.08448, over 4275533.26 frames. ], batch size: 112, lr: 7.52e-03, grad_scale: 32.0
2023-06-21 05:15:31,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=719964.0, ans=0.2
2023-06-21 05:16:10,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=720024.0, ans=0.0
2023-06-21 05:16:15,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=720024.0, ans=0.125
2023-06-21 05:16:53,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=720084.0, ans=0.125
2023-06-21 05:16:54,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=720144.0, ans=0.035
2023-06-21 05:16:54,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=720144.0, ans=0.2
2023-06-21 05:16:57,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=720144.0, ans=0.125
2023-06-21 05:17:05,831 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.689e+02 3.008e+02 3.521e+02 7.488e+02, threshold=6.015e+02, percent-clipped=1.0
2023-06-21 05:17:06,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=720144.0, ans=0.1
2023-06-21 05:17:07,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=720144.0, ans=0.1
2023-06-21 05:17:37,831 INFO [train.py:996] (3/4) Epoch 4, batch 28550, loss[loss=0.2762, simple_loss=0.3734, pruned_loss=0.08954, over 21229.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3142, pruned_loss=0.08808, over 4282937.94 frames. ], batch size: 548, lr: 7.51e-03, grad_scale: 32.0
2023-06-21 05:17:53,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=720204.0, ans=0.125
2023-06-21 05:18:09,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=720264.0, ans=10.0
2023-06-21 05:18:11,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0
2023-06-21 05:18:39,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=720324.0, ans=0.2
2023-06-21 05:19:17,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=720384.0, ans=0.0
2023-06-21 05:19:38,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=720384.0, ans=0.1
2023-06-21 05:19:38,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=720384.0, ans=0.125
2023-06-21 05:19:48,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=22.5
2023-06-21 05:19:49,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=720444.0, ans=0.0
2023-06-21 05:19:54,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=720444.0, ans=0.0
2023-06-21 05:20:20,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=720444.0, ans=0.0
2023-06-21 05:20:28,540 INFO [train.py:996] (3/4) Epoch 4, batch 28600, loss[loss=0.258, simple_loss=0.3265, pruned_loss=0.09477, over 21951.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.321, pruned_loss=0.09005, over 4285790.29 frames. ], batch size: 373, lr: 7.51e-03, grad_scale: 16.0
2023-06-21 05:20:34,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=720504.0, ans=0.125
2023-06-21 05:20:37,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=720504.0, ans=0.2
2023-06-21 05:22:37,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.703e+02 3.009e+02 3.352e+02 5.683e+02, threshold=6.017e+02, percent-clipped=0.0
2023-06-21 05:22:51,419 INFO [train.py:996] (3/4) Epoch 4, batch 28650, loss[loss=0.2199, simple_loss=0.2768, pruned_loss=0.08152, over 21263.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3155, pruned_loss=0.08931, over 4272907.47 frames. ], batch size: 549, lr: 7.51e-03, grad_scale: 16.0
2023-06-21 05:23:09,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=720804.0, ans=0.125
2023-06-21 05:23:50,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=720864.0, ans=0.125
2023-06-21 05:24:20,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=720924.0, ans=0.125
2023-06-21 05:25:30,148 INFO [train.py:996] (3/4) Epoch 4, batch 28700, loss[loss=0.2478, simple_loss=0.3085, pruned_loss=0.09361, over 21423.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3147, pruned_loss=0.09015, over 4272740.02 frames. ], batch size: 211, lr: 7.51e-03, grad_scale: 16.0
2023-06-21 05:26:52,641 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 05:27:52,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.736e+02 3.172e+02 3.821e+02 7.853e+02, threshold=6.343e+02, percent-clipped=6.0
2023-06-21 05:28:04,614 INFO [train.py:996] (3/4) Epoch 4, batch 28750, loss[loss=0.2321, simple_loss=0.3092, pruned_loss=0.0775, over 21793.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3145, pruned_loss=0.0904, over 4279776.30 frames. ], batch size: 298, lr: 7.51e-03, grad_scale: 16.0
2023-06-21 05:28:27,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0
2023-06-21 05:28:50,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=721464.0, ans=0.125
2023-06-21 05:29:06,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=721464.0, ans=0.1
2023-06-21 05:29:16,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=721524.0, ans=0.0
2023-06-21 05:29:41,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=721524.0, ans=0.125
2023-06-21 05:30:56,310 INFO [train.py:996] (3/4) Epoch 4, batch 28800, loss[loss=0.266, simple_loss=0.3399, pruned_loss=0.09605, over 21592.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3177, pruned_loss=0.08967, over 4282385.27 frames. ], batch size: 389, lr: 7.51e-03, grad_scale: 32.0
2023-06-21 05:31:52,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=721764.0, ans=0.09899494936611666
2023-06-21 05:32:37,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=721884.0, ans=0.125
2023-06-21 05:33:14,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=721944.0, ans=0.2
2023-06-21 05:33:18,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.546e+02 2.847e+02 3.234e+02 4.509e+02, threshold=5.694e+02, percent-clipped=0.0
2023-06-21 05:33:30,217 INFO [train.py:996] (3/4) Epoch 4, batch 28850, loss[loss=0.2227, simple_loss=0.2881, pruned_loss=0.07866, over 21837.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3182, pruned_loss=0.09096, over 4287939.24 frames. ], batch size: 247, lr: 7.51e-03, grad_scale: 32.0
2023-06-21 05:35:18,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0
2023-06-21 05:35:55,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=722244.0, ans=0.0
2023-06-21 05:36:12,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=722244.0, ans=0.125
2023-06-21 05:36:22,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=722244.0, ans=0.1
2023-06-21 05:36:42,440 INFO [train.py:996] (3/4) Epoch 4, batch 28900, loss[loss=0.2922, simple_loss=0.3988, pruned_loss=0.09275, over 19920.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3209, pruned_loss=0.09248, over 4287696.86 frames. ], batch size: 702, lr: 7.50e-03, grad_scale: 32.0
2023-06-21 05:37:34,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=722364.0, ans=0.0
2023-06-21 05:37:34,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=722364.0, ans=0.125
2023-06-21 05:37:46,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=722364.0, ans=0.1
2023-06-21 05:38:05,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=722424.0, ans=0.2
2023-06-21 05:38:40,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=722484.0, ans=0.0
2023-06-21 05:39:05,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.757e+02 3.326e+02 3.999e+02 7.317e+02, threshold=6.653e+02, percent-clipped=7.0
2023-06-21 05:39:44,253 INFO [train.py:996] (3/4) Epoch 4, batch 28950, loss[loss=0.3185, simple_loss=0.3841, pruned_loss=0.1264, over 21531.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3209, pruned_loss=0.09152, over 4275004.33 frames. ], batch size: 508, lr: 7.50e-03, grad_scale: 32.0
2023-06-21 05:40:19,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5
2023-06-21 05:41:13,005 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 05:41:17,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=722784.0, ans=0.125
2023-06-21 05:41:23,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=722784.0, ans=0.125
2023-06-21 05:41:24,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=722784.0, ans=0.0
2023-06-21 05:41:41,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.08 vs. limit=15.0
2023-06-21 05:42:33,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.78 vs. limit=15.0
2023-06-21 05:42:35,360 INFO [train.py:996] (3/4) Epoch 4, batch 29000, loss[loss=0.2709, simple_loss=0.3493, pruned_loss=0.09628, over 21404.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3246, pruned_loss=0.09059, over 4270469.33 frames. ], batch size: 131, lr: 7.50e-03, grad_scale: 32.0
2023-06-21 05:43:21,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=722964.0, ans=0.125
2023-06-21 05:44:00,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=723024.0, ans=0.0
2023-06-21 05:44:06,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=723084.0, ans=0.1
2023-06-21 05:44:36,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=723084.0, ans=0.04949747468305833
2023-06-21 05:44:46,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=723084.0, ans=0.125
2023-06-21 05:44:54,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.818e+02 3.202e+02 3.885e+02 5.481e+02, threshold=6.403e+02, percent-clipped=0.0
2023-06-21 05:45:27,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0
2023-06-21 05:45:34,349 INFO [train.py:996] (3/4) Epoch 4, batch 29050, loss[loss=0.2454, simple_loss=0.3112, pruned_loss=0.08977, over 20111.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3234, pruned_loss=0.09151, over 4272675.73 frames. ], batch size: 702, lr: 7.50e-03, grad_scale: 32.0
2023-06-21 05:45:40,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=723204.0, ans=0.125
2023-06-21 05:46:20,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=723324.0, ans=0.125
2023-06-21 05:47:30,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=723444.0, ans=0.125
2023-06-21 05:47:40,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=723444.0, ans=0.2
2023-06-21 05:47:50,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=723504.0, ans=0.0
2023-06-21 05:47:50,945 INFO [train.py:996] (3/4) Epoch 4, batch 29100, loss[loss=0.2061, simple_loss=0.2644, pruned_loss=0.07395, over 21697.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.314, pruned_loss=0.08886, over 4276254.48 frames. ], batch size: 316, lr: 7.50e-03, grad_scale: 32.0
2023-06-21 05:48:03,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=723504.0, ans=0.0
2023-06-21 05:48:30,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=723564.0, ans=0.0
2023-06-21 05:48:50,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=723564.0, ans=0.125
2023-06-21 05:49:47,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=723684.0, ans=0.125
2023-06-21 05:50:06,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.494e+02 2.914e+02 3.696e+02 5.946e+02, threshold=5.828e+02, percent-clipped=0.0
2023-06-21 05:50:11,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=723744.0, ans=0.2
2023-06-21 05:50:19,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=723744.0, ans=0.0
2023-06-21 05:50:36,713 INFO [train.py:996] (3/4) Epoch 4, batch 29150, loss[loss=0.2235, simple_loss=0.3146, pruned_loss=0.06621, over 21635.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3126, pruned_loss=0.08679, over 4275358.57 frames. ], batch size: 263, lr: 7.50e-03, grad_scale: 32.0
2023-06-21 05:51:31,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=723864.0, ans=0.2
2023-06-21 05:51:34,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0
2023-06-21 05:51:51,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=723924.0, ans=0.0
2023-06-21 05:52:46,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=724044.0, ans=0.2
2023-06-21 05:53:07,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0
2023-06-21 05:53:18,581 INFO [train.py:996] (3/4) Epoch 4, batch 29200, loss[loss=0.2076, simple_loss=0.2674, pruned_loss=0.07395, over 21407.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3087, pruned_loss=0.08578, over 4274933.00 frames. ], batch size: 194, lr: 7.49e-03, grad_scale: 32.0
2023-06-21 05:53:19,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=724104.0, ans=0.1
2023-06-21 05:53:23,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=724104.0, ans=0.2
2023-06-21 05:53:25,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=724104.0, ans=0.125
2023-06-21 05:54:46,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=724224.0, ans=0.0
2023-06-21 05:55:17,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=724284.0, ans=0.1
2023-06-21 05:55:24,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=724284.0, ans=0.5
2023-06-21 05:55:44,461 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.500e+02 2.791e+02 3.395e+02 6.739e+02, threshold=5.582e+02, percent-clipped=1.0
2023-06-21 05:55:47,855 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 05:55:53,793 INFO [train.py:996] (3/4) Epoch 4, batch 29250, loss[loss=0.2157, simple_loss=0.2818, pruned_loss=0.07479, over 21794.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3067, pruned_loss=0.08362, over 4274637.13 frames. ], batch size: 98, lr: 7.49e-03, grad_scale: 32.0
2023-06-21 05:56:09,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=724464.0, ans=0.0
2023-06-21 05:56:09,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=724464.0, ans=0.04949747468305833
2023-06-21 05:57:27,675 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 05:58:05,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=724644.0, ans=0.1
2023-06-21 05:58:12,173 INFO [train.py:996] (3/4) Epoch 4, batch 29300, loss[loss=0.2123, simple_loss=0.2735, pruned_loss=0.07552, over 21564.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3084, pruned_loss=0.08317, over 4269050.48 frames.
], batch size: 231, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 05:58:58,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=724764.0, ans=0.125 2023-06-21 05:59:00,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=724764.0, ans=0.035 2023-06-21 05:59:31,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=724824.0, ans=0.125 2023-06-21 05:59:43,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=724884.0, ans=0.125 2023-06-21 06:00:15,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=724884.0, ans=0.0 2023-06-21 06:00:47,270 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.479e+02 2.818e+02 3.454e+02 4.910e+02, threshold=5.636e+02, percent-clipped=0.0 2023-06-21 06:00:58,264 INFO [train.py:996] (3/4) Epoch 4, batch 29350, loss[loss=0.2145, simple_loss=0.2815, pruned_loss=0.07375, over 21128.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3053, pruned_loss=0.08181, over 4258087.46 frames. ], batch size: 143, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 06:01:28,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=725004.0, ans=0.0 2023-06-21 06:01:33,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=725064.0, ans=0.0 2023-06-21 06:01:37,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-21 06:02:43,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=725184.0, ans=0.125 2023-06-21 06:02:46,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=725184.0, ans=0.0 2023-06-21 06:03:11,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.09 vs. limit=6.0 2023-06-21 06:03:35,560 INFO [train.py:996] (3/4) Epoch 4, batch 29400, loss[loss=0.2226, simple_loss=0.3051, pruned_loss=0.07008, over 21731.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.303, pruned_loss=0.07918, over 4257399.80 frames. ], batch size: 351, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 06:03:36,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.56 vs. limit=22.5 2023-06-21 06:03:42,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=725304.0, ans=0.2 2023-06-21 06:05:19,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-21 06:06:06,852 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.554e+02 2.866e+02 3.298e+02 4.992e+02, threshold=5.731e+02, percent-clipped=0.0 2023-06-21 06:06:14,122 INFO [train.py:996] (3/4) Epoch 4, batch 29450, loss[loss=0.2544, simple_loss=0.3289, pruned_loss=0.08998, over 21370.00 frames. 
], tot_loss[loss=0.2304, simple_loss=0.3029, pruned_loss=0.07893, over 4266032.78 frames. ], batch size: 549, lr: 7.49e-03, grad_scale: 16.0 2023-06-21 06:07:03,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=725664.0, ans=0.2 2023-06-21 06:08:17,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-21 06:08:50,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=725844.0, ans=0.125 2023-06-21 06:08:59,135 INFO [train.py:996] (3/4) Epoch 4, batch 29500, loss[loss=0.2212, simple_loss=0.2849, pruned_loss=0.07879, over 21789.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.306, pruned_loss=0.08139, over 4272211.43 frames. ], batch size: 247, lr: 7.49e-03, grad_scale: 16.0 2023-06-21 06:08:59,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=725904.0, ans=0.1 2023-06-21 06:09:29,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=725904.0, ans=0.0 2023-06-21 06:09:54,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-21 06:10:43,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0 2023-06-21 06:10:44,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=726084.0, ans=0.025 2023-06-21 06:11:02,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=726144.0, ans=0.0 2023-06-21 06:11:27,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.546e+02 2.828e+02 3.561e+02 7.164e+02, threshold=5.655e+02, percent-clipped=2.0 2023-06-21 06:11:36,388 INFO [train.py:996] (3/4) Epoch 4, batch 29550, loss[loss=0.2233, simple_loss=0.294, pruned_loss=0.07631, over 21461.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3066, pruned_loss=0.08405, over 4277851.35 frames. ], batch size: 144, lr: 7.48e-03, grad_scale: 16.0 2023-06-21 06:12:32,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=726264.0, ans=0.125 2023-06-21 06:12:33,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726264.0, ans=0.1 2023-06-21 06:12:45,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=726264.0, ans=0.1 2023-06-21 06:13:52,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=726384.0, ans=0.0 2023-06-21 06:14:36,516 INFO [train.py:996] (3/4) Epoch 4, batch 29600, loss[loss=0.2611, simple_loss=0.3464, pruned_loss=0.08796, over 21802.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3131, pruned_loss=0.08637, over 4282210.43 frames. 
], batch size: 282, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:14:57,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=726504.0, ans=0.2 2023-06-21 06:15:53,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=726624.0, ans=0.125 2023-06-21 06:17:16,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726744.0, ans=0.1 2023-06-21 06:17:17,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=726744.0, ans=10.0 2023-06-21 06:17:17,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 2.703e+02 2.966e+02 3.927e+02 7.887e+02, threshold=5.932e+02, percent-clipped=5.0 2023-06-21 06:17:25,413 INFO [train.py:996] (3/4) Epoch 4, batch 29650, loss[loss=0.2243, simple_loss=0.2891, pruned_loss=0.07977, over 21451.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3161, pruned_loss=0.08508, over 4279426.96 frames. ], batch size: 144, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:17:30,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-21 06:20:05,844 INFO [train.py:996] (3/4) Epoch 4, batch 29700, loss[loss=0.2397, simple_loss=0.3325, pruned_loss=0.07341, over 21311.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3157, pruned_loss=0.08457, over 4279280.36 frames. ], batch size: 159, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:21:29,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-21 06:21:32,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=727224.0, ans=0.125 2023-06-21 06:22:19,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=727344.0, ans=0.125 2023-06-21 06:22:20,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=727344.0, ans=0.125 2023-06-21 06:22:32,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.514e+02 2.951e+02 3.470e+02 6.624e+02, threshold=5.902e+02, percent-clipped=3.0 2023-06-21 06:22:49,526 INFO [train.py:996] (3/4) Epoch 4, batch 29750, loss[loss=0.2202, simple_loss=0.3115, pruned_loss=0.06445, over 21700.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3197, pruned_loss=0.08411, over 4276513.08 frames. ], batch size: 263, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:24:14,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=727524.0, ans=0.1 2023-06-21 06:24:20,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=727524.0, ans=0.2 2023-06-21 06:24:26,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=727584.0, ans=0.125 2023-06-21 06:25:34,489 INFO [train.py:996] (3/4) Epoch 4, batch 29800, loss[loss=0.2454, simple_loss=0.3075, pruned_loss=0.09165, over 21808.00 frames. 
], tot_loss[loss=0.2451, simple_loss=0.3201, pruned_loss=0.08509, over 4282083.21 frames. ], batch size: 389, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:27:12,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-21 06:27:23,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=727884.0, ans=0.125 2023-06-21 06:27:35,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=727884.0, ans=0.0 2023-06-21 06:27:53,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.31 vs. limit=10.0 2023-06-21 06:28:12,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.465e+02 2.740e+02 3.189e+02 4.611e+02, threshold=5.479e+02, percent-clipped=0.0 2023-06-21 06:28:17,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=727944.0, ans=0.0 2023-06-21 06:28:19,323 INFO [train.py:996] (3/4) Epoch 4, batch 29850, loss[loss=0.2067, simple_loss=0.2864, pruned_loss=0.06355, over 21430.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3165, pruned_loss=0.08288, over 4274341.81 frames. ], batch size: 131, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:31:05,309 INFO [train.py:996] (3/4) Epoch 4, batch 29900, loss[loss=0.2095, simple_loss=0.2837, pruned_loss=0.06764, over 21238.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3143, pruned_loss=0.08376, over 4288461.66 frames. ], batch size: 176, lr: 7.47e-03, grad_scale: 16.0 2023-06-21 06:32:45,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=728484.0, ans=0.2 2023-06-21 06:33:31,147 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.587e+02 2.985e+02 3.529e+02 5.856e+02, threshold=5.969e+02, percent-clipped=3.0 2023-06-21 06:33:48,921 INFO [train.py:996] (3/4) Epoch 4, batch 29950, loss[loss=0.255, simple_loss=0.3216, pruned_loss=0.09417, over 21739.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3171, pruned_loss=0.08724, over 4290351.65 frames. ], batch size: 332, lr: 7.47e-03, grad_scale: 16.0 2023-06-21 06:34:51,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=728724.0, ans=0.0 2023-06-21 06:35:47,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=728784.0, ans=0.125 2023-06-21 06:36:22,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=728844.0, ans=0.125 2023-06-21 06:36:36,777 INFO [train.py:996] (3/4) Epoch 4, batch 30000, loss[loss=0.2164, simple_loss=0.3085, pruned_loss=0.06219, over 21640.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3185, pruned_loss=0.08678, over 4283784.66 frames. ], batch size: 230, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:36:36,777 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 06:37:40,809 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2514, simple_loss=0.3484, pruned_loss=0.07722, over 1796401.00 frames. 
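The optim.py records in this log follow a fixed pattern: the five values after "grad-norm quartiles" are the minimum, 25th percentile, median, 75th percentile, and maximum of recent per-step gradient norms, and in every record the threshold equals Clipping_scale (here 2.0) times the median, with percent-clipped the share of recent steps whose norm exceeded that threshold. The sketch below shows one way such records could be produced; it is a minimal reconstruction assuming plain PyTorch, and GradNormMonitor, clip(), and report() are illustrative names, not the actual icefall optim.py API.

import torch
from collections import deque

class GradNormMonitor:
    """Illustrative sketch (not icefall's real code): track recent total
    gradient norms, clip to clipping_scale * median, and print records
    shaped like the 'Clipping_scale=2.0, grad-norm quartiles ...' lines."""

    def __init__(self, clipping_scale=2.0, window=100):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent per-step gradient norms
        self.seen = 0
        self.clipped = 0

    def _quartiles(self):
        # min / 25% / median / 75% / max over the recent window
        return torch.quantile(
            torch.tensor(list(self.norms)),
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]),
        ).tolist()

    def clip(self, model):
        # Total gradient norm across all parameters for this step.
        norm = torch.norm(torch.stack(
            [p.grad.norm() for p in model.parameters() if p.grad is not None]
        )).item()
        self.norms.append(norm)
        threshold = self.clipping_scale * self._quartiles()[2]  # 2.0 x median
        self.seen += 1
        if norm > threshold:
            self.clipped += 1
        # clip_grad_norm_ only rescales when the norm exceeds the threshold.
        torch.nn.utils.clip_grad_norm_(model.parameters(), threshold)
        return norm

    def report(self):
        q = self._quartiles()
        print("Clipping_scale=%.1f, grad-norm quartiles %s, threshold=%.3e, "
              "percent-clipped=%.1f" % (
                  self.clipping_scale,
                  " ".join("%.3e" % v for v in q),
                  self.clipping_scale * q[2],
                  100.0 * self.clipped / max(self.seen, 1)))
        self.seen = self.clipped = 0

The "Maximum memory allocated so far is ...MB" record that train.py prints right after each validation pass can be reconstructed the same way; this too is an assumed snippet rather than the verbatim source, with device standing in for this rank's CUDA device:

import logging
mem_mb = torch.cuda.max_memory_allocated(device) // 1_000_000  # bytes -> MB
logging.info(f"Maximum memory allocated so far is {mem_mb}MB")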
2023-06-21 06:37:40,810 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-21 06:37:47,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=728904.0, ans=0.125 2023-06-21 06:38:07,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=728964.0, ans=0.125 2023-06-21 06:38:39,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=729024.0, ans=0.125 2023-06-21 06:40:00,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.416e+02 2.798e+02 3.507e+02 5.014e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-21 06:40:22,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=729144.0, ans=0.0 2023-06-21 06:40:22,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=729144.0, ans=0.025 2023-06-21 06:40:26,619 INFO [train.py:996] (3/4) Epoch 4, batch 30050, loss[loss=0.2453, simple_loss=0.3396, pruned_loss=0.07552, over 21773.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3211, pruned_loss=0.0836, over 4276347.05 frames. ], batch size: 316, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:40:31,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=729204.0, ans=0.125 2023-06-21 06:41:20,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-21 06:41:31,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=729324.0, ans=0.2 2023-06-21 06:42:25,773 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:42:37,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=729444.0, ans=0.125 2023-06-21 06:42:43,069 INFO [train.py:996] (3/4) Epoch 4, batch 30100, loss[loss=0.2714, simple_loss=0.304, pruned_loss=0.1193, over 21292.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3206, pruned_loss=0.08401, over 4272120.75 frames. ], batch size: 507, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:44:19,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=729624.0, ans=0.0 2023-06-21 06:44:43,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=729684.0, ans=0.125 2023-06-21 06:44:43,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=729684.0, ans=0.125 2023-06-21 06:44:56,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=729744.0, ans=0.125 2023-06-21 06:45:00,414 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.306e+02 2.761e+02 3.179e+02 3.830e+02 7.077e+02, threshold=6.357e+02, percent-clipped=3.0 2023-06-21 06:45:32,473 INFO [train.py:996] (3/4) Epoch 4, batch 30150, loss[loss=0.2486, simple_loss=0.3108, pruned_loss=0.09323, over 21646.00 frames. 
], tot_loss[loss=0.2455, simple_loss=0.3194, pruned_loss=0.08578, over 4266793.05 frames. ], batch size: 263, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:46:42,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.97 vs. limit=6.0 2023-06-21 06:46:45,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=729864.0, ans=0.125 2023-06-21 06:46:48,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-21 06:47:11,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=729984.0, ans=0.125 2023-06-21 06:48:06,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=730044.0, ans=0.0 2023-06-21 06:48:29,932 INFO [train.py:996] (3/4) Epoch 4, batch 30200, loss[loss=0.2449, simple_loss=0.3068, pruned_loss=0.09146, over 21362.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3204, pruned_loss=0.08405, over 4266305.77 frames. ], batch size: 549, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:49:32,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=730164.0, ans=0.125 2023-06-21 06:49:32,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=730164.0, ans=0.125 2023-06-21 06:51:08,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.530e+02 2.908e+02 3.490e+02 5.232e+02, threshold=5.817e+02, percent-clipped=0.0 2023-06-21 06:51:14,941 INFO [train.py:996] (3/4) Epoch 4, batch 30250, loss[loss=0.219, simple_loss=0.287, pruned_loss=0.07553, over 21905.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3259, pruned_loss=0.08623, over 4258733.12 frames. ], batch size: 98, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:52:01,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=730464.0, ans=0.125 2023-06-21 06:53:27,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=730644.0, ans=0.125 2023-06-21 06:53:48,471 INFO [train.py:996] (3/4) Epoch 4, batch 30300, loss[loss=0.1952, simple_loss=0.2598, pruned_loss=0.06533, over 21113.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.323, pruned_loss=0.08584, over 4262198.89 frames. 
], batch size: 176, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:54:12,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=730704.0, ans=0.0 2023-06-21 06:54:18,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=730764.0, ans=0.2 2023-06-21 06:55:28,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=730824.0, ans=0.0 2023-06-21 06:55:34,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=730824.0, ans=0.0 2023-06-21 06:55:49,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=730884.0, ans=0.0 2023-06-21 06:55:59,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=730884.0, ans=0.125 2023-06-21 06:56:35,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.610e+02 3.012e+02 3.806e+02 5.738e+02, threshold=6.023e+02, percent-clipped=0.0 2023-06-21 06:56:45,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=731004.0, ans=0.0 2023-06-21 06:56:46,358 INFO [train.py:996] (3/4) Epoch 4, batch 30350, loss[loss=0.2257, simple_loss=0.3, pruned_loss=0.07572, over 21759.00 frames. ], tot_loss[loss=0.25, simple_loss=0.325, pruned_loss=0.08749, over 4269341.09 frames. ], batch size: 282, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:57:00,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=731004.0, ans=0.0 2023-06-21 06:57:26,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=731064.0, ans=0.125 2023-06-21 06:59:09,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=731184.0, ans=0.125 2023-06-21 06:59:19,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=731244.0, ans=0.125 2023-06-21 06:59:57,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=731244.0, ans=0.125 2023-06-21 07:00:07,932 INFO [train.py:996] (3/4) Epoch 4, batch 30400, loss[loss=0.2541, simple_loss=0.3063, pruned_loss=0.101, over 20028.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3174, pruned_loss=0.08579, over 4261176.10 frames. ], batch size: 703, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 07:00:59,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. 
limit=12.0 2023-06-21 07:03:22,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=731424.0, ans=0.0 2023-06-21 07:03:52,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=731484.0, ans=0.125 2023-06-21 07:05:08,746 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 3.449e+02 4.209e+02 5.452e+02 1.525e+03, threshold=8.417e+02, percent-clipped=19.0 2023-06-21 07:05:24,260 INFO [train.py:996] (3/4) Epoch 4, batch 30450, loss[loss=0.3204, simple_loss=0.4288, pruned_loss=0.106, over 19787.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3188, pruned_loss=0.08609, over 4202920.98 frames. ], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 07:06:14,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-21 07:06:16,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.85 vs. limit=10.0 2023-06-21 07:07:42,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=731664.0, ans=0.0 2023-06-21 07:12:03,773 INFO [train.py:996] (3/4) Epoch 5, batch 0, loss[loss=0.2581, simple_loss=0.3234, pruned_loss=0.0964, over 21717.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3234, pruned_loss=0.0964, over 21717.00 frames. ], batch size: 124, lr: 6.61e-03, grad_scale: 32.0 2023-06-21 07:12:03,774 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 07:12:43,344 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2379, simple_loss=0.3479, pruned_loss=0.06395, over 1796401.00 frames. 2023-06-21 07:12:43,349 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-21 07:13:00,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=731934.0, ans=0.1 2023-06-21 07:13:12,654 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.16 vs. limit=22.5 2023-06-21 07:14:15,602 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:15:06,032 INFO [train.py:996] (3/4) Epoch 5, batch 50, loss[loss=0.2914, simple_loss=0.3735, pruned_loss=0.1047, over 21643.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3298, pruned_loss=0.0857, over 963703.95 frames. ], batch size: 389, lr: 6.60e-03, grad_scale: 32.0 2023-06-21 07:15:13,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 3.133e+02 4.909e+02 7.707e+02 2.246e+03, threshold=9.818e+02, percent-clipped=21.0 2023-06-21 07:15:41,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=732234.0, ans=0.0 2023-06-21 07:16:32,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.86 vs. 
limit=15.0 2023-06-21 07:16:58,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=732414.0, ans=0.125 2023-06-21 07:17:09,034 INFO [train.py:996] (3/4) Epoch 5, batch 100, loss[loss=0.2539, simple_loss=0.3369, pruned_loss=0.08544, over 21475.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3402, pruned_loss=0.08737, over 1693366.14 frames. ], batch size: 131, lr: 6.60e-03, grad_scale: 32.0 2023-06-21 07:17:49,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=732534.0, ans=0.0 2023-06-21 07:19:24,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=732714.0, ans=0.0 2023-06-21 07:19:30,629 INFO [train.py:996] (3/4) Epoch 5, batch 150, loss[loss=0.2792, simple_loss=0.3643, pruned_loss=0.09709, over 21588.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3405, pruned_loss=0.08663, over 2261966.79 frames. ], batch size: 441, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:19:43,416 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.868e+02 2.471e+02 2.754e+02 3.178e+02 4.719e+02, threshold=5.509e+02, percent-clipped=0.0 2023-06-21 07:20:40,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=732894.0, ans=0.0 2023-06-21 07:22:07,533 INFO [train.py:996] (3/4) Epoch 5, batch 200, loss[loss=0.2567, simple_loss=0.3381, pruned_loss=0.0877, over 21597.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3345, pruned_loss=0.08456, over 2708664.69 frames. ], batch size: 414, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:22:29,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=733134.0, ans=0.2 2023-06-21 07:22:43,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-21 07:23:15,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=733194.0, ans=0.125 2023-06-21 07:23:17,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-21 07:23:48,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=733314.0, ans=0.1 2023-06-21 07:24:18,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=733314.0, ans=0.125 2023-06-21 07:24:18,207 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:24:31,200 INFO [train.py:996] (3/4) Epoch 5, batch 250, loss[loss=0.2354, simple_loss=0.327, pruned_loss=0.07185, over 21813.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3304, pruned_loss=0.08514, over 3052226.01 frames. 
], batch size: 332, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:24:34,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.530e+02 2.881e+02 3.593e+02 5.629e+02, threshold=5.761e+02, percent-clipped=1.0 2023-06-21 07:25:50,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=733494.0, ans=0.0 2023-06-21 07:25:52,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=733494.0, ans=0.2 2023-06-21 07:26:14,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=733554.0, ans=0.1 2023-06-21 07:26:45,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=733614.0, ans=0.125 2023-06-21 07:27:07,460 INFO [train.py:996] (3/4) Epoch 5, batch 300, loss[loss=0.248, simple_loss=0.3635, pruned_loss=0.06618, over 19883.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3243, pruned_loss=0.0842, over 3310607.76 frames. ], batch size: 703, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:28:12,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0 2023-06-21 07:28:17,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=733794.0, ans=0.125 2023-06-21 07:29:39,575 INFO [train.py:996] (3/4) Epoch 5, batch 350, loss[loss=0.1977, simple_loss=0.2583, pruned_loss=0.06854, over 21195.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3146, pruned_loss=0.08216, over 3523643.62 frames. ], batch size: 176, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:29:51,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.569e+02 2.912e+02 3.547e+02 5.180e+02, threshold=5.824e+02, percent-clipped=0.0 2023-06-21 07:30:54,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=734094.0, ans=0.07 2023-06-21 07:30:56,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=8.0 2023-06-21 07:31:50,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=734214.0, ans=0.0 2023-06-21 07:31:56,037 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:32:07,380 INFO [train.py:996] (3/4) Epoch 5, batch 400, loss[loss=0.2095, simple_loss=0.2704, pruned_loss=0.07432, over 21681.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3099, pruned_loss=0.08081, over 3690262.74 frames. ], batch size: 282, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:32:43,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=734334.0, ans=0.2 2023-06-21 07:33:27,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=734394.0, ans=0.125 2023-06-21 07:34:39,339 INFO [train.py:996] (3/4) Epoch 5, batch 450, loss[loss=0.1804, simple_loss=0.2481, pruned_loss=0.05638, over 21250.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3107, pruned_loss=0.08126, over 3828235.87 frames. 
], batch size: 144, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:34:47,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.593e+02 3.148e+02 3.879e+02 6.028e+02, threshold=6.296e+02, percent-clipped=1.0 2023-06-21 07:36:07,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-21 07:36:28,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=734754.0, ans=0.1 2023-06-21 07:36:45,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=734814.0, ans=0.2 2023-06-21 07:37:22,098 INFO [train.py:996] (3/4) Epoch 5, batch 500, loss[loss=0.22, simple_loss=0.2791, pruned_loss=0.08049, over 21708.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3083, pruned_loss=0.08091, over 3930797.71 frames. ], batch size: 112, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:37:32,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=734874.0, ans=0.125 2023-06-21 07:37:48,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=734934.0, ans=0.125 2023-06-21 07:39:34,428 INFO [train.py:996] (3/4) Epoch 5, batch 550, loss[loss=0.2434, simple_loss=0.3519, pruned_loss=0.06739, over 21209.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3101, pruned_loss=0.08035, over 4002667.15 frames. ], batch size: 548, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:39:56,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.656e+02 3.238e+02 3.999e+02 7.986e+02, threshold=6.476e+02, percent-clipped=2.0 2023-06-21 07:40:09,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-21 07:40:12,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=735234.0, ans=0.1 2023-06-21 07:41:13,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-21 07:41:23,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=735354.0, ans=0.125 2023-06-21 07:42:15,952 INFO [train.py:996] (3/4) Epoch 5, batch 600, loss[loss=0.2258, simple_loss=0.2888, pruned_loss=0.08139, over 21746.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3154, pruned_loss=0.08175, over 4056979.49 frames. ], batch size: 371, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:42:16,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=735474.0, ans=0.1 2023-06-21 07:42:40,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.34 vs. 
limit=15.0 2023-06-21 07:43:13,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=735594.0, ans=0.09899494936611666 2023-06-21 07:44:10,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=735714.0, ans=0.2 2023-06-21 07:44:38,930 INFO [train.py:996] (3/4) Epoch 5, batch 650, loss[loss=0.2595, simple_loss=0.3236, pruned_loss=0.0977, over 21844.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3181, pruned_loss=0.08172, over 4107007.12 frames. ], batch size: 371, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:44:41,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=735774.0, ans=0.125 2023-06-21 07:44:43,360 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 2.568e+02 2.858e+02 3.474e+02 5.611e+02, threshold=5.715e+02, percent-clipped=0.0 2023-06-21 07:45:04,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=735774.0, ans=0.125 2023-06-21 07:45:09,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=735834.0, ans=0.125 2023-06-21 07:45:51,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=735894.0, ans=0.125 2023-06-21 07:46:24,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=735954.0, ans=0.125 2023-06-21 07:46:46,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=736014.0, ans=0.0 2023-06-21 07:47:07,523 INFO [train.py:996] (3/4) Epoch 5, batch 700, loss[loss=0.2132, simple_loss=0.275, pruned_loss=0.07569, over 21676.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3184, pruned_loss=0.08195, over 4150059.09 frames. ], batch size: 230, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:47:45,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=736134.0, ans=0.07 2023-06-21 07:48:02,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=736194.0, ans=0.125 2023-06-21 07:48:50,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=736254.0, ans=0.0 2023-06-21 07:48:52,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=736314.0, ans=0.0 2023-06-21 07:49:07,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=736314.0, ans=0.2 2023-06-21 07:49:41,162 INFO [train.py:996] (3/4) Epoch 5, batch 750, loss[loss=0.2743, simple_loss=0.3244, pruned_loss=0.1121, over 21822.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3168, pruned_loss=0.08246, over 4170190.04 frames. 
], batch size: 508, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:49:43,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.823e+02 3.263e+02 3.934e+02 5.736e+02, threshold=6.525e+02, percent-clipped=1.0 2023-06-21 07:49:58,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=736374.0, ans=0.125 2023-06-21 07:51:22,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=736554.0, ans=0.1 2023-06-21 07:52:01,041 INFO [train.py:996] (3/4) Epoch 5, batch 800, loss[loss=0.2267, simple_loss=0.2927, pruned_loss=0.08033, over 21863.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3116, pruned_loss=0.08194, over 4198831.07 frames. ], batch size: 351, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:52:56,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=736794.0, ans=0.125 2023-06-21 07:53:31,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=736854.0, ans=0.125 2023-06-21 07:54:08,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=736914.0, ans=0.125 2023-06-21 07:54:31,376 INFO [train.py:996] (3/4) Epoch 5, batch 850, loss[loss=0.2825, simple_loss=0.3266, pruned_loss=0.1192, over 21646.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3102, pruned_loss=0.08266, over 4219018.86 frames. ], batch size: 508, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:54:34,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.467e+02 2.771e+02 3.268e+02 5.744e+02, threshold=5.542e+02, percent-clipped=0.0 2023-06-21 07:54:52,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=737034.0, ans=0.0 2023-06-21 07:55:39,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-21 07:55:40,870 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:55:40,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=737094.0, ans=0.0 2023-06-21 07:56:41,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=737214.0, ans=0.125 2023-06-21 07:56:54,548 INFO [train.py:996] (3/4) Epoch 5, batch 900, loss[loss=0.2282, simple_loss=0.2984, pruned_loss=0.07899, over 21111.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3091, pruned_loss=0.08267, over 4229533.54 frames. ], batch size: 143, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:57:24,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.09 vs. 
limit=22.5 2023-06-21 07:57:28,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=737334.0, ans=0.0 2023-06-21 07:58:30,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=737454.0, ans=0.125 2023-06-21 07:59:28,523 INFO [train.py:996] (3/4) Epoch 5, batch 950, loss[loss=0.2307, simple_loss=0.2973, pruned_loss=0.08204, over 21826.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3057, pruned_loss=0.08165, over 4247066.56 frames. ], batch size: 282, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:59:31,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.538e+02 2.883e+02 3.307e+02 5.189e+02, threshold=5.766e+02, percent-clipped=0.0 2023-06-21 08:00:30,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-21 08:01:26,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=737754.0, ans=0.125 2023-06-21 08:02:00,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=737814.0, ans=0.0 2023-06-21 08:02:02,911 INFO [train.py:996] (3/4) Epoch 5, batch 1000, loss[loss=0.2371, simple_loss=0.3035, pruned_loss=0.08534, over 21313.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3055, pruned_loss=0.08222, over 4259276.25 frames. ], batch size: 143, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 08:02:08,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-06-21 08:03:01,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=737994.0, ans=0.125 2023-06-21 08:04:05,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=738114.0, ans=0.2 2023-06-21 08:04:24,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=738174.0, ans=0.125 2023-06-21 08:04:25,212 INFO [train.py:996] (3/4) Epoch 5, batch 1050, loss[loss=0.2299, simple_loss=0.3089, pruned_loss=0.07548, over 21531.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3055, pruned_loss=0.08214, over 4264969.82 frames. ], batch size: 471, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 08:04:28,166 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.450e+02 2.796e+02 3.213e+02 4.581e+02, threshold=5.591e+02, percent-clipped=0.0 2023-06-21 08:04:44,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=738174.0, ans=0.0 2023-06-21 08:05:23,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=738294.0, ans=0.125 2023-06-21 08:06:21,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-21 08:07:09,344 INFO [train.py:996] (3/4) Epoch 5, batch 1100, loss[loss=0.2361, simple_loss=0.3106, pruned_loss=0.08082, over 21265.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3061, pruned_loss=0.08186, over 4273485.09 frames. 
], batch size: 176, lr: 6.58e-03, grad_scale: 16.0 2023-06-21 08:09:11,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=738714.0, ans=0.0 2023-06-21 08:09:28,113 INFO [train.py:996] (3/4) Epoch 5, batch 1150, loss[loss=0.2242, simple_loss=0.3195, pruned_loss=0.06448, over 21785.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3074, pruned_loss=0.08168, over 4277463.04 frames. ], batch size: 351, lr: 6.57e-03, grad_scale: 16.0 2023-06-21 08:09:37,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.476e+02 2.814e+02 3.322e+02 5.569e+02, threshold=5.628e+02, percent-clipped=0.0 2023-06-21 08:10:31,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=738834.0, ans=0.0 2023-06-21 08:11:37,791 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:12:23,030 INFO [train.py:996] (3/4) Epoch 5, batch 1200, loss[loss=0.2169, simple_loss=0.3042, pruned_loss=0.06475, over 21765.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3101, pruned_loss=0.082, over 4284866.83 frames. ], batch size: 282, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:13:29,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=739194.0, ans=0.1 2023-06-21 08:13:43,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=15.0 2023-06-21 08:14:21,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=739314.0, ans=0.2 2023-06-21 08:14:40,383 INFO [train.py:996] (3/4) Epoch 5, batch 1250, loss[loss=0.2486, simple_loss=0.323, pruned_loss=0.08712, over 21877.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3112, pruned_loss=0.08207, over 4284067.01 frames. ], batch size: 118, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:14:43,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=739374.0, ans=0.125 2023-06-21 08:14:48,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.637e+02 3.101e+02 3.888e+02 6.560e+02, threshold=6.202e+02, percent-clipped=3.0 2023-06-21 08:15:41,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=739434.0, ans=0.125 2023-06-21 08:15:42,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=739494.0, ans=0.125 2023-06-21 08:16:02,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=739554.0, ans=0.0 2023-06-21 08:17:07,393 INFO [train.py:996] (3/4) Epoch 5, batch 1300, loss[loss=0.2348, simple_loss=0.3033, pruned_loss=0.08314, over 21288.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3121, pruned_loss=0.08267, over 4278509.64 frames. 
], batch size: 176, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:17:13,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=739674.0, ans=0.125 2023-06-21 08:17:59,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=739734.0, ans=0.0 2023-06-21 08:18:08,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=739734.0, ans=0.125 2023-06-21 08:18:11,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=739794.0, ans=0.125 2023-06-21 08:18:15,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=739794.0, ans=0.125 2023-06-21 08:18:23,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=739794.0, ans=0.125 2023-06-21 08:19:33,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=739914.0, ans=0.07 2023-06-21 08:19:41,999 INFO [train.py:996] (3/4) Epoch 5, batch 1350, loss[loss=0.3322, simple_loss=0.3817, pruned_loss=0.1413, over 21323.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3138, pruned_loss=0.0835, over 4288827.78 frames. ], batch size: 507, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:19:51,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.635e+02 2.966e+02 3.709e+02 5.719e+02, threshold=5.932e+02, percent-clipped=0.0 2023-06-21 08:19:55,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=739974.0, ans=0.125 2023-06-21 08:20:11,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=739974.0, ans=0.125 2023-06-21 08:20:37,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=740094.0, ans=0.125 2023-06-21 08:20:45,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=740094.0, ans=0.125 2023-06-21 08:22:02,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-21 08:22:06,201 INFO [train.py:996] (3/4) Epoch 5, batch 1400, loss[loss=0.2313, simple_loss=0.2974, pruned_loss=0.08258, over 21352.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3101, pruned_loss=0.08278, over 4287660.39 frames. ], batch size: 159, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:22:18,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-21 08:23:30,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. limit=10.0 2023-06-21 08:24:28,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=740574.0, ans=0.125 2023-06-21 08:24:29,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.73 vs. 
2023-06-21 08:24:29,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=12.0
2023-06-21 08:24:29,371 INFO [train.py:996] (3/4) Epoch 5, batch 1450, loss[loss=0.2674, simple_loss=0.3408, pruned_loss=0.09702, over 21377.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.31, pruned_loss=0.08329, over 4288430.20 frames. ], batch size: 548, lr: 6.57e-03, grad_scale: 32.0
2023-06-21 08:24:40,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.440e+02 2.892e+02 3.416e+02 5.937e+02, threshold=5.784e+02, percent-clipped=1.0
2023-06-21 08:24:41,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=740574.0, ans=0.1
2023-06-21 08:25:39,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=740694.0, ans=0.0
2023-06-21 08:26:55,889 INFO [train.py:996] (3/4) Epoch 5, batch 1500, loss[loss=0.2377, simple_loss=0.2995, pruned_loss=0.08794, over 21328.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.312, pruned_loss=0.08441, over 4285148.69 frames. ], batch size: 176, lr: 6.57e-03, grad_scale: 16.0
2023-06-21 08:27:32,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=740934.0, ans=0.5
2023-06-21 08:27:34,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0
2023-06-21 08:28:30,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=741054.0, ans=0.1
2023-06-21 08:28:36,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=741054.0, ans=10.0
2023-06-21 08:28:57,669 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0
2023-06-21 08:29:22,635 INFO [train.py:996] (3/4) Epoch 5, batch 1550, loss[loss=0.1408, simple_loss=0.2037, pruned_loss=0.03897, over 16917.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3103, pruned_loss=0.08401, over 4287687.70 frames. ], batch size: 61, lr: 6.56e-03, grad_scale: 16.0
2023-06-21 08:29:33,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=741174.0, ans=0.5
2023-06-21 08:29:34,020 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.613e+02 3.171e+02 3.955e+02 6.837e+02, threshold=6.341e+02, percent-clipped=1.0
2023-06-21 08:30:04,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=741234.0, ans=0.5
2023-06-21 08:31:14,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=741354.0, ans=0.1
2023-06-21 08:32:00,146 INFO [train.py:996] (3/4) Epoch 5, batch 1600, loss[loss=0.1994, simple_loss=0.2917, pruned_loss=0.05362, over 19838.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3069, pruned_loss=0.08165, over 4273162.88 frames. ], batch size: 702, lr: 6.56e-03, grad_scale: 32.0
2023-06-21 08:32:18,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0
2023-06-21 08:33:25,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=741594.0, ans=0.125
2023-06-21 08:33:36,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0
2023-06-21 08:34:26,735 INFO [train.py:996] (3/4) Epoch 5, batch 1650, loss[loss=0.2666, simple_loss=0.3356, pruned_loss=0.09884, over 21627.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3053, pruned_loss=0.08068, over 4272560.64 frames. ], batch size: 389, lr: 6.56e-03, grad_scale: 16.0
2023-06-21 08:34:38,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=741774.0, ans=0.125
2023-06-21 08:34:39,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.504e+02 2.926e+02 3.545e+02 5.904e+02, threshold=5.852e+02, percent-clipped=0.0
2023-06-21 08:35:50,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=741954.0, ans=0.04949747468305833
2023-06-21 08:36:34,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0
2023-06-21 08:37:08,135 INFO [train.py:996] (3/4) Epoch 5, batch 1700, loss[loss=0.3118, simple_loss=0.3892, pruned_loss=0.1172, over 21521.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3085, pruned_loss=0.08204, over 4273810.76 frames. ], batch size: 471, lr: 6.56e-03, grad_scale: 16.0
2023-06-21 08:37:25,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5
2023-06-21 08:38:04,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0
2023-06-21 08:39:07,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0
2023-06-21 08:39:25,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=742314.0, ans=0.125
2023-06-21 08:39:33,378 INFO [train.py:996] (3/4) Epoch 5, batch 1750, loss[loss=0.2052, simple_loss=0.2724, pruned_loss=0.06898, over 21646.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3097, pruned_loss=0.08183, over 4270663.26 frames. ], batch size: 263, lr: 6.56e-03, grad_scale: 16.0
2023-06-21 08:39:51,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.601e+02 3.019e+02 3.659e+02 6.555e+02, threshold=6.038e+02, percent-clipped=1.0
2023-06-21 08:40:03,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=742374.0, ans=0.125
2023-06-21 08:41:16,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=742554.0, ans=0.1
2023-06-21 08:41:48,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=22.5
2023-06-21 08:42:02,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=742614.0, ans=0.2
2023-06-21 08:42:27,570 INFO [train.py:996] (3/4) Epoch 5, batch 1800, loss[loss=0.294, simple_loss=0.3764, pruned_loss=0.1057, over 21461.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3091, pruned_loss=0.08019, over 4275804.75 frames. ], batch size: 507, lr: 6.56e-03, grad_scale: 16.0
2023-06-21 08:42:42,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=12.0
2023-06-21 08:42:44,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0
2023-06-21 08:42:46,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=742674.0, ans=0.125
2023-06-21 08:43:39,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=742794.0, ans=0.125
2023-06-21 08:44:22,365 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 08:44:52,833 INFO [train.py:996] (3/4) Epoch 5, batch 1850, loss[loss=0.2252, simple_loss=0.2987, pruned_loss=0.0758, over 16206.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3106, pruned_loss=0.07838, over 4269619.24 frames. ], batch size: 60, lr: 6.56e-03, grad_scale: 16.0
2023-06-21 08:45:21,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.680e+02 2.344e+02 2.714e+02 3.170e+02 5.790e+02, threshold=5.429e+02, percent-clipped=0.0
2023-06-21 08:45:33,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=743034.0, ans=0.0
2023-06-21 08:47:07,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=743214.0, ans=0.1
2023-06-21 08:47:36,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=743214.0, ans=0.0
2023-06-21 08:47:40,692 INFO [train.py:996] (3/4) Epoch 5, batch 1900, loss[loss=0.2239, simple_loss=0.3031, pruned_loss=0.07236, over 21630.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3107, pruned_loss=0.07863, over 4278117.78 frames. ], batch size: 263, lr: 6.56e-03, grad_scale: 16.0
2023-06-21 08:47:55,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=743334.0, ans=0.04949747468305833
2023-06-21 08:47:55,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=743334.0, ans=0.2
2023-06-21 08:49:32,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.24 vs. limit=22.5
2023-06-21 08:49:34,245 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0
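[Note: in the optim.py:471 records, the five numbers are quantiles (min, 25%, median, 75%, max) of recently observed gradient norms, and the clipping threshold is Clipping_scale times the median, e.g. 2.0 * 2.714e+02 = 5.429e+02 just above; percent-clipped is the share of recent batches whose norm exceeded that threshold. A sketch of that rule, assuming a rolling history of norms:]

    import torch

    def clip_stats(norm_history: list[float], clipping_scale: float = 2.0):
        h = torch.tensor(norm_history)
        q = torch.quantile(h, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]       # 2.0 * median, as in the log
        percent_clipped = 100.0 * (h > threshold).float().mean()
        return q, threshold, percent_clipped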
2023-06-21 08:49:48,612 INFO [train.py:996] (3/4) Epoch 5, batch 1950, loss[loss=0.2244, simple_loss=0.2748, pruned_loss=0.08699, over 21521.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3066, pruned_loss=0.07809, over 4279563.12 frames. ], batch size: 441, lr: 6.55e-03, grad_scale: 16.0
2023-06-21 08:49:51,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=743574.0, ans=0.125
2023-06-21 08:49:51,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=743574.0, ans=0.1
2023-06-21 08:50:13,868 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.637e+02 3.082e+02 3.738e+02 5.890e+02, threshold=6.165e+02, percent-clipped=2.0
2023-06-21 08:50:19,649 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0
2023-06-21 08:51:27,539 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0
2023-06-21 08:52:28,575 INFO [train.py:996] (3/4) Epoch 5, batch 2000, loss[loss=0.1856, simple_loss=0.2607, pruned_loss=0.05523, over 21586.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3008, pruned_loss=0.07663, over 4274053.65 frames. ], batch size: 230, lr: 6.55e-03, grad_scale: 32.0
2023-06-21 08:53:14,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=743994.0, ans=0.2
2023-06-21 08:53:15,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=743994.0, ans=0.0
2023-06-21 08:53:15,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=743994.0, ans=0.0
2023-06-21 08:53:24,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=743994.0, ans=0.0
2023-06-21 08:53:39,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=743994.0, ans=0.125
2023-06-21 08:54:14,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5
2023-06-21 08:54:20,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5
2023-06-21 08:54:42,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=744114.0, ans=0.2
2023-06-21 08:54:44,410 INFO [train.py:996] (3/4) Epoch 5, batch 2050, loss[loss=0.1716, simple_loss=0.2549, pruned_loss=0.04418, over 21392.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.303, pruned_loss=0.07719, over 4268241.54 frames. ], batch size: 194, lr: 6.55e-03, grad_scale: 32.0
2023-06-21 08:55:04,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.606e+02 2.968e+02 3.488e+02 6.810e+02, threshold=5.937e+02, percent-clipped=2.0
2023-06-21 08:55:08,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=744174.0, ans=0.125
2023-06-21 08:56:39,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=744354.0, ans=0.1
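[Note: the grad_scale field in the train.py:996 records is the dynamic loss-scaling factor used for fp16 training; it halves when a scaled gradient overflows and periodically doubles while gradients stay finite, which is why it moves between values like 16.0 and 32.0. The behaviour is analogous (assumed here for illustration, not the recipe's own scaler) to PyTorch's GradScaler:]

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=16.0,      # comparable to the logged grad_scale values
        growth_factor=2.0,    # double after growth_interval clean steps
        backoff_factor=0.5,   # halve on overflow
        growth_interval=2000,
    )
    # usage: scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update()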
2023-06-21 08:57:04,039 INFO [train.py:996] (3/4) Epoch 5, batch 2100, loss[loss=0.3041, simple_loss=0.3706, pruned_loss=0.1188, over 21737.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3073, pruned_loss=0.07991, over 4274990.73 frames. ], batch size: 441, lr: 6.55e-03, grad_scale: 32.0
2023-06-21 08:57:21,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=744474.0, ans=0.2
2023-06-21 08:57:51,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.86 vs. limit=22.5
2023-06-21 08:58:02,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5
2023-06-21 08:58:38,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=744654.0, ans=0.125
2023-06-21 08:59:34,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=744774.0, ans=0.2
2023-06-21 08:59:35,519 INFO [train.py:996] (3/4) Epoch 5, batch 2150, loss[loss=0.2724, simple_loss=0.3196, pruned_loss=0.1126, over 21305.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3101, pruned_loss=0.0804, over 4273904.37 frames. ], batch size: 471, lr: 6.55e-03, grad_scale: 32.0
2023-06-21 09:00:06,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.554e+02 3.162e+02 3.775e+02 6.322e+02, threshold=6.325e+02, percent-clipped=1.0
2023-06-21 09:00:12,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=744834.0, ans=0.1
2023-06-21 09:00:23,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=744834.0, ans=0.0
2023-06-21 09:00:28,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=744834.0, ans=0.125
2023-06-21 09:01:01,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=744894.0, ans=0.125
2023-06-21 09:01:18,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=744954.0, ans=0.125
2023-06-21 09:01:21,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=744954.0, ans=0.0
2023-06-21 09:01:56,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=745014.0, ans=0.1
2023-06-21 09:02:04,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0
2023-06-21 09:02:08,215 INFO [train.py:996] (3/4) Epoch 5, batch 2200, loss[loss=0.299, simple_loss=0.376, pruned_loss=0.111, over 21530.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3128, pruned_loss=0.08057, over 4265598.56 frames. ], batch size: 471, lr: 6.55e-03, grad_scale: 32.0
2023-06-21 09:02:59,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=745134.0, ans=0.2
2023-06-21 09:03:08,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=745194.0, ans=0.1
2023-06-21 09:04:08,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=745254.0, ans=0.125
2023-06-21 09:04:38,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=745374.0, ans=0.02
2023-06-21 09:04:39,505 INFO [train.py:996] (3/4) Epoch 5, batch 2250, loss[loss=0.1903, simple_loss=0.2706, pruned_loss=0.05498, over 21371.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3093, pruned_loss=0.07836, over 4262408.37 frames. ], batch size: 211, lr: 6.55e-03, grad_scale: 32.0
2023-06-21 09:04:48,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.467e+02 2.789e+02 3.250e+02 6.214e+02, threshold=5.578e+02, percent-clipped=0.0
2023-06-21 09:05:00,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=745374.0, ans=0.1
2023-06-21 09:05:04,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.13 vs. limit=10.0
2023-06-21 09:05:35,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=745434.0, ans=0.1
2023-06-21 09:05:55,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=745494.0, ans=0.07
2023-06-21 09:06:21,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=745554.0, ans=0.2
2023-06-21 09:06:49,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0
2023-06-21 09:06:49,373 INFO [train.py:996] (3/4) Epoch 5, batch 2300, loss[loss=0.2226, simple_loss=0.2706, pruned_loss=0.08731, over 21210.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3013, pruned_loss=0.077, over 4264301.27 frames. ], batch size: 471, lr: 6.54e-03, grad_scale: 32.0
2023-06-21 09:06:52,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=745674.0, ans=0.125
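[Note: the scaling.py:182 ScheduledFloat records print regularization hyper-parameters (dropout rates, skip probabilities, balancer bounds) that are functions of batch_count rather than constants. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints with clamping at the ends:]

    def scheduled_float(batch_count: float,
                        schedule: list[tuple[float, float]]) -> float:
        pts = sorted(schedule)
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        return pts[-1][1]

    # hypothetical breakpoints: a skip-rate decaying 0.5 -> 0.0 over 20k batches,
    # so far past warm-up (batch_count ~ 7.4e5 here) it reads 0.0:
    assert scheduled_float(745674.0, [(0.0, 0.5), (20000.0, 0.0)]) == 0.0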
2023-06-21 09:09:04,699 INFO [train.py:996] (3/4) Epoch 5, batch 2350, loss[loss=0.2331, simple_loss=0.301, pruned_loss=0.08258, over 21480.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2985, pruned_loss=0.07803, over 4269712.36 frames. ], batch size: 389, lr: 6.54e-03, grad_scale: 32.0
2023-06-21 09:09:18,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.648e+02 3.067e+02 3.920e+02 6.116e+02, threshold=6.134e+02, percent-clipped=2.0
2023-06-21 09:09:39,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=745974.0, ans=0.1
2023-06-21 09:09:41,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=746034.0, ans=0.0
2023-06-21 09:10:18,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=746094.0, ans=22.5
2023-06-21 09:10:21,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=746094.0, ans=0.125
2023-06-21 09:10:26,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=746094.0, ans=0.125
2023-06-21 09:11:46,923 INFO [train.py:996] (3/4) Epoch 5, batch 2400, loss[loss=0.2138, simple_loss=0.3202, pruned_loss=0.05369, over 19694.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3024, pruned_loss=0.08071, over 4269743.48 frames. ], batch size: 702, lr: 6.54e-03, grad_scale: 32.0
2023-06-21 09:12:31,150 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 09:13:48,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=746514.0, ans=0.125
2023-06-21 09:14:12,800 INFO [train.py:996] (3/4) Epoch 5, batch 2450, loss[loss=0.2399, simple_loss=0.3027, pruned_loss=0.08852, over 21578.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3095, pruned_loss=0.08365, over 4277780.08 frames. ], batch size: 441, lr: 6.54e-03, grad_scale: 32.0
2023-06-21 09:14:38,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.774e+02 3.114e+02 3.672e+02 6.323e+02, threshold=6.229e+02, percent-clipped=1.0
2023-06-21 09:14:42,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=15.0
2023-06-21 09:15:08,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=746634.0, ans=0.0
2023-06-21 09:15:26,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=746694.0, ans=0.125
2023-06-21 09:15:28,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=746694.0, ans=0.1
2023-06-21 09:15:47,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=746754.0, ans=10.0
2023-06-21 09:16:24,635 INFO [train.py:996] (3/4) Epoch 5, batch 2500, loss[loss=0.23, simple_loss=0.3015, pruned_loss=0.07924, over 21861.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3095, pruned_loss=0.0836, over 4275921.09 frames. ], batch size: 107, lr: 6.54e-03, grad_scale: 16.0
2023-06-21 09:16:30,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=746874.0, ans=0.125
2023-06-21 09:16:53,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=746934.0, ans=0.1
2023-06-21 09:17:21,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=746994.0, ans=0.5
2023-06-21 09:17:40,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=747054.0, ans=0.2
2023-06-21 09:18:07,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=747114.0, ans=0.2
2023-06-21 09:18:42,276 INFO [train.py:996] (3/4) Epoch 5, batch 2550, loss[loss=0.2162, simple_loss=0.3141, pruned_loss=0.05915, over 21651.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3065, pruned_loss=0.08237, over 4279315.33 frames. ], batch size: 298, lr: 6.54e-03, grad_scale: 16.0
2023-06-21 09:18:50,716 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.510e+02 2.866e+02 3.285e+02 4.415e+02, threshold=5.731e+02, percent-clipped=0.0
2023-06-21 09:19:24,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=747234.0, ans=0.2
2023-06-21 09:19:55,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0
2023-06-21 09:20:05,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=747354.0, ans=0.125
2023-06-21 09:20:59,461 INFO [train.py:996] (3/4) Epoch 5, batch 2600, loss[loss=0.2561, simple_loss=0.3277, pruned_loss=0.09224, over 21893.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3064, pruned_loss=0.08325, over 4280992.34 frames. ], batch size: 372, lr: 6.54e-03, grad_scale: 16.0
2023-06-21 09:21:17,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=747474.0, ans=0.125
2023-06-21 09:21:54,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=747534.0, ans=0.125
2023-06-21 09:22:09,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=747594.0, ans=0.125
2023-06-21 09:23:26,451 INFO [train.py:996] (3/4) Epoch 5, batch 2650, loss[loss=0.2445, simple_loss=0.3247, pruned_loss=0.08216, over 21075.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3099, pruned_loss=0.08478, over 4277579.34 frames. ], batch size: 607, lr: 6.54e-03, grad_scale: 16.0
2023-06-21 09:23:35,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.828e+02 3.188e+02 4.094e+02 7.867e+02, threshold=6.375e+02, percent-clipped=3.0
2023-06-21 09:24:48,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=747894.0, ans=0.0
2023-06-21 09:24:53,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0
2023-06-21 09:25:28,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=748014.0, ans=0.1
2023-06-21 09:25:52,126 INFO [train.py:996] (3/4) Epoch 5, batch 2700, loss[loss=0.2033, simple_loss=0.2741, pruned_loss=0.06623, over 21753.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3089, pruned_loss=0.0845, over 4278320.01 frames. ], batch size: 282, lr: 6.53e-03, grad_scale: 16.0
2023-06-21 09:26:36,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=748134.0, ans=0.125
2023-06-21 09:26:59,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=748194.0, ans=0.0
2023-06-21 09:27:11,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=748194.0, ans=0.125
2023-06-21 09:28:06,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=748314.0, ans=0.125
2023-06-21 09:28:18,191 INFO [train.py:996] (3/4) Epoch 5, batch 2750, loss[loss=0.2122, simple_loss=0.2777, pruned_loss=0.07332, over 21575.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3068, pruned_loss=0.08357, over 4284814.35 frames. ], batch size: 263, lr: 6.53e-03, grad_scale: 16.0
2023-06-21 09:28:33,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.495e+02 2.844e+02 3.275e+02 5.915e+02, threshold=5.688e+02, percent-clipped=0.0
2023-06-21 09:29:40,951 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 09:30:15,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748614.0, ans=0.1
2023-06-21 09:31:01,730 INFO [train.py:996] (3/4) Epoch 5, batch 2800, loss[loss=0.2593, simple_loss=0.325, pruned_loss=0.09674, over 21312.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3124, pruned_loss=0.08505, over 4284666.57 frames. ], batch size: 176, lr: 6.53e-03, grad_scale: 32.0
2023-06-21 09:31:05,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=748674.0, ans=0.0
2023-06-21 09:32:28,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=748854.0, ans=0.0
2023-06-21 09:33:40,819 INFO [train.py:996] (3/4) Epoch 5, batch 2850, loss[loss=0.298, simple_loss=0.3565, pruned_loss=0.1197, over 21391.00 frames. ], tot_loss[loss=0.243, simple_loss=0.314, pruned_loss=0.08598, over 4280258.69 frames. ], batch size: 549, lr: 6.53e-03, grad_scale: 32.0
2023-06-21 09:33:50,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=748974.0, ans=0.04949747468305833
2023-06-21 09:34:00,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.971e+02 3.664e+02 4.232e+02 8.442e+02, threshold=7.329e+02, percent-clipped=6.0
2023-06-21 09:34:59,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0
2023-06-21 09:35:31,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=749214.0, ans=0.125
2023-06-21 09:36:07,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=749214.0, ans=0.1
2023-06-21 09:36:12,164 INFO [train.py:996] (3/4) Epoch 5, batch 2900, loss[loss=0.2505, simple_loss=0.3076, pruned_loss=0.09675, over 21897.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3088, pruned_loss=0.08403, over 4274832.35 frames. ], batch size: 414, lr: 6.53e-03, grad_scale: 32.0
2023-06-21 09:36:25,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=749274.0, ans=0.0
2023-06-21 09:36:32,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=749334.0, ans=0.1
2023-06-21 09:37:08,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=749394.0, ans=0.0
2023-06-21 09:37:10,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=749394.0, ans=10.0
2023-06-21 09:37:15,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=15.0
2023-06-21 09:37:45,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0
2023-06-21 09:37:46,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=749454.0, ans=0.125
2023-06-21 09:38:29,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=749514.0, ans=0.125
2023-06-21 09:38:29,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=749514.0, ans=0.125
2023-06-21 09:38:34,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0
2023-06-21 09:38:34,557 INFO [train.py:996] (3/4) Epoch 5, batch 2950, loss[loss=0.2309, simple_loss=0.3164, pruned_loss=0.07272, over 21653.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3087, pruned_loss=0.08314, over 4283163.75 frames. ], batch size: 263, lr: 6.53e-03, grad_scale: 32.0
2023-06-21 09:38:48,673 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.603e+02 2.918e+02 3.396e+02 5.731e+02, threshold=5.836e+02, percent-clipped=0.0
2023-06-21 09:39:06,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=749634.0, ans=0.125
2023-06-21 09:39:36,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=749694.0, ans=0.125
2023-06-21 09:41:10,940 INFO [train.py:996] (3/4) Epoch 5, batch 3000, loss[loss=0.2388, simple_loss=0.3015, pruned_loss=0.08808, over 21630.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3137, pruned_loss=0.08419, over 4284222.81 frames. ], batch size: 263, lr: 6.53e-03, grad_scale: 32.0
2023-06-21 09:41:10,940 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 09:42:09,553 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2543, simple_loss=0.346, pruned_loss=0.08133, over 1796401.00 frames.
2023-06-21 09:42:09,560 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-21 09:42:10,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=749874.0, ans=0.125
2023-06-21 09:42:10,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=749874.0, ans=0.025
2023-06-21 09:43:20,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=749994.0, ans=0.5
2023-06-21 09:43:20,617 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0
2023-06-21 09:43:27,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0
2023-06-21 09:43:30,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=750054.0, ans=0.0
2023-06-21 09:43:39,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.19 vs. limit=8.0
2023-06-21 09:44:23,329 INFO [train.py:996] (3/4) Epoch 5, batch 3050, loss[loss=0.2426, simple_loss=0.3267, pruned_loss=0.07922, over 21462.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3142, pruned_loss=0.08253, over 4291378.68 frames. ], batch size: 548, lr: 6.52e-03, grad_scale: 32.0
2023-06-21 09:44:24,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0
2023-06-21 09:44:43,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.527e+02 2.843e+02 3.371e+02 5.319e+02, threshold=5.686e+02, percent-clipped=0.0
2023-06-21 09:44:45,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=750174.0, ans=0.125
2023-06-21 09:45:00,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0
2023-06-21 09:46:19,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=750354.0, ans=0.0
2023-06-21 09:46:41,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=750414.0, ans=0.125
2023-06-21 09:46:43,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0
2023-06-21 09:46:44,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=750414.0, ans=0.0
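[Note: the train.py:1019/1028/1029 records above show the periodic validation pass: training pauses at a fixed batch interval, loss is averaged over the fixed dev set (hence the constant 1796401.00 frames), and peak CUDA memory is reported. A structural sketch, with model, valid_dl and compute_loss as hypothetical stand-ins:]

    import torch

    def run_validation(model, valid_dl, compute_loss, device):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_dl:
                loss, num_frames = compute_loss(model, batch)
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        # e.g. "Maximum memory allocated so far is 23918MB"
        mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        return tot_loss / tot_frames, mem_mb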
2023-06-21 09:46:48,370 INFO [train.py:996] (3/4) Epoch 5, batch 3100, loss[loss=0.2093, simple_loss=0.2842, pruned_loss=0.06721, over 21209.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3143, pruned_loss=0.08183, over 4293165.40 frames. ], batch size: 159, lr: 6.52e-03, grad_scale: 32.0
2023-06-21 09:47:06,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=750474.0, ans=0.1
2023-06-21 09:47:43,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=750534.0, ans=0.125
2023-06-21 09:47:50,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=750594.0, ans=0.0
2023-06-21 09:49:00,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.25 vs. limit=6.0
2023-06-21 09:49:01,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=750714.0, ans=0.125
2023-06-21 09:49:08,177 INFO [train.py:996] (3/4) Epoch 5, batch 3150, loss[loss=0.3279, simple_loss=0.3801, pruned_loss=0.1378, over 21371.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3167, pruned_loss=0.08302, over 4290336.22 frames. ], batch size: 159, lr: 6.52e-03, grad_scale: 32.0
2023-06-21 09:49:10,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=750774.0, ans=0.2
2023-06-21 09:49:24,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0
2023-06-21 09:49:24,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0
2023-06-21 09:49:26,478 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.529e+02 2.952e+02 3.587e+02 6.103e+02, threshold=5.905e+02, percent-clipped=1.0
2023-06-21 09:50:53,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=750954.0, ans=0.1
2023-06-21 09:51:31,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=751014.0, ans=0.1
2023-06-21 09:51:31,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0
2023-06-21 09:51:33,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=751014.0, ans=0.125
2023-06-21 09:51:56,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=751014.0, ans=0.1
2023-06-21 09:51:58,573 INFO [train.py:996] (3/4) Epoch 5, batch 3200, loss[loss=0.2952, simple_loss=0.3699, pruned_loss=0.1103, over 21444.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3193, pruned_loss=0.0842, over 4287597.71 frames. ], batch size: 507, lr: 6.52e-03, grad_scale: 32.0
2023-06-21 09:52:18,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=751134.0, ans=0.2
2023-06-21 09:52:29,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=12.0
2023-06-21 09:52:40,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=751134.0, ans=0.0
2023-06-21 09:52:41,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=751134.0, ans=0.04949747468305833
2023-06-21 09:54:06,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=751314.0, ans=6.0
2023-06-21 09:54:09,983 INFO [train.py:996] (3/4) Epoch 5, batch 3250, loss[loss=0.2165, simple_loss=0.3097, pruned_loss=0.06163, over 21682.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3204, pruned_loss=0.08619, over 4285622.69 frames. ], batch size: 263, lr: 6.52e-03, grad_scale: 16.0
2023-06-21 09:54:25,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=751374.0, ans=0.125
2023-06-21 09:54:25,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=751374.0, ans=0.0
2023-06-21 09:54:34,746 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.744e+02 3.235e+02 3.683e+02 5.247e+02, threshold=6.470e+02, percent-clipped=0.0
2023-06-21 09:54:35,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=751374.0, ans=0.125
2023-06-21 09:55:30,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=751494.0, ans=0.2
2023-06-21 09:55:48,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0
2023-06-21 09:56:54,138 INFO [train.py:996] (3/4) Epoch 5, batch 3300, loss[loss=0.1898, simple_loss=0.251, pruned_loss=0.06436, over 15324.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3152, pruned_loss=0.08435, over 4278907.53 frames. ], batch size: 61, lr: 6.52e-03, grad_scale: 16.0
2023-06-21 09:57:18,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=751734.0, ans=0.0
2023-06-21 09:59:09,224 INFO [train.py:996] (3/4) Epoch 5, batch 3350, loss[loss=0.2657, simple_loss=0.3297, pruned_loss=0.1008, over 21391.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3173, pruned_loss=0.08587, over 4277089.63 frames. ], batch size: 159, lr: 6.52e-03, grad_scale: 16.0
2023-06-21 09:59:38,680 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.776e+02 3.180e+02 3.722e+02 8.013e+02, threshold=6.359e+02, percent-clipped=4.0
2023-06-21 09:59:39,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0
2023-06-21 10:00:16,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=752094.0, ans=0.0
2023-06-21 10:00:37,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0
2023-06-21 10:01:29,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=752214.0, ans=15.0
2023-06-21 10:01:43,802 INFO [train.py:996] (3/4) Epoch 5, batch 3400, loss[loss=0.2126, simple_loss=0.2751, pruned_loss=0.07501, over 21634.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3174, pruned_loss=0.08606, over 4278857.27 frames. ], batch size: 264, lr: 6.52e-03, grad_scale: 16.0
2023-06-21 10:01:46,208 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.98 vs. limit=10.0
2023-06-21 10:02:05,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=752334.0, ans=0.2
2023-06-21 10:02:31,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=752334.0, ans=0.125
2023-06-21 10:02:37,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5
2023-06-21 10:04:00,551 INFO [train.py:996] (3/4) Epoch 5, batch 3450, loss[loss=0.2199, simple_loss=0.279, pruned_loss=0.08041, over 21732.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3119, pruned_loss=0.08438, over 4263913.64 frames. ], batch size: 316, lr: 6.51e-03, grad_scale: 16.0
2023-06-21 10:04:04,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=752574.0, ans=0.2
2023-06-21 10:04:11,206 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.675e+02 2.919e+02 3.509e+02 4.747e+02, threshold=5.839e+02, percent-clipped=0.0
2023-06-21 10:04:27,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=752634.0, ans=0.125
2023-06-21 10:06:26,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0
2023-06-21 10:06:27,098 INFO [train.py:996] (3/4) Epoch 5, batch 3500, loss[loss=0.2713, simple_loss=0.3425, pruned_loss=0.1001, over 21436.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.319, pruned_loss=0.08701, over 4265444.88 frames. ], batch size: 548, lr: 6.51e-03, grad_scale: 16.0
2023-06-21 10:06:41,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=752874.0, ans=0.04949747468305833
2023-06-21 10:07:00,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0
2023-06-21 10:07:50,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.78 vs. limit=10.0
2023-06-21 10:08:26,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=753054.0, ans=0.125
2023-06-21 10:08:50,563 INFO [train.py:996] (3/4) Epoch 5, batch 3550, loss[loss=0.1995, simple_loss=0.2727, pruned_loss=0.06316, over 21652.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3214, pruned_loss=0.08793, over 4270420.74 frames. ], batch size: 282, lr: 6.51e-03, grad_scale: 16.0
2023-06-21 10:08:58,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=753174.0, ans=0.125
2023-06-21 10:08:58,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=753174.0, ans=0.125
2023-06-21 10:09:06,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.617e+02 3.171e+02 3.907e+02 6.956e+02, threshold=6.342e+02, percent-clipped=4.0
2023-06-21 10:09:08,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.71 vs. limit=6.0
2023-06-21 10:09:11,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=753234.0, ans=0.125
2023-06-21 10:09:59,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=753354.0, ans=0.125
2023-06-21 10:10:13,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=753354.0, ans=0.0
2023-06-21 10:10:36,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=753414.0, ans=0.125
2023-06-21 10:10:52,560 INFO [train.py:996] (3/4) Epoch 5, batch 3600, loss[loss=0.2476, simple_loss=0.3177, pruned_loss=0.08874, over 20644.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3155, pruned_loss=0.08697, over 4271646.21 frames. ], batch size: 607, lr: 6.51e-03, grad_scale: 32.0
2023-06-21 10:11:09,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.97 vs. limit=6.0
2023-06-21 10:11:19,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.05 vs. limit=22.5
2023-06-21 10:12:38,916 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0
2023-06-21 10:12:59,209 INFO [train.py:996] (3/4) Epoch 5, batch 3650, loss[loss=0.2416, simple_loss=0.3127, pruned_loss=0.08531, over 21602.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3156, pruned_loss=0.08724, over 4268453.71 frames. ], batch size: 263, lr: 6.51e-03, grad_scale: 32.0
2023-06-21 10:12:59,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=753774.0, ans=0.125
2023-06-21 10:13:10,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.602e+02 2.931e+02 3.344e+02 6.459e+02, threshold=5.862e+02, percent-clipped=1.0
2023-06-21 10:13:47,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=753834.0, ans=0.0
2023-06-21 10:14:19,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=753894.0, ans=0.2
2023-06-21 10:14:27,589 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0
2023-06-21 10:14:53,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=754014.0, ans=0.025
2023-06-21 10:15:32,322 INFO [train.py:996] (3/4) Epoch 5, batch 3700, loss[loss=0.2631, simple_loss=0.3366, pruned_loss=0.09478, over 21838.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3154, pruned_loss=0.08609, over 4269997.95 frames. ], batch size: 414, lr: 6.51e-03, grad_scale: 32.0
2023-06-21 10:16:11,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=754134.0, ans=0.1
2023-06-21 10:16:18,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=754194.0, ans=0.1
2023-06-21 10:16:32,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=754194.0, ans=0.0
2023-06-21 10:16:55,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0
2023-06-21 10:17:17,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0
2023-06-21 10:17:18,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=754314.0, ans=0.2
2023-06-21 10:17:26,623 INFO [train.py:996] (3/4) Epoch 5, batch 3750, loss[loss=0.2905, simple_loss=0.3351, pruned_loss=0.123, over 21742.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3142, pruned_loss=0.08597, over 4275888.59 frames. ], batch size: 508, lr: 6.51e-03, grad_scale: 32.0
2023-06-21 10:17:31,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=754374.0, ans=0.0
2023-06-21 10:17:37,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.410e+02 2.787e+02 3.137e+02 4.786e+02, threshold=5.574e+02, percent-clipped=0.0
2023-06-21 10:18:41,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=754494.0, ans=0.125
2023-06-21 10:19:05,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=754554.0, ans=0.125
2023-06-21 10:20:02,553 INFO [train.py:996] (3/4) Epoch 5, batch 3800, loss[loss=0.2633, simple_loss=0.3252, pruned_loss=0.1007, over 21649.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3121, pruned_loss=0.08411, over 4277278.77 frames. ], batch size: 351, lr: 6.51e-03, grad_scale: 32.0
2023-06-21 10:20:21,965 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 10:21:08,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.09 vs. limit=15.0
2023-06-21 10:21:30,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=754854.0, ans=0.07
2023-06-21 10:21:36,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=754854.0, ans=0.125
2023-06-21 10:21:55,059 INFO [train.py:996] (3/4) Epoch 5, batch 3850, loss[loss=0.2175, simple_loss=0.2793, pruned_loss=0.07787, over 15348.00 frames. ], tot_loss[loss=0.24, simple_loss=0.31, pruned_loss=0.08498, over 4273208.78 frames. ], batch size: 61, lr: 6.50e-03, grad_scale: 32.0
2023-06-21 10:22:22,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.524e+02 3.055e+02 3.931e+02 8.028e+02, threshold=6.111e+02, percent-clipped=3.0
2023-06-21 10:22:28,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=755034.0, ans=0.125
2023-06-21 10:22:48,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=755094.0, ans=0.125
2023-06-21 10:23:08,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=755094.0, ans=0.0
2023-06-21 10:23:35,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=755154.0, ans=0.05
2023-06-21 10:23:36,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=755154.0, ans=0.2
2023-06-21 10:23:40,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=755154.0, ans=0.2
2023-06-21 10:23:55,603 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=22.5
2023-06-21 10:24:04,951 INFO [train.py:996] (3/4) Epoch 5, batch 3900, loss[loss=0.2501, simple_loss=0.3243, pruned_loss=0.08796, over 16842.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3058, pruned_loss=0.0843, over 4270452.63 frames. ], batch size: 60, lr: 6.50e-03, grad_scale: 32.0
2023-06-21 10:24:15,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=755274.0, ans=0.02
2023-06-21 10:25:02,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=755334.0, ans=0.0
2023-06-21 10:26:27,939 INFO [train.py:996] (3/4) Epoch 5, batch 3950, loss[loss=0.1722, simple_loss=0.2517, pruned_loss=0.04635, over 21615.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3061, pruned_loss=0.08198, over 4278171.23 frames. ], batch size: 230, lr: 6.50e-03, grad_scale: 32.0
2023-06-21 10:26:28,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0
2023-06-21 10:26:46,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.463e+02 2.789e+02 3.515e+02 5.351e+02, threshold=5.577e+02, percent-clipped=0.0
2023-06-21 10:27:00,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=755634.0, ans=0.0
2023-06-21 10:27:05,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=755634.0, ans=0.0
2023-06-21 10:27:38,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=755694.0, ans=22.5
2023-06-21 10:28:36,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=755814.0, ans=0.0
2023-06-21 10:28:48,518 INFO [train.py:996] (3/4) Epoch 5, batch 4000, loss[loss=0.2183, simple_loss=0.2768, pruned_loss=0.07986, over 21533.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.299, pruned_loss=0.07887, over 4270061.45 frames. ], batch size: 414, lr: 6.50e-03, grad_scale: 32.0
2023-06-21 10:28:53,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=755874.0, ans=0.125
2023-06-21 10:28:56,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=755874.0, ans=0.125
2023-06-21 10:29:51,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=755994.0, ans=0.125
2023-06-21 10:29:55,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=755994.0, ans=0.1
2023-06-21 10:29:55,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=755994.0, ans=0.125
2023-06-21 10:30:33,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=756054.0, ans=0.1
2023-06-21 10:30:34,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=756054.0, ans=0.125
2023-06-21 10:30:54,375 INFO [train.py:996] (3/4) Epoch 5, batch 4050, loss[loss=0.22, simple_loss=0.2904, pruned_loss=0.07478, over 21275.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2994, pruned_loss=0.07756, over 4271372.87 frames. ], batch size: 159, lr: 6.50e-03, grad_scale: 32.0
2023-06-21 10:31:20,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.698e+02 2.419e+02 2.822e+02 3.375e+02 5.095e+02, threshold=5.643e+02, percent-clipped=0.0
2023-06-21 10:32:25,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0
2023-06-21 10:32:48,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=756414.0, ans=0.125
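[Note: when working with these logs it is handy to extract the per-batch summaries programmatically. The regex below mirrors the train.py:996 record format seen throughout this section (a sketch; adjust it if the format changes):]

    import re

    PAT = re.compile(
        r"Epoch (\d+), batch (\d+), .*?tot_loss\[loss=([\d.]+).*?\], "
        r"batch size: (\d+), lr: ([\d.e+-]+)"
    )

    def parse_record(line: str):
        m = PAT.search(line)
        if m is None:
            return None
        epoch, batch, tot_loss, bsz, lr = m.groups()
        return int(epoch), int(batch), float(tot_loss), int(bsz), float(lr)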
], batch size: 607, lr: 6.50e-03, grad_scale: 16.0 2023-06-21 10:33:26,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=756474.0, ans=0.125 2023-06-21 10:33:37,179 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:34:05,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=756534.0, ans=0.125 2023-06-21 10:34:21,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-21 10:34:24,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. limit=10.0 2023-06-21 10:34:50,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-21 10:34:57,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=756714.0, ans=0.0 2023-06-21 10:35:00,539 INFO [train.py:996] (3/4) Epoch 5, batch 4150, loss[loss=0.1812, simple_loss=0.2716, pruned_loss=0.04545, over 21503.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3024, pruned_loss=0.07556, over 4273748.27 frames. ], batch size: 230, lr: 6.50e-03, grad_scale: 16.0 2023-06-21 10:35:06,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=756774.0, ans=0.0 2023-06-21 10:35:12,283 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 2.501e+02 2.940e+02 3.469e+02 5.994e+02, threshold=5.880e+02, percent-clipped=1.0 2023-06-21 10:35:26,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=756834.0, ans=0.0 2023-06-21 10:35:39,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=756834.0, ans=0.0 2023-06-21 10:36:06,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=756954.0, ans=0.125 2023-06-21 10:36:29,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=757014.0, ans=0.125 2023-06-21 10:36:46,358 INFO [train.py:996] (3/4) Epoch 5, batch 4200, loss[loss=0.2003, simple_loss=0.2783, pruned_loss=0.06112, over 21524.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.302, pruned_loss=0.07541, over 4266534.52 frames. ], batch size: 230, lr: 6.50e-03, grad_scale: 16.0 2023-06-21 10:37:12,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-21 10:37:41,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=757134.0, ans=0.125 2023-06-21 10:37:42,401 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=12.0 2023-06-21 10:38:38,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=757314.0, ans=0.0 2023-06-21 10:38:56,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-21 10:38:58,442 INFO [train.py:996] (3/4) Epoch 5, batch 4250, loss[loss=0.265, simple_loss=0.3467, pruned_loss=0.09166, over 19960.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3087, pruned_loss=0.0783, over 4270817.06 frames. ], batch size: 702, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:39:04,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=757374.0, ans=0.07 2023-06-21 10:39:19,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.640e+02 3.180e+02 4.167e+02 9.459e+02, threshold=6.360e+02, percent-clipped=16.0 2023-06-21 10:39:21,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-21 10:39:28,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=757434.0, ans=0.0 2023-06-21 10:40:59,201 INFO [train.py:996] (3/4) Epoch 5, batch 4300, loss[loss=0.2005, simple_loss=0.2815, pruned_loss=0.05974, over 21289.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3139, pruned_loss=0.07961, over 4276712.62 frames. ], batch size: 176, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:42:07,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=757794.0, ans=0.125 2023-06-21 10:43:06,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-21 10:43:29,606 INFO [train.py:996] (3/4) Epoch 5, batch 4350, loss[loss=0.288, simple_loss=0.351, pruned_loss=0.1125, over 21365.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.312, pruned_loss=0.07916, over 4282278.89 frames. ], batch size: 507, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:43:41,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=757974.0, ans=0.04949747468305833 2023-06-21 10:43:46,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.644e+02 3.174e+02 3.856e+02 7.919e+02, threshold=6.347e+02, percent-clipped=3.0 2023-06-21 10:44:44,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-21 10:44:51,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=758154.0, ans=0.125 2023-06-21 10:44:58,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=758214.0, ans=0.125 2023-06-21 10:45:00,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=22.5 2023-06-21 10:45:06,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. 
limit=22.5 2023-06-21 10:45:24,074 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:45:25,084 INFO [train.py:996] (3/4) Epoch 5, batch 4400, loss[loss=0.2257, simple_loss=0.3126, pruned_loss=0.06945, over 21606.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.308, pruned_loss=0.07825, over 4273892.61 frames. ], batch size: 263, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:46:35,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=22.5 2023-06-21 10:46:48,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=758454.0, ans=0.1 2023-06-21 10:47:50,937 INFO [train.py:996] (3/4) Epoch 5, batch 4450, loss[loss=0.2229, simple_loss=0.2973, pruned_loss=0.07422, over 21881.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3168, pruned_loss=0.0806, over 4280248.37 frames. ], batch size: 107, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:48:08,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.525e+02 2.989e+02 3.664e+02 5.986e+02, threshold=5.979e+02, percent-clipped=0.0 2023-06-21 10:49:04,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=758754.0, ans=0.05 2023-06-21 10:49:10,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=758754.0, ans=0.1 2023-06-21 10:49:43,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=758814.0, ans=0.0 2023-06-21 10:49:47,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=758814.0, ans=0.0 2023-06-21 10:50:13,879 INFO [train.py:996] (3/4) Epoch 5, batch 4500, loss[loss=0.2609, simple_loss=0.3632, pruned_loss=0.07934, over 21263.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3205, pruned_loss=0.08345, over 4288802.19 frames. ], batch size: 548, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:50:16,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=758874.0, ans=0.125 2023-06-21 10:51:22,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=12.0 2023-06-21 10:51:24,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=758994.0, ans=0.0 2023-06-21 10:51:58,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=759054.0, ans=0.1 2023-06-21 10:52:36,235 INFO [train.py:996] (3/4) Epoch 5, batch 4550, loss[loss=0.3444, simple_loss=0.4611, pruned_loss=0.1139, over 19766.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3257, pruned_loss=0.08447, over 4288749.42 frames. 
], batch size: 702, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:52:36,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=759174.0, ans=0.125 2023-06-21 10:53:00,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.644e+02 2.944e+02 3.521e+02 6.236e+02, threshold=5.889e+02, percent-clipped=2.0 2023-06-21 10:54:26,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=759414.0, ans=0.035 2023-06-21 10:54:29,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-21 10:54:54,674 INFO [train.py:996] (3/4) Epoch 5, batch 4600, loss[loss=0.2171, simple_loss=0.2924, pruned_loss=0.07087, over 21274.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3265, pruned_loss=0.08496, over 4285906.08 frames. ], batch size: 176, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:55:05,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=759474.0, ans=0.125 2023-06-21 10:56:03,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=759594.0, ans=0.125 2023-06-21 10:56:05,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=759654.0, ans=0.125 2023-06-21 10:56:21,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-06-21 10:56:31,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.19 vs. limit=15.0 2023-06-21 10:57:01,247 INFO [train.py:996] (3/4) Epoch 5, batch 4650, loss[loss=0.2529, simple_loss=0.3139, pruned_loss=0.096, over 21782.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3206, pruned_loss=0.08378, over 4287120.13 frames. ], batch size: 441, lr: 6.48e-03, grad_scale: 16.0 2023-06-21 10:57:04,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-21 10:57:17,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=759774.0, ans=0.125 2023-06-21 10:57:33,515 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 2.446e+02 2.893e+02 3.577e+02 6.132e+02, threshold=5.786e+02, percent-clipped=2.0 2023-06-21 10:57:50,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=759894.0, ans=0.2 2023-06-21 10:58:06,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=759894.0, ans=0.125 2023-06-21 10:58:14,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=759954.0, ans=0.0 2023-06-21 10:59:26,248 INFO [train.py:996] (3/4) Epoch 5, batch 4700, loss[loss=0.1954, simple_loss=0.2585, pruned_loss=0.06617, over 21717.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.31, pruned_loss=0.08092, over 4279233.98 frames. 
], batch size: 300, lr: 6.48e-03, grad_scale: 16.0 2023-06-21 10:59:44,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=760074.0, ans=0.125 2023-06-21 11:00:01,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=760134.0, ans=0.125 2023-06-21 11:00:52,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=760254.0, ans=0.035 2023-06-21 11:01:11,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=760314.0, ans=0.125 2023-06-21 11:01:43,752 INFO [train.py:996] (3/4) Epoch 5, batch 4750, loss[loss=0.2293, simple_loss=0.3043, pruned_loss=0.07716, over 21831.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3048, pruned_loss=0.08123, over 4278691.40 frames. ], batch size: 112, lr: 6.48e-03, grad_scale: 16.0 2023-06-21 11:02:00,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=760374.0, ans=0.1 2023-06-21 11:02:04,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.535e+02 2.851e+02 3.318e+02 5.705e+02, threshold=5.702e+02, percent-clipped=0.0 2023-06-21 11:02:52,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=760554.0, ans=0.125 2023-06-21 11:03:45,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=760674.0, ans=0.5 2023-06-21 11:03:55,418 INFO [train.py:996] (3/4) Epoch 5, batch 4800, loss[loss=0.218, simple_loss=0.3164, pruned_loss=0.05985, over 21769.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3065, pruned_loss=0.08143, over 4281225.65 frames. ], batch size: 298, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:04:16,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=760674.0, ans=0.07 2023-06-21 11:04:19,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=760734.0, ans=0.2 2023-06-21 11:06:05,295 INFO [train.py:996] (3/4) Epoch 5, batch 4850, loss[loss=0.2455, simple_loss=0.3338, pruned_loss=0.07856, over 21712.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3051, pruned_loss=0.08082, over 4281318.71 frames. ], batch size: 441, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:06:26,493 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.507e+02 2.882e+02 3.548e+02 6.033e+02, threshold=5.763e+02, percent-clipped=2.0 2023-06-21 11:07:10,779 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:07:14,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=761154.0, ans=0.07 2023-06-21 11:08:20,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=761214.0, ans=0.1 2023-06-21 11:08:30,021 INFO [train.py:996] (3/4) Epoch 5, batch 4900, loss[loss=0.2425, simple_loss=0.3343, pruned_loss=0.07539, over 21295.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3055, pruned_loss=0.08156, over 4276176.02 frames. 
], batch size: 176, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:09:10,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.18 vs. limit=10.0 2023-06-21 11:10:38,830 INFO [train.py:996] (3/4) Epoch 5, batch 4950, loss[loss=0.1822, simple_loss=0.2648, pruned_loss=0.04979, over 21217.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3093, pruned_loss=0.07989, over 4275285.58 frames. ], batch size: 176, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:10:51,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=761574.0, ans=0.125 2023-06-21 11:10:52,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.398e+02 2.770e+02 3.056e+02 4.888e+02, threshold=5.540e+02, percent-clipped=0.0 2023-06-21 11:11:10,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-21 11:12:20,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=761814.0, ans=0.125 2023-06-21 11:12:40,764 INFO [train.py:996] (3/4) Epoch 5, batch 5000, loss[loss=0.2239, simple_loss=0.3047, pruned_loss=0.07155, over 21801.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3079, pruned_loss=0.07712, over 4270649.68 frames. ], batch size: 282, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:12:41,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=761874.0, ans=0.125 2023-06-21 11:13:01,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=761934.0, ans=0.025 2023-06-21 11:13:14,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=761934.0, ans=0.0 2023-06-21 11:13:56,043 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:14:26,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=762114.0, ans=0.125 2023-06-21 11:14:46,859 INFO [train.py:996] (3/4) Epoch 5, batch 5050, loss[loss=0.2582, simple_loss=0.3255, pruned_loss=0.09541, over 21925.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3087, pruned_loss=0.07925, over 4278229.65 frames. ], batch size: 107, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:15:06,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.567e+02 3.027e+02 3.438e+02 5.567e+02, threshold=6.054e+02, percent-clipped=1.0 2023-06-21 11:16:58,408 INFO [train.py:996] (3/4) Epoch 5, batch 5100, loss[loss=0.1971, simple_loss=0.2742, pruned_loss=0.06003, over 21328.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3078, pruned_loss=0.07915, over 4281424.83 frames. 
], batch size: 176, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:17:13,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=762474.0, ans=0.125 2023-06-21 11:17:16,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=762474.0, ans=0.125 2023-06-21 11:17:50,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=762594.0, ans=0.125 2023-06-21 11:18:51,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=762714.0, ans=0.07 2023-06-21 11:19:20,184 INFO [train.py:996] (3/4) Epoch 5, batch 5150, loss[loss=0.242, simple_loss=0.3003, pruned_loss=0.09188, over 21428.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3055, pruned_loss=0.07975, over 4291122.53 frames. ], batch size: 177, lr: 6.47e-03, grad_scale: 16.0 2023-06-21 11:19:36,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.594e+02 2.911e+02 3.354e+02 4.463e+02, threshold=5.822e+02, percent-clipped=0.0 2023-06-21 11:20:13,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=762834.0, ans=0.2 2023-06-21 11:21:02,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=762954.0, ans=0.125 2023-06-21 11:21:07,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=762954.0, ans=0.0 2023-06-21 11:21:39,071 INFO [train.py:996] (3/4) Epoch 5, batch 5200, loss[loss=0.2277, simple_loss=0.3251, pruned_loss=0.0651, over 21727.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3075, pruned_loss=0.07939, over 4287593.46 frames. ], batch size: 247, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:22:25,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-21 11:22:42,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=763194.0, ans=0.0 2023-06-21 11:23:39,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=763314.0, ans=0.05 2023-06-21 11:24:02,728 INFO [train.py:996] (3/4) Epoch 5, batch 5250, loss[loss=0.2849, simple_loss=0.3596, pruned_loss=0.1051, over 21485.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3085, pruned_loss=0.07843, over 4278846.98 frames. 
], batch size: 471, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:24:11,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=763374.0, ans=0.125 2023-06-21 11:24:23,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 2.681e+02 2.958e+02 3.448e+02 5.597e+02, threshold=5.917e+02, percent-clipped=0.0 2023-06-21 11:24:30,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=763434.0, ans=0.2 2023-06-21 11:24:42,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=763494.0, ans=0.05 2023-06-21 11:25:38,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=763554.0, ans=0.04949747468305833 2023-06-21 11:25:57,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=763614.0, ans=0.125 2023-06-21 11:26:08,820 INFO [train.py:996] (3/4) Epoch 5, batch 5300, loss[loss=0.2439, simple_loss=0.3065, pruned_loss=0.09063, over 21972.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3086, pruned_loss=0.07946, over 4285314.88 frames. ], batch size: 415, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:26:25,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=763674.0, ans=0.2 2023-06-21 11:26:41,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-21 11:26:57,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=763734.0, ans=0.0 2023-06-21 11:27:08,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=763794.0, ans=0.125 2023-06-21 11:27:23,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.16 vs. limit=22.5 2023-06-21 11:27:56,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=763914.0, ans=0.0 2023-06-21 11:28:00,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=763914.0, ans=0.0 2023-06-21 11:28:24,468 INFO [train.py:996] (3/4) Epoch 5, batch 5350, loss[loss=0.2441, simple_loss=0.3102, pruned_loss=0.08906, over 21773.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3089, pruned_loss=0.08143, over 4286050.78 frames. ], batch size: 389, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:28:44,829 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.381e+02 2.626e+02 3.022e+02 5.115e+02, threshold=5.252e+02, percent-clipped=0.0 2023-06-21 11:28:53,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=764034.0, ans=0.0 2023-06-21 11:30:36,293 INFO [train.py:996] (3/4) Epoch 5, batch 5400, loss[loss=0.2325, simple_loss=0.3045, pruned_loss=0.08022, over 21859.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3083, pruned_loss=0.08185, over 4294101.07 frames. 
], batch size: 124, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:31:01,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=764334.0, ans=0.05 2023-06-21 11:31:12,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=764334.0, ans=0.1 2023-06-21 11:31:13,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=764334.0, ans=0.125 2023-06-21 11:31:35,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=764394.0, ans=0.125 2023-06-21 11:32:58,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=764514.0, ans=0.125 2023-06-21 11:33:01,000 INFO [train.py:996] (3/4) Epoch 5, batch 5450, loss[loss=0.1965, simple_loss=0.2823, pruned_loss=0.05531, over 21377.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3112, pruned_loss=0.08024, over 4286845.75 frames. ], batch size: 131, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:33:09,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=764574.0, ans=0.125 2023-06-21 11:33:09,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-21 11:33:18,145 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.472e+02 2.911e+02 3.692e+02 6.272e+02, threshold=5.821e+02, percent-clipped=3.0 2023-06-21 11:33:38,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=764634.0, ans=0.1 2023-06-21 11:33:57,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=764694.0, ans=0.05 2023-06-21 11:34:03,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.26 vs. limit=5.0 2023-06-21 11:34:45,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=764754.0, ans=0.125 2023-06-21 11:34:57,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-21 11:35:10,241 INFO [train.py:996] (3/4) Epoch 5, batch 5500, loss[loss=0.1958, simple_loss=0.292, pruned_loss=0.04983, over 21384.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3151, pruned_loss=0.07722, over 4291248.31 frames. ], batch size: 194, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:35:17,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.61 vs. 
limit=12.0 2023-06-21 11:36:32,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=764994.0, ans=0.125 2023-06-21 11:37:00,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=765054.0, ans=0.2 2023-06-21 11:37:20,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=765174.0, ans=0.125 2023-06-21 11:37:21,023 INFO [train.py:996] (3/4) Epoch 5, batch 5550, loss[loss=0.2296, simple_loss=0.325, pruned_loss=0.06707, over 21478.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3141, pruned_loss=0.07472, over 4283165.50 frames. ], batch size: 471, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:37:25,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-06-21 11:37:45,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-21 11:38:09,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.186e+02 2.465e+02 2.869e+02 4.676e+02, threshold=4.930e+02, percent-clipped=0.0 2023-06-21 11:39:21,165 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.75 vs. limit=15.0 2023-06-21 11:39:55,438 INFO [train.py:996] (3/4) Epoch 5, batch 5600, loss[loss=0.2711, simple_loss=0.366, pruned_loss=0.08807, over 21785.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3124, pruned_loss=0.07249, over 4278707.88 frames. ], batch size: 351, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:39:56,381 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-21 11:40:03,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=765474.0, ans=0.0 2023-06-21 11:40:52,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=765534.0, ans=0.2 2023-06-21 11:41:11,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=765594.0, ans=0.2 2023-06-21 11:41:16,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-21 11:41:22,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=765654.0, ans=0.1 2023-06-21 11:41:29,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=765654.0, ans=0.125 2023-06-21 11:41:33,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=765654.0, ans=0.0 2023-06-21 11:42:01,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.90 vs. 
limit=22.5 2023-06-21 11:42:04,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=765714.0, ans=0.125 2023-06-21 11:42:13,678 INFO [train.py:996] (3/4) Epoch 5, batch 5650, loss[loss=0.2287, simple_loss=0.2976, pruned_loss=0.07997, over 21858.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.316, pruned_loss=0.07486, over 4283228.61 frames. ], batch size: 282, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:42:36,584 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.490e+02 2.930e+02 3.681e+02 6.971e+02, threshold=5.860e+02, percent-clipped=6.0 2023-06-21 11:43:37,017 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.63 vs. limit=22.5 2023-06-21 11:43:45,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=765954.0, ans=0.125 2023-06-21 11:44:07,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-21 11:44:09,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=766014.0, ans=22.5 2023-06-21 11:44:27,068 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-21 11:44:28,800 INFO [train.py:996] (3/4) Epoch 5, batch 5700, loss[loss=0.2304, simple_loss=0.3177, pruned_loss=0.07154, over 21799.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3169, pruned_loss=0.07701, over 4287319.17 frames. ], batch size: 371, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:44:29,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=766074.0, ans=0.125 2023-06-21 11:44:42,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=766074.0, ans=0.125 2023-06-21 11:45:25,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=766134.0, ans=0.0 2023-06-21 11:45:34,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=766194.0, ans=0.125 2023-06-21 11:46:11,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=766254.0, ans=0.125 2023-06-21 11:47:17,574 INFO [train.py:996] (3/4) Epoch 5, batch 5750, loss[loss=0.2523, simple_loss=0.3102, pruned_loss=0.09713, over 19978.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3121, pruned_loss=0.0745, over 4272071.99 frames. 
], batch size: 702, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:47:40,198 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.289e+02 2.668e+02 3.231e+02 5.394e+02, threshold=5.337e+02, percent-clipped=0.0 2023-06-21 11:47:40,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=766434.0, ans=0.0 2023-06-21 11:48:15,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=766494.0, ans=0.0 2023-06-21 11:49:01,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=766554.0, ans=0.2 2023-06-21 11:49:01,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=766554.0, ans=0.125 2023-06-21 11:49:57,048 INFO [train.py:996] (3/4) Epoch 5, batch 5800, loss[loss=0.2578, simple_loss=0.3502, pruned_loss=0.08264, over 20876.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3118, pruned_loss=0.07416, over 4262701.16 frames. ], batch size: 608, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:50:58,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=766794.0, ans=0.125 2023-06-21 11:52:11,027 INFO [train.py:996] (3/4) Epoch 5, batch 5850, loss[loss=0.207, simple_loss=0.3119, pruned_loss=0.05104, over 21175.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3082, pruned_loss=0.07005, over 4267289.63 frames. ], batch size: 548, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:52:31,032 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 2.021e+02 2.544e+02 3.113e+02 4.412e+02, threshold=5.088e+02, percent-clipped=0.0 2023-06-21 11:52:33,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=767034.0, ans=0.125 2023-06-21 11:54:15,372 INFO [train.py:996] (3/4) Epoch 5, batch 5900, loss[loss=0.1613, simple_loss=0.2411, pruned_loss=0.04071, over 21456.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.3004, pruned_loss=0.06467, over 4273506.97 frames. 
], batch size: 211, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:54:15,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=767274.0, ans=0.0 2023-06-21 11:54:22,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=767274.0, ans=0.125 2023-06-21 11:54:39,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=767274.0, ans=0.125 2023-06-21 11:54:55,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=767334.0, ans=0.0 2023-06-21 11:55:12,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=767394.0, ans=0.1 2023-06-21 11:55:55,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=767454.0, ans=0.0 2023-06-21 11:56:22,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=767514.0, ans=0.04949747468305833 2023-06-21 11:56:30,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=767574.0, ans=0.1 2023-06-21 11:56:31,131 INFO [train.py:996] (3/4) Epoch 5, batch 5950, loss[loss=0.1964, simple_loss=0.2655, pruned_loss=0.06365, over 21785.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2994, pruned_loss=0.06793, over 4278055.89 frames. ], batch size: 283, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:56:38,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=767574.0, ans=0.0 2023-06-21 11:56:49,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=767574.0, ans=0.05 2023-06-21 11:56:54,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.19 vs. limit=10.0 2023-06-21 11:56:54,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 2.360e+02 2.756e+02 3.348e+02 5.051e+02, threshold=5.512e+02, percent-clipped=0.0 2023-06-21 11:57:36,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=22.5 2023-06-21 11:58:02,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=767754.0, ans=0.0 2023-06-21 11:58:52,391 INFO [train.py:996] (3/4) Epoch 5, batch 6000, loss[loss=0.2461, simple_loss=0.2969, pruned_loss=0.09766, over 21572.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2959, pruned_loss=0.07068, over 4283096.86 frames. ], batch size: 441, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:58:52,391 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 11:59:55,603 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2623, simple_loss=0.3577, pruned_loss=0.08348, over 1796401.00 frames. 
2023-06-21 11:59:55,605 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-21 12:00:26,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=767934.0, ans=0.0 2023-06-21 12:00:47,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=767994.0, ans=0.125 2023-06-21 12:01:10,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=768054.0, ans=0.0 2023-06-21 12:01:24,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=768114.0, ans=0.0 2023-06-21 12:01:56,882 INFO [train.py:996] (3/4) Epoch 5, batch 6050, loss[loss=0.1837, simple_loss=0.2486, pruned_loss=0.05936, over 21208.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2909, pruned_loss=0.07178, over 4263686.65 frames. ], batch size: 159, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 12:02:09,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=22.5 2023-06-21 12:02:30,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.508e+02 2.739e+02 3.269e+02 4.730e+02, threshold=5.478e+02, percent-clipped=0.0 2023-06-21 12:02:34,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=768234.0, ans=0.0 2023-06-21 12:03:15,863 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.42 vs. limit=22.5 2023-06-21 12:03:25,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=768354.0, ans=0.04949747468305833 2023-06-21 12:03:37,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=768414.0, ans=0.0 2023-06-21 12:04:00,824 INFO [train.py:996] (3/4) Epoch 5, batch 6100, loss[loss=0.2114, simple_loss=0.3013, pruned_loss=0.06073, over 21678.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2891, pruned_loss=0.0698, over 4267378.52 frames. ], batch size: 389, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 12:04:32,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=768534.0, ans=0.0 2023-06-21 12:04:34,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=768534.0, ans=0.0 2023-06-21 12:05:07,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=768594.0, ans=0.1 2023-06-21 12:05:49,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=768714.0, ans=0.0 2023-06-21 12:06:07,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=768714.0, ans=0.2 2023-06-21 12:06:19,078 INFO [train.py:996] (3/4) Epoch 5, batch 6150, loss[loss=0.211, simple_loss=0.2888, pruned_loss=0.06654, over 21693.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2934, pruned_loss=0.07222, over 4262174.23 frames. 
], batch size: 332, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 12:06:36,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-21 12:06:40,064 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.345e+02 2.949e+02 3.402e+02 5.805e+02, threshold=5.898e+02, percent-clipped=1.0 2023-06-21 12:06:44,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.34 vs. limit=15.0 2023-06-21 12:07:30,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=768894.0, ans=0.0 2023-06-21 12:08:15,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=769014.0, ans=0.0 2023-06-21 12:08:37,422 INFO [train.py:996] (3/4) Epoch 5, batch 6200, loss[loss=0.2747, simple_loss=0.3734, pruned_loss=0.08803, over 21678.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2975, pruned_loss=0.07372, over 4272493.94 frames. ], batch size: 414, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:09:37,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=769194.0, ans=0.0 2023-06-21 12:10:51,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=769314.0, ans=0.1 2023-06-21 12:10:54,294 INFO [train.py:996] (3/4) Epoch 5, batch 6250, loss[loss=0.1902, simple_loss=0.2819, pruned_loss=0.04929, over 21452.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3008, pruned_loss=0.07348, over 4266789.91 frames. ], batch size: 211, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:11:21,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 2.368e+02 2.704e+02 3.313e+02 4.790e+02, threshold=5.409e+02, percent-clipped=0.0 2023-06-21 12:11:42,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=769434.0, ans=0.0 2023-06-21 12:11:51,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=769494.0, ans=0.1 2023-06-21 12:12:06,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=769554.0, ans=0.125 2023-06-21 12:13:07,525 INFO [train.py:996] (3/4) Epoch 5, batch 6300, loss[loss=0.2326, simple_loss=0.3075, pruned_loss=0.07881, over 21838.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3033, pruned_loss=0.07238, over 4265608.64 frames. ], batch size: 298, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:14:22,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.06 vs. limit=10.0 2023-06-21 12:15:03,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=769914.0, ans=0.025 2023-06-21 12:15:18,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=769914.0, ans=0.125 2023-06-21 12:15:21,312 INFO [train.py:996] (3/4) Epoch 5, batch 6350, loss[loss=0.2565, simple_loss=0.3249, pruned_loss=0.0941, over 21382.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3067, pruned_loss=0.07686, over 4271327.74 frames. 
], batch size: 176, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:15:50,451 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.600e+02 2.925e+02 3.648e+02 4.818e+02, threshold=5.851e+02, percent-clipped=0.0 2023-06-21 12:16:27,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=770094.0, ans=0.1 2023-06-21 12:16:32,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=770094.0, ans=0.1 2023-06-21 12:17:23,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=770214.0, ans=0.0 2023-06-21 12:17:48,482 INFO [train.py:996] (3/4) Epoch 5, batch 6400, loss[loss=0.2516, simple_loss=0.3273, pruned_loss=0.08797, over 21325.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3146, pruned_loss=0.08129, over 4273815.11 frames. ], batch size: 548, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:19:15,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=770454.0, ans=0.125 2023-06-21 12:19:21,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=770454.0, ans=0.07 2023-06-21 12:19:21,841 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=22.5 2023-06-21 12:19:23,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-21 12:19:59,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=770514.0, ans=0.035 2023-06-21 12:20:01,645 INFO [train.py:996] (3/4) Epoch 5, batch 6450, loss[loss=0.1972, simple_loss=0.2722, pruned_loss=0.06107, over 21406.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3162, pruned_loss=0.08002, over 4277087.18 frames. ], batch size: 211, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:20:12,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=770574.0, ans=0.0 2023-06-21 12:20:19,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.431e+02 2.805e+02 3.198e+02 5.945e+02, threshold=5.611e+02, percent-clipped=1.0 2023-06-21 12:21:23,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=12.0 2023-06-21 12:22:15,289 INFO [train.py:996] (3/4) Epoch 5, batch 6500, loss[loss=0.2381, simple_loss=0.3111, pruned_loss=0.08255, over 21569.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3076, pruned_loss=0.07787, over 4272938.64 frames. ], batch size: 414, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:22:15,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=770874.0, ans=0.0 2023-06-21 12:24:16,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-21 12:24:41,661 INFO [train.py:996] (3/4) Epoch 5, batch 6550, loss[loss=0.2121, simple_loss=0.2799, pruned_loss=0.07218, over 21511.00 frames. 
], tot_loss[loss=0.2301, simple_loss=0.3065, pruned_loss=0.07679, over 4276696.66 frames. ], batch size: 131, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:24:58,472 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.666e+02 3.007e+02 3.674e+02 6.110e+02, threshold=6.015e+02, percent-clipped=2.0 2023-06-21 12:25:21,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=771234.0, ans=15.0 2023-06-21 12:26:52,307 INFO [train.py:996] (3/4) Epoch 5, batch 6600, loss[loss=0.1899, simple_loss=0.2572, pruned_loss=0.06132, over 21693.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3008, pruned_loss=0.07583, over 4275706.77 frames. ], batch size: 282, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:28:56,937 INFO [train.py:996] (3/4) Epoch 5, batch 6650, loss[loss=0.2136, simple_loss=0.2748, pruned_loss=0.07618, over 21370.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2945, pruned_loss=0.07292, over 4272401.58 frames. ], batch size: 160, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:29:40,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.317e+02 2.595e+02 2.999e+02 4.464e+02, threshold=5.189e+02, percent-clipped=0.0 2023-06-21 12:29:45,487 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-06-21 12:30:44,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=771954.0, ans=0.125 2023-06-21 12:31:00,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=772014.0, ans=0.2 2023-06-21 12:31:04,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=772014.0, ans=0.0 2023-06-21 12:31:08,782 INFO [train.py:996] (3/4) Epoch 5, batch 6700, loss[loss=0.1982, simple_loss=0.2648, pruned_loss=0.0658, over 21857.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2901, pruned_loss=0.07226, over 4271644.70 frames. ], batch size: 107, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:32:45,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=772254.0, ans=0.2 2023-06-21 12:33:12,787 INFO [train.py:996] (3/4) Epoch 5, batch 6750, loss[loss=0.222, simple_loss=0.2906, pruned_loss=0.07674, over 21870.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2888, pruned_loss=0.07305, over 4272532.04 frames. 
], batch size: 316, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:33:58,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.401e+02 2.813e+02 3.332e+02 5.748e+02, threshold=5.626e+02, percent-clipped=2.0 2023-06-21 12:33:59,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=772434.0, ans=0.125 2023-06-21 12:34:22,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=772494.0, ans=0.07 2023-06-21 12:34:23,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=772494.0, ans=0.125 2023-06-21 12:34:25,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=772554.0, ans=0.2 2023-06-21 12:34:59,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=772614.0, ans=0.0 2023-06-21 12:35:16,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=772614.0, ans=0.125 2023-06-21 12:35:25,716 INFO [train.py:996] (3/4) Epoch 5, batch 6800, loss[loss=0.234, simple_loss=0.2924, pruned_loss=0.08776, over 21353.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2906, pruned_loss=0.075, over 4276218.93 frames. ], batch size: 144, lr: 6.43e-03, grad_scale: 32.0 2023-06-21 12:35:27,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=772674.0, ans=0.125 2023-06-21 12:35:48,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=772734.0, ans=0.0 2023-06-21 12:36:11,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=772734.0, ans=0.0 2023-06-21 12:36:16,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=772734.0, ans=0.1 2023-06-21 12:36:21,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=772794.0, ans=0.0 2023-06-21 12:37:02,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=772914.0, ans=0.125 2023-06-21 12:37:11,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.97 vs. limit=15.0 2023-06-21 12:37:19,276 INFO [train.py:996] (3/4) Epoch 5, batch 6850, loss[loss=0.2179, simple_loss=0.2776, pruned_loss=0.07914, over 21659.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2896, pruned_loss=0.07694, over 4278384.22 frames. 
], batch size: 230, lr: 6.43e-03, grad_scale: 32.0 2023-06-21 12:38:02,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.533e+02 2.911e+02 3.331e+02 6.135e+02, threshold=5.822e+02, percent-clipped=1.0 2023-06-21 12:38:09,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=773034.0, ans=0.02 2023-06-21 12:38:26,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=773094.0, ans=0.1 2023-06-21 12:39:39,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=773214.0, ans=0.07 2023-06-21 12:39:46,838 INFO [train.py:996] (3/4) Epoch 5, batch 6900, loss[loss=0.19, simple_loss=0.2714, pruned_loss=0.05433, over 21220.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2921, pruned_loss=0.07715, over 4284957.28 frames. ], batch size: 159, lr: 6.43e-03, grad_scale: 32.0 2023-06-21 12:40:17,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-21 12:40:26,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=773334.0, ans=0.125 2023-06-21 12:41:50,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-21 12:42:07,209 INFO [train.py:996] (3/4) Epoch 5, batch 6950, loss[loss=0.239, simple_loss=0.3131, pruned_loss=0.08248, over 21234.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2942, pruned_loss=0.07435, over 4286975.45 frames. ], batch size: 159, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:42:25,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-06-21 12:42:45,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.439e+02 2.793e+02 3.180e+02 5.285e+02, threshold=5.586e+02, percent-clipped=0.0 2023-06-21 12:43:40,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.73 vs. limit=10.0 2023-06-21 12:43:46,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=773754.0, ans=0.1 2023-06-21 12:43:50,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=773754.0, ans=0.0 2023-06-21 12:44:08,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=773814.0, ans=0.0 2023-06-21 12:44:15,866 INFO [train.py:996] (3/4) Epoch 5, batch 7000, loss[loss=0.2477, simple_loss=0.3133, pruned_loss=0.091, over 21297.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2978, pruned_loss=0.07742, over 4284858.89 frames. ], batch size: 548, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:44:49,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=773934.0, ans=0.125 2023-06-21 12:45:04,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.33 vs. 
limit=15.0 2023-06-21 12:45:05,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=773934.0, ans=0.2 2023-06-21 12:45:11,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=773934.0, ans=0.1 2023-06-21 12:45:35,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=773994.0, ans=0.125 2023-06-21 12:45:48,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=774054.0, ans=0.025 2023-06-21 12:46:03,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=774114.0, ans=0.1 2023-06-21 12:46:09,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=774114.0, ans=0.0 2023-06-21 12:46:29,990 INFO [train.py:996] (3/4) Epoch 5, batch 7050, loss[loss=0.2111, simple_loss=0.2956, pruned_loss=0.0633, over 21026.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2948, pruned_loss=0.076, over 4281251.93 frames. ], batch size: 607, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:47:02,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=774234.0, ans=0.125 2023-06-21 12:47:02,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2023-06-21 12:47:07,612 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.342e+02 2.952e+02 3.730e+02 6.285e+02, threshold=5.903e+02, percent-clipped=1.0 2023-06-21 12:48:14,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=774354.0, ans=0.07 2023-06-21 12:48:19,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=774354.0, ans=0.125 2023-06-21 12:48:43,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=774414.0, ans=0.0 2023-06-21 12:48:50,490 INFO [train.py:996] (3/4) Epoch 5, batch 7100, loss[loss=0.2014, simple_loss=0.2746, pruned_loss=0.06408, over 21634.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3004, pruned_loss=0.07774, over 4283021.28 frames. ], batch size: 263, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:49:39,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=774534.0, ans=0.125 2023-06-21 12:49:54,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=774594.0, ans=0.0 2023-06-21 12:50:55,941 INFO [train.py:996] (3/4) Epoch 5, batch 7150, loss[loss=0.2435, simple_loss=0.3138, pruned_loss=0.08658, over 21359.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3012, pruned_loss=0.07744, over 4277068.28 frames. 
], batch size: 549, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:51:26,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=774834.0, ans=0.125 2023-06-21 12:51:34,350 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.394e+02 2.698e+02 3.391e+02 6.183e+02, threshold=5.396e+02, percent-clipped=2.0 2023-06-21 12:51:35,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-21 12:52:04,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=774894.0, ans=0.125 2023-06-21 12:52:08,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=774894.0, ans=0.0 2023-06-21 12:52:41,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=774954.0, ans=0.125 2023-06-21 12:53:08,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=775014.0, ans=0.125 2023-06-21 12:53:11,031 INFO [train.py:996] (3/4) Epoch 5, batch 7200, loss[loss=0.2311, simple_loss=0.2934, pruned_loss=0.08446, over 21547.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3036, pruned_loss=0.07928, over 4268076.19 frames. ], batch size: 391, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 12:53:29,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=775074.0, ans=0.0 2023-06-21 12:54:01,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-21 12:55:12,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-21 12:55:20,756 INFO [train.py:996] (3/4) Epoch 5, batch 7250, loss[loss=0.261, simple_loss=0.2922, pruned_loss=0.115, over 21367.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2997, pruned_loss=0.07828, over 4270294.28 frames. ], batch size: 509, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 12:56:02,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.505e+02 2.857e+02 3.370e+02 5.311e+02, threshold=5.714e+02, percent-clipped=0.0 2023-06-21 12:56:42,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=775554.0, ans=0.0 2023-06-21 12:57:38,539 INFO [train.py:996] (3/4) Epoch 5, batch 7300, loss[loss=0.192, simple_loss=0.2511, pruned_loss=0.06645, over 21493.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2931, pruned_loss=0.07732, over 4264678.18 frames. 
], batch size: 230, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 12:57:45,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=775674.0, ans=0.0 2023-06-21 12:58:17,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=775734.0, ans=0.07 2023-06-21 12:58:19,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=775734.0, ans=0.0 2023-06-21 12:58:48,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=775854.0, ans=0.05 2023-06-21 12:58:54,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=775854.0, ans=10.0 2023-06-21 12:59:42,991 INFO [train.py:996] (3/4) Epoch 5, batch 7350, loss[loss=0.2352, simple_loss=0.2811, pruned_loss=0.09465, over 21423.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2909, pruned_loss=0.07793, over 4259120.85 frames. ], batch size: 476, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 12:59:45,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=775974.0, ans=0.125 2023-06-21 13:00:25,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.511e+02 2.899e+02 3.601e+02 8.387e+02, threshold=5.798e+02, percent-clipped=4.0 2023-06-21 13:00:40,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-21 13:01:55,456 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:01:59,271 INFO [train.py:996] (3/4) Epoch 5, batch 7400, loss[loss=0.1829, simple_loss=0.2711, pruned_loss=0.04741, over 19712.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2967, pruned_loss=0.08008, over 4267665.55 frames. ], batch size: 702, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:02:16,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=776274.0, ans=0.035 2023-06-21 13:02:54,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=15.0 2023-06-21 13:03:13,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=776394.0, ans=0.0 2023-06-21 13:03:39,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=776454.0, ans=0.04949747468305833 2023-06-21 13:04:15,328 INFO [train.py:996] (3/4) Epoch 5, batch 7450, loss[loss=0.1975, simple_loss=0.2436, pruned_loss=0.0757, over 20816.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.296, pruned_loss=0.07884, over 4273674.38 frames. ], batch size: 609, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:04:29,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=776574.0, ans=0.125 2023-06-21 13:04:36,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.89 vs. 
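limit=15.0

The [scaling.py:962] Whitening lines fire when a whiteness diagnostic of a module's activations exceeds its limit. One reasonable metric of this kind (an illustrative assumption, not necessarily the exact formula in scaling.py) compares the eigenvalue spread of the feature covariance against a perfectly white spectrum: it equals 1.0 when all eigenvalues are equal and grows as variance concentrates in a few directions, so a record is logged once the value passes the limit:

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels) activations for one whitening group.

    Returns d * sum(lambda_i^2) / (sum(lambda_i))^2 for the covariance
    eigenvalues lambda_i, computed via traces so no eigendecomposition
    is needed: trace(C @ C) = sum(lambda_i^2), trace(C) = sum(lambda_i).
    """
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]   # (C, C) sample covariance
    d = cov.shape[0]
    return d * (cov @ cov).trace() / cov.trace() ** 2

x = torch.randn(2000, 256)                     # near-white features
print(float(whitening_metric(x)))              # ~1.1, well under a limit of 15.0
print(float(whitening_metric(x * torch.linspace(0.1, 10.0, 256))))  # far larger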
2023-06-21 13:04:40,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.565e+02 3.143e+02 4.189e+02 6.806e+02, threshold=6.287e+02, percent-clipped=2.0
2023-06-21 13:05:04,339 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 13:05:07,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=776694.0, ans=0.125
2023-06-21 13:05:26,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=776754.0, ans=0.2
2023-06-21 13:05:27,523 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 13:05:52,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=776814.0, ans=0.2
2023-06-21 13:06:23,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0
2023-06-21 13:06:30,092 INFO [train.py:996] (3/4) Epoch 5, batch 7500, loss[loss=0.2382, simple_loss=0.3343, pruned_loss=0.07104, over 21538.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3005, pruned_loss=0.08078, over 4273767.81 frames. ], batch size: 230, lr: 6.41e-03, grad_scale: 32.0
2023-06-21 13:06:44,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=776934.0, ans=0.125
2023-06-21 13:06:48,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.15 vs. limit=15.0
2023-06-21 13:07:11,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=776934.0, ans=0.0
2023-06-21 13:07:21,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=776994.0, ans=0.0
2023-06-21 13:08:38,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=777114.0, ans=0.125
2023-06-21 13:08:49,605 INFO [train.py:996] (3/4) Epoch 5, batch 7550, loss[loss=0.2246, simple_loss=0.2826, pruned_loss=0.08331, over 21122.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3101, pruned_loss=0.08064, over 4273503.02 frames. ], batch size: 608, lr: 6.41e-03, grad_scale: 32.0
2023-06-21 13:09:24,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=777234.0, ans=0.0
2023-06-21 13:09:33,353 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.666e+02 3.013e+02 3.483e+02 5.379e+02, threshold=6.026e+02, percent-clipped=0.0
2023-06-21 13:09:33,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=777234.0, ans=0.035
2023-06-21 13:10:03,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=777294.0, ans=0.2
2023-06-21 13:10:51,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=777414.0, ans=0.125
2023-06-21 13:10:58,792 INFO [train.py:996] (3/4) Epoch 5, batch 7600, loss[loss=0.2409, simple_loss=0.3091, pruned_loss=0.08633, over 21305.00 frames.
], tot_loss[loss=0.2337, simple_loss=0.3082, pruned_loss=0.07962, over 4277052.19 frames. ], batch size: 159, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:11:02,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=777474.0, ans=0.125 2023-06-21 13:11:15,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=777474.0, ans=0.0 2023-06-21 13:11:56,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=777534.0, ans=0.025 2023-06-21 13:12:00,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-21 13:12:28,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=777594.0, ans=0.0 2023-06-21 13:13:19,292 INFO [train.py:996] (3/4) Epoch 5, batch 7650, loss[loss=0.2359, simple_loss=0.3082, pruned_loss=0.08177, over 21824.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.306, pruned_loss=0.08027, over 4276731.46 frames. ], batch size: 112, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:13:19,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777774.0, ans=0.1 2023-06-21 13:14:08,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.535e+02 2.921e+02 3.505e+02 6.410e+02, threshold=5.842e+02, percent-clipped=1.0 2023-06-21 13:15:11,686 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:15:31,824 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:15:42,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=778014.0, ans=0.125 2023-06-21 13:15:45,151 INFO [train.py:996] (3/4) Epoch 5, batch 7700, loss[loss=0.3189, simple_loss=0.3991, pruned_loss=0.1193, over 21790.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3087, pruned_loss=0.08351, over 4277233.04 frames. ], batch size: 118, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:15:47,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=778074.0, ans=0.125 2023-06-21 13:16:34,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-06-21 13:17:54,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=778374.0, ans=0.07 2023-06-21 13:17:55,139 INFO [train.py:996] (3/4) Epoch 5, batch 7750, loss[loss=0.3954, simple_loss=0.4706, pruned_loss=0.1601, over 21439.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3137, pruned_loss=0.08342, over 4272479.74 frames. 
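], batch size: 507, lr: 6.41e-03, grad_scale: 16.0

The grad_scale field stepping between powers of two (32.0 and 16.0 in these records) is the signature of dynamic loss scaling in mixed-precision training: the scale is halved whenever a scaled step produces inf/nan gradients and grows back after a long run of finite steps. A sketch of the standard rule using torch.cuda.amp follows; model, optimizer, criterion, inputs and targets are placeholders, and the constructor values shown are PyTorch's defaults, not necessarily this run's settings:

import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=2.0 ** 16,   # starting scale
    growth_factor=2.0,      # double after enough finite steps
    backoff_factor=0.5,     # halve on inf/nan gradients
    growth_interval=2000,   # finite steps required before doubling
)

def training_step(model, optimizer, criterion, inputs, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()  # backward through the scaled loss
    scaler.step(optimizer)         # unscales; skips the step on overflow
    scaler.update()                # adjusts the scale for the next step
    return scaler.get_scale()      # the quantity logged as grad_scale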
2023-06-21 13:18:43,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.478e+02 2.745e+02 3.092e+02 4.488e+02, threshold=5.489e+02, percent-clipped=0.0
2023-06-21 13:19:15,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=778494.0, ans=0.0
2023-06-21 13:20:03,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0
2023-06-21 13:20:24,294 INFO [train.py:996] (3/4) Epoch 5, batch 7800, loss[loss=0.205, simple_loss=0.282, pruned_loss=0.06402, over 21647.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3156, pruned_loss=0.08353, over 4265790.34 frames. ], batch size: 263, lr: 6.40e-03, grad_scale: 16.0
2023-06-21 13:20:30,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=778674.0, ans=0.125
2023-06-21 13:21:45,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=778854.0, ans=0.1
2023-06-21 13:21:50,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.97 vs. limit=15.0
2023-06-21 13:22:29,700 INFO [train.py:996] (3/4) Epoch 5, batch 7850, loss[loss=0.2346, simple_loss=0.2913, pruned_loss=0.08897, over 22025.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3092, pruned_loss=0.0815, over 4258078.01 frames. ], batch size: 103, lr: 6.40e-03, grad_scale: 16.0
2023-06-21 13:23:06,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.498e+02 2.821e+02 3.538e+02 6.560e+02, threshold=5.643e+02, percent-clipped=3.0
2023-06-21 13:23:25,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=779094.0, ans=0.2
2023-06-21 13:23:59,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0
2023-06-21 13:23:59,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.23 vs. limit=22.5
2023-06-21 13:24:21,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779214.0, ans=0.1
2023-06-21 13:24:34,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=779214.0, ans=0.125
2023-06-21 13:24:40,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0
2023-06-21 13:24:41,376 INFO [train.py:996] (3/4) Epoch 5, batch 7900, loss[loss=0.2242, simple_loss=0.3162, pruned_loss=0.06609, over 21804.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3049, pruned_loss=0.08034, over 4255222.52 frames.
], batch size: 282, lr: 6.40e-03, grad_scale: 16.0 2023-06-21 13:25:15,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=779274.0, ans=0.0 2023-06-21 13:25:39,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=779334.0, ans=0.0 2023-06-21 13:26:00,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=779394.0, ans=0.125 2023-06-21 13:26:46,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=779514.0, ans=0.0 2023-06-21 13:26:56,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779514.0, ans=0.1 2023-06-21 13:27:14,814 INFO [train.py:996] (3/4) Epoch 5, batch 7950, loss[loss=0.2885, simple_loss=0.3574, pruned_loss=0.1097, over 21738.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3087, pruned_loss=0.08032, over 4253729.68 frames. ], batch size: 441, lr: 6.40e-03, grad_scale: 16.0 2023-06-21 13:27:23,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.44 vs. limit=15.0 2023-06-21 13:27:42,673 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.594e+02 2.803e+02 3.768e+02 5.185e+02, threshold=5.606e+02, percent-clipped=0.0 2023-06-21 13:27:55,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=779634.0, ans=0.05 2023-06-21 13:28:34,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=22.5 2023-06-21 13:29:15,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=779814.0, ans=0.2 2023-06-21 13:29:26,312 INFO [train.py:996] (3/4) Epoch 5, batch 8000, loss[loss=0.3207, simple_loss=0.3841, pruned_loss=0.1286, over 21365.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.313, pruned_loss=0.08343, over 4254422.67 frames. ], batch size: 507, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:30:15,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=779934.0, ans=0.125 2023-06-21 13:30:40,046 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-06-21 13:32:03,746 INFO [train.py:996] (3/4) Epoch 5, batch 8050, loss[loss=0.2241, simple_loss=0.2931, pruned_loss=0.07751, over 21484.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3137, pruned_loss=0.08381, over 4255985.25 frames. ], batch size: 211, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:32:16,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.19 vs. 
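limit=12.0

The ubiquitous [scaling.py:182] ScheduledFloat lines print hyperparameters (dropout probabilities, skip rates, scale_min floors and the like) whose values are scheduled on batch_count rather than fixed. A minimal piecewise-linear version of such a schedule follows; the breakpoints are illustrative, not the recipe's actual ones:

import bisect

class ScheduledFloat:
    """Piecewise-linear hyperparameter value as a function of batch_count."""
    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value(self, batch_count: float) -> float:
        i = bisect.bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]      # before the first breakpoint
        if i == len(self.xs):
            return self.ys[-1]     # after the last breakpoint
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a skip rate that decays as training progresses:
skip_rate = ScheduledFloat((0.0, 0.5), (50000.0, 0.05), (100000.0, 0.0))
print(skip_rate.value(780234.0))   # past the last breakpoint -> 0.0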
2023-06-21 13:32:24,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.993e+02 3.544e+02 4.620e+02 9.797e+02, threshold=7.088e+02, percent-clipped=13.0
2023-06-21 13:32:48,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=780234.0, ans=0.125
2023-06-21 13:33:44,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=780354.0, ans=0.125
2023-06-21 13:34:12,509 INFO [train.py:996] (3/4) Epoch 5, batch 8100, loss[loss=0.2401, simple_loss=0.2968, pruned_loss=0.09166, over 21382.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3115, pruned_loss=0.08346, over 4254771.29 frames. ], batch size: 159, lr: 6.40e-03, grad_scale: 32.0
2023-06-21 13:34:15,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=780474.0, ans=0.125
2023-06-21 13:34:43,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=780534.0, ans=0.125
2023-06-21 13:35:34,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5
2023-06-21 13:35:36,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=780594.0, ans=0.125
2023-06-21 13:35:55,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0
2023-06-21 13:36:52,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=780774.0, ans=0.025
2023-06-21 13:36:57,683 INFO [train.py:996] (3/4) Epoch 5, batch 8150, loss[loss=0.3455, simple_loss=0.4233, pruned_loss=0.1339, over 21474.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3219, pruned_loss=0.08564, over 4257401.99 frames. ], batch size: 507, lr: 6.40e-03, grad_scale: 32.0
2023-06-21 13:37:42,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.573e+02 2.921e+02 3.508e+02 5.879e+02, threshold=5.842e+02, percent-clipped=0.0
2023-06-21 13:38:51,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=781014.0, ans=0.1
2023-06-21 13:38:53,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=12.0
2023-06-21 13:38:58,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5
2023-06-21 13:39:08,747 INFO [train.py:996] (3/4) Epoch 5, batch 8200, loss[loss=0.229, simple_loss=0.2871, pruned_loss=0.08541, over 21812.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3154, pruned_loss=0.08339, over 4268932.36 frames.
], batch size: 102, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:39:42,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=781134.0, ans=0.125 2023-06-21 13:40:17,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=781194.0, ans=0.125 2023-06-21 13:40:20,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=781194.0, ans=0.2 2023-06-21 13:40:21,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0 2023-06-21 13:40:22,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-21 13:40:39,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=781254.0, ans=0.125 2023-06-21 13:41:33,485 INFO [train.py:996] (3/4) Epoch 5, batch 8250, loss[loss=0.2035, simple_loss=0.293, pruned_loss=0.05704, over 21659.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3142, pruned_loss=0.08263, over 4272578.50 frames. ], batch size: 247, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:41:42,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=781374.0, ans=0.0 2023-06-21 13:41:42,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=781374.0, ans=0.125 2023-06-21 13:41:53,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=781434.0, ans=0.125 2023-06-21 13:42:06,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.500e+02 2.988e+02 3.537e+02 7.334e+02, threshold=5.975e+02, percent-clipped=1.0 2023-06-21 13:43:33,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=781614.0, ans=0.0 2023-06-21 13:43:44,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=781614.0, ans=0.125 2023-06-21 13:43:44,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=781614.0, ans=10.0 2023-06-21 13:43:46,531 INFO [train.py:996] (3/4) Epoch 5, batch 8300, loss[loss=0.2108, simple_loss=0.2925, pruned_loss=0.06458, over 21656.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3126, pruned_loss=0.08086, over 4274334.71 frames. ], batch size: 230, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:44:03,856 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.29 vs. limit=10.0 2023-06-21 13:45:02,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=781854.0, ans=0.125 2023-06-21 13:46:00,685 INFO [train.py:996] (3/4) Epoch 5, batch 8350, loss[loss=0.1843, simple_loss=0.2695, pruned_loss=0.04953, over 21357.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3097, pruned_loss=0.07837, over 4273454.30 frames. 
], batch size: 131, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:46:38,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=782034.0, ans=0.0 2023-06-21 13:46:45,201 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.563e+02 2.962e+02 3.725e+02 6.327e+02, threshold=5.925e+02, percent-clipped=2.0 2023-06-21 13:48:19,768 INFO [train.py:996] (3/4) Epoch 5, batch 8400, loss[loss=0.208, simple_loss=0.2898, pruned_loss=0.06313, over 21736.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.306, pruned_loss=0.07534, over 4266109.31 frames. ], batch size: 316, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:49:20,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=782394.0, ans=0.1 2023-06-21 13:49:47,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=782454.0, ans=0.0 2023-06-21 13:50:33,091 INFO [train.py:996] (3/4) Epoch 5, batch 8450, loss[loss=0.2056, simple_loss=0.2897, pruned_loss=0.06073, over 21613.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3046, pruned_loss=0.07519, over 4269383.98 frames. ], batch size: 263, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:50:55,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=782634.0, ans=0.09899494936611666 2023-06-21 13:51:01,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.311e+02 2.681e+02 3.207e+02 6.839e+02, threshold=5.362e+02, percent-clipped=1.0 2023-06-21 13:52:31,599 INFO [train.py:996] (3/4) Epoch 5, batch 8500, loss[loss=0.2469, simple_loss=0.3176, pruned_loss=0.08811, over 21802.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3006, pruned_loss=0.07607, over 4265923.24 frames. ], batch size: 124, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:52:57,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=782934.0, ans=0.125 2023-06-21 13:53:16,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=782994.0, ans=0.0 2023-06-21 13:53:56,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=783054.0, ans=0.125 2023-06-21 13:54:37,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783114.0, ans=0.1 2023-06-21 13:54:41,556 INFO [train.py:996] (3/4) Epoch 5, batch 8550, loss[loss=0.2312, simple_loss=0.3086, pruned_loss=0.07693, over 21110.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3063, pruned_loss=0.07887, over 4268780.00 frames. 
], batch size: 143, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:55:31,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.687e+02 3.119e+02 3.831e+02 5.921e+02, threshold=6.237e+02, percent-clipped=6.0 2023-06-21 13:56:31,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783354.0, ans=0.1 2023-06-21 13:56:38,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=783414.0, ans=0.0 2023-06-21 13:57:03,916 INFO [train.py:996] (3/4) Epoch 5, batch 8600, loss[loss=0.2908, simple_loss=0.3621, pruned_loss=0.1097, over 21523.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3126, pruned_loss=0.0809, over 4268118.22 frames. ], batch size: 414, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:57:35,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=783534.0, ans=0.0 2023-06-21 13:58:10,007 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-21 13:59:09,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=783714.0, ans=0.125 2023-06-21 13:59:28,622 INFO [train.py:996] (3/4) Epoch 5, batch 8650, loss[loss=0.2227, simple_loss=0.3135, pruned_loss=0.06597, over 21638.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3195, pruned_loss=0.08208, over 4270033.44 frames. ], batch size: 263, lr: 6.38e-03, grad_scale: 16.0 2023-06-21 13:59:59,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=783834.0, ans=0.125 2023-06-21 14:00:03,579 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.535e+02 2.923e+02 3.243e+02 4.542e+02, threshold=5.846e+02, percent-clipped=0.0 2023-06-21 14:01:27,563 INFO [train.py:996] (3/4) Epoch 5, batch 8700, loss[loss=0.184, simple_loss=0.2782, pruned_loss=0.04494, over 21656.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3111, pruned_loss=0.07827, over 4274457.85 frames. ], batch size: 263, lr: 6.38e-03, grad_scale: 16.0 2023-06-21 14:01:49,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-21 14:02:50,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-21 14:03:24,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=784314.0, ans=15.0 2023-06-21 14:03:24,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-21 14:03:36,330 INFO [train.py:996] (3/4) Epoch 5, batch 8750, loss[loss=0.2223, simple_loss=0.2875, pruned_loss=0.07852, over 21631.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3074, pruned_loss=0.07856, over 4276151.70 frames. 
], batch size: 263, lr: 6.38e-03, grad_scale: 16.0 2023-06-21 14:04:08,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.564e+02 3.026e+02 3.702e+02 5.969e+02, threshold=6.051e+02, percent-clipped=2.0 2023-06-21 14:04:10,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=784434.0, ans=0.05 2023-06-21 14:04:21,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-06-21 14:04:43,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=784494.0, ans=0.2 2023-06-21 14:05:26,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=784554.0, ans=0.0 2023-06-21 14:06:02,652 INFO [train.py:996] (3/4) Epoch 5, batch 8800, loss[loss=0.278, simple_loss=0.3579, pruned_loss=0.09906, over 21765.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.316, pruned_loss=0.08256, over 4283098.86 frames. ], batch size: 332, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:06:28,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=784674.0, ans=0.0 2023-06-21 14:07:01,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=784794.0, ans=0.125 2023-06-21 14:07:41,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=784854.0, ans=0.0 2023-06-21 14:08:15,089 INFO [train.py:996] (3/4) Epoch 5, batch 8850, loss[loss=0.2331, simple_loss=0.3138, pruned_loss=0.07623, over 21663.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.324, pruned_loss=0.08441, over 4279689.44 frames. ], batch size: 332, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:08:30,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=784974.0, ans=0.0 2023-06-21 14:08:55,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=785034.0, ans=0.0 2023-06-21 14:08:56,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.643e+02 2.984e+02 3.583e+02 6.158e+02, threshold=5.968e+02, percent-clipped=1.0 2023-06-21 14:09:12,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=785034.0, ans=0.0 2023-06-21 14:09:12,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=785034.0, ans=0.0 2023-06-21 14:09:21,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=785094.0, ans=0.1 2023-06-21 14:09:50,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=785154.0, ans=0.125 2023-06-21 14:10:42,214 INFO [train.py:996] (3/4) Epoch 5, batch 8900, loss[loss=0.2111, simple_loss=0.2717, pruned_loss=0.07524, over 21377.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3194, pruned_loss=0.08319, over 4279153.17 frames. 
], batch size: 194, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:11:43,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=785394.0, ans=0.0 2023-06-21 14:12:45,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=785514.0, ans=0.125 2023-06-21 14:13:06,495 INFO [train.py:996] (3/4) Epoch 5, batch 8950, loss[loss=0.2528, simple_loss=0.3598, pruned_loss=0.07291, over 20838.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3197, pruned_loss=0.0823, over 4272918.83 frames. ], batch size: 608, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:13:52,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.821e+02 3.330e+02 4.267e+02 7.491e+02, threshold=6.660e+02, percent-clipped=6.0 2023-06-21 14:14:52,207 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:15:10,355 INFO [train.py:996] (3/4) Epoch 5, batch 9000, loss[loss=0.1991, simple_loss=0.2597, pruned_loss=0.06928, over 21356.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3128, pruned_loss=0.0816, over 4278123.52 frames. ], batch size: 131, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:15:10,355 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 14:16:12,921 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2688, simple_loss=0.3596, pruned_loss=0.08904, over 1796401.00 frames. 2023-06-21 14:16:12,924 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-21 14:17:19,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=786054.0, ans=0.125 2023-06-21 14:17:20,317 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-21 14:18:15,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.39 vs. limit=15.0 2023-06-21 14:18:18,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=786174.0, ans=0.125 2023-06-21 14:18:19,062 INFO [train.py:996] (3/4) Epoch 5, batch 9050, loss[loss=0.2937, simple_loss=0.3721, pruned_loss=0.1076, over 21434.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3076, pruned_loss=0.07764, over 4276384.77 frames. ], batch size: 131, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:18:52,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=786234.0, ans=0.09899494936611666 2023-06-21 14:19:10,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.633e+02 2.951e+02 3.520e+02 5.679e+02, threshold=5.902e+02, percent-clipped=0.0 2023-06-21 14:20:10,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=786414.0, ans=0.125 2023-06-21 14:20:50,357 INFO [train.py:996] (3/4) Epoch 5, batch 9100, loss[loss=0.2371, simple_loss=0.3192, pruned_loss=0.07745, over 21194.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3143, pruned_loss=0.08025, over 4273687.62 frames. 
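], batch size: 143, lr: 6.37e-03, grad_scale: 32.0

Throughout these records the reported loss is consistent with a fixed 0.5 weight on the simple loss added to the pruned loss, i.e. loss = 0.5 * simple_loss + pruned_loss. A quick check against the batch-9100 totals just above (assuming that weighting, which the log itself does not state explicitly):

# tot_loss[loss=0.2374, simple_loss=0.3143, pruned_loss=0.08025, ...]
simple_loss, pruned_loss = 0.3143, 0.08025
loss = 0.5 * simple_loss + pruned_loss
print(round(loss, 4))   # 0.2374, matching the logged loss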
2023-06-21 14:20:55,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=786474.0, ans=0.0
2023-06-21 14:21:04,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=786534.0, ans=0.125
2023-06-21 14:21:30,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=786534.0, ans=0.125
2023-06-21 14:22:14,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=786654.0, ans=0.125
2023-06-21 14:23:17,967 INFO [train.py:996] (3/4) Epoch 5, batch 9150, loss[loss=0.2302, simple_loss=0.313, pruned_loss=0.07372, over 21722.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3149, pruned_loss=0.07789, over 4276908.72 frames. ], batch size: 247, lr: 6.37e-03, grad_scale: 32.0
2023-06-21 14:23:53,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.515e+02 3.021e+02 3.547e+02 6.572e+02, threshold=6.043e+02, percent-clipped=1.0
2023-06-21 14:24:41,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=786894.0, ans=0.2
2023-06-21 14:24:46,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0
2023-06-21 14:25:26,997 INFO [train.py:996] (3/4) Epoch 5, batch 9200, loss[loss=0.2562, simple_loss=0.3349, pruned_loss=0.08873, over 21695.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3163, pruned_loss=0.07698, over 4278271.58 frames. ], batch size: 298, lr: 6.37e-03, grad_scale: 32.0
2023-06-21 14:26:27,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=787194.0, ans=0.1
2023-06-21 14:27:41,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=787314.0, ans=0.125
2023-06-21 14:27:45,227 INFO [train.py:996] (3/4) Epoch 5, batch 9250, loss[loss=0.2349, simple_loss=0.3113, pruned_loss=0.0792, over 16079.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3199, pruned_loss=0.08097, over 4271152.80 frames. ], batch size: 60, lr: 6.37e-03, grad_scale: 32.0
2023-06-21 14:27:59,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=787374.0, ans=0.025
2023-06-21 14:28:28,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 2.551e+02 3.044e+02 3.631e+02 5.502e+02, threshold=6.089e+02, percent-clipped=0.0
2023-06-21 14:28:38,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=787494.0, ans=15.0
2023-06-21 14:29:59,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=787674.0, ans=0.125
2023-06-21 14:30:00,551 INFO [train.py:996] (3/4) Epoch 5, batch 9300, loss[loss=0.2056, simple_loss=0.2766, pruned_loss=0.06727, over 21773.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3143, pruned_loss=0.08033, over 4267764.67 frames.
], batch size: 112, lr: 6.37e-03, grad_scale: 16.0 2023-06-21 14:30:22,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=787734.0, ans=0.125 2023-06-21 14:30:56,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=787794.0, ans=0.125 2023-06-21 14:31:17,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2023-06-21 14:31:49,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=787854.0, ans=0.125 2023-06-21 14:32:05,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=787914.0, ans=0.0 2023-06-21 14:32:27,965 INFO [train.py:996] (3/4) Epoch 5, batch 9350, loss[loss=0.2723, simple_loss=0.3469, pruned_loss=0.09881, over 21285.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3222, pruned_loss=0.08191, over 4266900.28 frames. ], batch size: 143, lr: 6.37e-03, grad_scale: 16.0 2023-06-21 14:33:20,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.848e+02 3.302e+02 4.167e+02 7.769e+02, threshold=6.603e+02, percent-clipped=5.0 2023-06-21 14:34:31,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=788214.0, ans=0.125 2023-06-21 14:34:49,348 INFO [train.py:996] (3/4) Epoch 5, batch 9400, loss[loss=0.2073, simple_loss=0.2692, pruned_loss=0.0727, over 21560.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3215, pruned_loss=0.08305, over 4270412.28 frames. ], batch size: 263, lr: 6.37e-03, grad_scale: 16.0 2023-06-21 14:35:07,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=788274.0, ans=0.125 2023-06-21 14:35:38,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=788334.0, ans=0.1 2023-06-21 14:35:40,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=788394.0, ans=0.0 2023-06-21 14:36:42,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-21 14:37:20,230 INFO [train.py:996] (3/4) Epoch 5, batch 9450, loss[loss=0.1864, simple_loss=0.2427, pruned_loss=0.06506, over 21253.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3134, pruned_loss=0.08113, over 4260164.57 frames. ], batch size: 549, lr: 6.36e-03, grad_scale: 16.0 2023-06-21 14:37:20,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=788574.0, ans=0.07 2023-06-21 14:37:27,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=788574.0, ans=0.125 2023-06-21 14:37:43,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. 
limit=22.5 2023-06-21 14:37:45,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.567e+02 2.945e+02 3.778e+02 6.288e+02, threshold=5.890e+02, percent-clipped=0.0 2023-06-21 14:39:23,910 INFO [train.py:996] (3/4) Epoch 5, batch 9500, loss[loss=0.26, simple_loss=0.3283, pruned_loss=0.09579, over 21464.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3061, pruned_loss=0.07899, over 4263472.06 frames. ], batch size: 471, lr: 6.36e-03, grad_scale: 16.0 2023-06-21 14:41:45,119 INFO [train.py:996] (3/4) Epoch 5, batch 9550, loss[loss=0.2576, simple_loss=0.3441, pruned_loss=0.08553, over 21741.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3088, pruned_loss=0.08109, over 4262299.14 frames. ], batch size: 298, lr: 6.36e-03, grad_scale: 16.0 2023-06-21 14:42:07,953 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=15.0 2023-06-21 14:42:09,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 2.687e+02 3.290e+02 3.942e+02 9.010e+02, threshold=6.580e+02, percent-clipped=4.0 2023-06-21 14:42:42,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=789294.0, ans=0.0 2023-06-21 14:44:03,882 INFO [train.py:996] (3/4) Epoch 5, batch 9600, loss[loss=0.2434, simple_loss=0.314, pruned_loss=0.0864, over 21867.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3117, pruned_loss=0.08305, over 4274304.99 frames. ], batch size: 391, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:44:06,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=789474.0, ans=0.1 2023-06-21 14:45:03,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.71 vs. limit=12.0 2023-06-21 14:45:35,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=789654.0, ans=0.0 2023-06-21 14:46:22,015 INFO [train.py:996] (3/4) Epoch 5, batch 9650, loss[loss=0.2796, simple_loss=0.3452, pruned_loss=0.107, over 21769.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3105, pruned_loss=0.08252, over 4272679.47 frames. ], batch size: 441, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:46:46,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-21 14:47:07,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=789834.0, ans=0.125 2023-06-21 14:47:15,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.578e+02 2.899e+02 3.353e+02 5.343e+02, threshold=5.797e+02, percent-clipped=0.0 2023-06-21 14:47:53,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=789894.0, ans=0.0 2023-06-21 14:48:07,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. 
limit=15.0 2023-06-21 14:48:28,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=790014.0, ans=0.125 2023-06-21 14:48:42,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.50 vs. limit=22.5 2023-06-21 14:48:44,246 INFO [train.py:996] (3/4) Epoch 5, batch 9700, loss[loss=0.2033, simple_loss=0.2793, pruned_loss=0.06369, over 21425.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3133, pruned_loss=0.08246, over 4273811.24 frames. ], batch size: 211, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:49:54,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=790194.0, ans=0.0 2023-06-21 14:49:54,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=790194.0, ans=0.2 2023-06-21 14:50:37,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=790314.0, ans=0.2 2023-06-21 14:50:41,659 INFO [train.py:996] (3/4) Epoch 5, batch 9750, loss[loss=0.2523, simple_loss=0.3189, pruned_loss=0.0928, over 15109.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3074, pruned_loss=0.08128, over 4272071.58 frames. ], batch size: 60, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:50:52,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=790374.0, ans=0.0 2023-06-21 14:51:23,563 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:51:25,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.430e+02 2.788e+02 3.260e+02 5.197e+02, threshold=5.575e+02, percent-clipped=0.0 2023-06-21 14:51:38,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0 2023-06-21 14:52:39,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=790614.0, ans=0.5 2023-06-21 14:52:45,329 INFO [train.py:996] (3/4) Epoch 5, batch 9800, loss[loss=0.2405, simple_loss=0.3004, pruned_loss=0.09033, over 21292.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3081, pruned_loss=0.08148, over 4279287.16 frames. ], batch size: 143, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:53:14,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-21 14:53:20,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-21 14:53:50,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=790794.0, ans=0.2 2023-06-21 14:54:51,757 INFO [train.py:996] (3/4) Epoch 5, batch 9850, loss[loss=0.2062, simple_loss=0.2609, pruned_loss=0.07577, over 21261.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3062, pruned_loss=0.08174, over 4260946.80 frames. 
], batch size: 160, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 14:55:26,740 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:55:42,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.409e+02 2.659e+02 3.112e+02 4.458e+02, threshold=5.319e+02, percent-clipped=0.0 2023-06-21 14:56:00,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=791094.0, ans=0.0 2023-06-21 14:56:48,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=791214.0, ans=0.09899494936611666 2023-06-21 14:57:02,496 INFO [train.py:996] (3/4) Epoch 5, batch 9900, loss[loss=0.2807, simple_loss=0.3515, pruned_loss=0.105, over 21798.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3026, pruned_loss=0.08164, over 4265879.82 frames. ], batch size: 118, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 14:57:31,153 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2023-06-21 14:57:58,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-21 14:58:30,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=791454.0, ans=0.2 2023-06-21 14:58:48,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=791514.0, ans=0.1 2023-06-21 14:59:03,121 INFO [train.py:996] (3/4) Epoch 5, batch 9950, loss[loss=0.2264, simple_loss=0.2906, pruned_loss=0.08116, over 21641.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3049, pruned_loss=0.0833, over 4269154.62 frames. ], batch size: 332, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 14:59:31,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=791574.0, ans=0.1 2023-06-21 15:00:12,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.669e+02 3.087e+02 3.517e+02 5.049e+02, threshold=6.174e+02, percent-clipped=0.0 2023-06-21 15:00:52,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=791754.0, ans=0.0 2023-06-21 15:01:19,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=791814.0, ans=0.125 2023-06-21 15:01:24,092 INFO [train.py:996] (3/4) Epoch 5, batch 10000, loss[loss=0.2274, simple_loss=0.3049, pruned_loss=0.07494, over 21559.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3005, pruned_loss=0.08206, over 4262554.87 frames. 
], batch size: 414, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:01:24,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=791874.0, ans=0.125 2023-06-21 15:01:57,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=791874.0, ans=0.0 2023-06-21 15:02:51,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=791994.0, ans=0.125 2023-06-21 15:03:33,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=792174.0, ans=0.0 2023-06-21 15:03:34,545 INFO [train.py:996] (3/4) Epoch 5, batch 10050, loss[loss=0.233, simple_loss=0.2981, pruned_loss=0.08394, over 21519.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3026, pruned_loss=0.08244, over 4262241.80 frames. ], batch size: 441, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:04:41,625 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.500e+02 2.849e+02 3.392e+02 5.365e+02, threshold=5.698e+02, percent-clipped=0.0 2023-06-21 15:04:44,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-21 15:05:32,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-21 15:06:11,542 INFO [train.py:996] (3/4) Epoch 5, batch 10100, loss[loss=0.181, simple_loss=0.2644, pruned_loss=0.04883, over 20786.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2972, pruned_loss=0.07993, over 4254364.85 frames. ], batch size: 608, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:08:30,818 INFO [train.py:996] (3/4) Epoch 5, batch 10150, loss[loss=0.2182, simple_loss=0.2771, pruned_loss=0.07969, over 21830.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3037, pruned_loss=0.08253, over 4264073.68 frames. ], batch size: 98, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:08:31,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=792774.0, ans=0.2 2023-06-21 15:08:31,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=792774.0, ans=0.1 2023-06-21 15:09:16,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 2.586e+02 2.981e+02 3.713e+02 5.514e+02, threshold=5.962e+02, percent-clipped=0.0 2023-06-21 15:10:35,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=793014.0, ans=0.1 2023-06-21 15:10:42,462 INFO [train.py:996] (3/4) Epoch 5, batch 10200, loss[loss=0.2401, simple_loss=0.3194, pruned_loss=0.08038, over 21550.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3035, pruned_loss=0.08029, over 4257174.74 frames. 
], batch size: 389, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:11:14,425 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:11:26,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=793134.0, ans=0.04949747468305833 2023-06-21 15:12:58,744 INFO [train.py:996] (3/4) Epoch 5, batch 10250, loss[loss=0.1993, simple_loss=0.2843, pruned_loss=0.05714, over 21794.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2976, pruned_loss=0.07396, over 4261787.50 frames. ], batch size: 282, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:13:36,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-21 15:13:43,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 2.089e+02 2.610e+02 3.127e+02 4.884e+02, threshold=5.220e+02, percent-clipped=0.0 2023-06-21 15:13:43,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=793434.0, ans=0.125 2023-06-21 15:13:44,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-21 15:15:13,521 INFO [train.py:996] (3/4) Epoch 5, batch 10300, loss[loss=0.238, simple_loss=0.3156, pruned_loss=0.08025, over 21491.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3012, pruned_loss=0.07492, over 4263904.25 frames. ], batch size: 194, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:15:23,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=793674.0, ans=0.1 2023-06-21 15:16:32,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=793854.0, ans=0.04949747468305833 2023-06-21 15:16:56,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=793854.0, ans=0.125 2023-06-21 15:17:23,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=793914.0, ans=0.125 2023-06-21 15:17:24,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=793914.0, ans=0.1 2023-06-21 15:17:26,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-21 15:17:40,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=793974.0, ans=0.0 2023-06-21 15:17:41,405 INFO [train.py:996] (3/4) Epoch 5, batch 10350, loss[loss=0.1707, simple_loss=0.2246, pruned_loss=0.05844, over 21188.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3032, pruned_loss=0.07566, over 4261894.01 frames. ], batch size: 143, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:18:08,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.871e+02 3.499e+02 4.355e+02 9.193e+02, threshold=6.998e+02, percent-clipped=17.0 2023-06-21 15:18:11,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. 
limit=10.0 2023-06-21 15:19:09,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-06-21 15:19:49,805 INFO [train.py:996] (3/4) Epoch 5, batch 10400, loss[loss=0.2384, simple_loss=0.3143, pruned_loss=0.08127, over 21693.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2963, pruned_loss=0.07433, over 4265501.04 frames. ], batch size: 415, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:20:51,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=794394.0, ans=0.0 2023-06-21 15:22:05,754 INFO [train.py:996] (3/4) Epoch 5, batch 10450, loss[loss=0.2792, simple_loss=0.34, pruned_loss=0.1092, over 21404.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3008, pruned_loss=0.07719, over 4269857.90 frames. ], batch size: 549, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:22:28,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=794574.0, ans=0.1 2023-06-21 15:22:45,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=794634.0, ans=0.1 2023-06-21 15:23:10,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=794634.0, ans=0.0 2023-06-21 15:23:13,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.562e+02 2.841e+02 3.622e+02 6.027e+02, threshold=5.681e+02, percent-clipped=0.0 2023-06-21 15:23:53,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=794754.0, ans=0.125 2023-06-21 15:24:05,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=794814.0, ans=0.125 2023-06-21 15:24:20,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=794814.0, ans=0.125 2023-06-21 15:24:25,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=794814.0, ans=0.125 2023-06-21 15:24:27,804 INFO [train.py:996] (3/4) Epoch 5, batch 10500, loss[loss=0.2303, simple_loss=0.2925, pruned_loss=0.08409, over 21471.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3018, pruned_loss=0.07685, over 4262485.73 frames. ], batch size: 441, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:25:57,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=795054.0, ans=0.0 2023-06-21 15:26:05,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=795054.0, ans=0.2 2023-06-21 15:26:36,375 INFO [train.py:996] (3/4) Epoch 5, batch 10550, loss[loss=0.2504, simple_loss=0.3186, pruned_loss=0.09107, over 20736.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2974, pruned_loss=0.07621, over 4251560.64 frames. ], batch size: 607, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:27:31,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.401e+02 2.781e+02 3.246e+02 4.477e+02, threshold=5.561e+02, percent-clipped=0.0 2023-06-21 15:28:39,019 INFO [train.py:996] (3/4) Epoch 5, batch 10600, loss[loss=0.1981, simple_loss=0.2784, pruned_loss=0.05891, over 21670.00 frames. 
], tot_loss[loss=0.2213, simple_loss=0.2928, pruned_loss=0.07488, over 4242148.77 frames. ], batch size: 332, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:29:08,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=795534.0, ans=0.125 2023-06-21 15:29:18,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=795534.0, ans=0.2 2023-06-21 15:29:57,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=795594.0, ans=0.1 2023-06-21 15:30:00,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-21 15:30:01,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=795594.0, ans=0.125 2023-06-21 15:30:40,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=795654.0, ans=0.1 2023-06-21 15:31:11,936 INFO [train.py:996] (3/4) Epoch 5, batch 10650, loss[loss=0.162, simple_loss=0.2393, pruned_loss=0.04231, over 21406.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2963, pruned_loss=0.07425, over 4249921.27 frames. ], batch size: 211, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:31:13,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=795774.0, ans=0.125 2023-06-21 15:31:34,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=795774.0, ans=0.1 2023-06-21 15:31:48,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-21 15:31:49,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=795834.0, ans=0.1 2023-06-21 15:31:50,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=795834.0, ans=0.125 2023-06-21 15:32:03,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-21 15:32:11,559 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.303e+02 2.833e+02 3.261e+02 4.754e+02, threshold=5.666e+02, percent-clipped=0.0 2023-06-21 15:32:24,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=795894.0, ans=0.0 2023-06-21 15:32:25,558 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-21 15:32:34,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-21 15:33:24,933 INFO [train.py:996] (3/4) Epoch 5, batch 10700, loss[loss=0.2666, simple_loss=0.3376, pruned_loss=0.09783, over 21514.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2945, pruned_loss=0.0743, over 4253463.68 frames. 
], batch size: 131, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:34:25,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.78 vs. limit=10.0 2023-06-21 15:35:07,011 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:35:54,034 INFO [train.py:996] (3/4) Epoch 5, batch 10750, loss[loss=0.2382, simple_loss=0.3269, pruned_loss=0.07471, over 21602.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3047, pruned_loss=0.07873, over 4261458.31 frames. ], batch size: 263, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:36:04,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=796374.0, ans=0.0 2023-06-21 15:36:39,166 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.624e+02 2.949e+02 3.817e+02 5.681e+02, threshold=5.899e+02, percent-clipped=1.0 2023-06-21 15:37:38,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=796554.0, ans=0.2 2023-06-21 15:38:00,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=796614.0, ans=0.05 2023-06-21 15:38:28,030 INFO [train.py:996] (3/4) Epoch 5, batch 10800, loss[loss=0.2686, simple_loss=0.3364, pruned_loss=0.1003, over 21570.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3101, pruned_loss=0.07953, over 4260640.15 frames. ], batch size: 389, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:40:04,931 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:40:07,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=796854.0, ans=0.0 2023-06-21 15:40:12,576 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0 2023-06-21 15:40:35,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=796914.0, ans=0.2 2023-06-21 15:40:43,271 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:40:44,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=796914.0, ans=0.125 2023-06-21 15:40:48,675 INFO [train.py:996] (3/4) Epoch 5, batch 10850, loss[loss=0.2145, simple_loss=0.2808, pruned_loss=0.07408, over 21813.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3114, pruned_loss=0.07998, over 4264436.02 frames. 
], batch size: 98, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:40:50,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=796974.0, ans=0.125 2023-06-21 15:41:05,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=796974.0, ans=0.125 2023-06-21 15:41:10,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=797034.0, ans=0.0 2023-06-21 15:41:17,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=797034.0, ans=0.125 2023-06-21 15:41:20,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-21 15:41:27,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.616e+02 2.809e+02 3.256e+02 4.598e+02, threshold=5.618e+02, percent-clipped=0.0 2023-06-21 15:42:32,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=797154.0, ans=0.5 2023-06-21 15:42:39,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-21 15:42:50,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2023-06-21 15:42:56,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=797214.0, ans=0.02 2023-06-21 15:43:00,599 INFO [train.py:996] (3/4) Epoch 5, batch 10900, loss[loss=0.2133, simple_loss=0.3142, pruned_loss=0.05619, over 21778.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3051, pruned_loss=0.07829, over 4270578.63 frames. ], batch size: 351, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:43:22,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=797334.0, ans=0.2 2023-06-21 15:44:02,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=797394.0, ans=0.125 2023-06-21 15:44:30,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=797454.0, ans=0.1 2023-06-21 15:44:34,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=797454.0, ans=0.125 2023-06-21 15:45:02,683 INFO [train.py:996] (3/4) Epoch 5, batch 10950, loss[loss=0.2077, simple_loss=0.2711, pruned_loss=0.07211, over 21626.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3002, pruned_loss=0.07602, over 4266184.12 frames. 
], batch size: 298, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:45:19,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=797574.0, ans=0.2 2023-06-21 15:45:28,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=797634.0, ans=0.025 2023-06-21 15:45:30,712 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:45:44,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.495e+02 2.960e+02 3.280e+02 5.814e+02, threshold=5.920e+02, percent-clipped=2.0 2023-06-21 15:46:45,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=10.0 2023-06-21 15:47:00,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=797814.0, ans=0.0 2023-06-21 15:47:13,638 INFO [train.py:996] (3/4) Epoch 5, batch 11000, loss[loss=0.2023, simple_loss=0.279, pruned_loss=0.06283, over 21709.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2984, pruned_loss=0.07647, over 4277487.93 frames. ], batch size: 263, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:47:14,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=797874.0, ans=0.0 2023-06-21 15:47:59,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=797934.0, ans=0.2 2023-06-21 15:48:01,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=10.0 2023-06-21 15:48:14,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=797994.0, ans=0.0 2023-06-21 15:48:51,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=798054.0, ans=0.1 2023-06-21 15:49:29,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=798114.0, ans=0.125 2023-06-21 15:49:31,656 INFO [train.py:996] (3/4) Epoch 5, batch 11050, loss[loss=0.2209, simple_loss=0.2797, pruned_loss=0.081, over 21841.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2949, pruned_loss=0.07665, over 4270159.65 frames. ], batch size: 98, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:49:45,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=798174.0, ans=0.125 2023-06-21 15:49:46,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. 
limit=12.0 2023-06-21 15:50:07,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=798234.0, ans=0.2 2023-06-21 15:50:25,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.567e+02 2.755e+02 3.188e+02 5.366e+02, threshold=5.510e+02, percent-clipped=0.0 2023-06-21 15:51:43,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=798474.0, ans=0.125 2023-06-21 15:51:44,990 INFO [train.py:996] (3/4) Epoch 5, batch 11100, loss[loss=0.2114, simple_loss=0.2782, pruned_loss=0.07235, over 21817.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2943, pruned_loss=0.07705, over 4262837.77 frames. ], batch size: 98, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:52:39,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=798594.0, ans=0.125 2023-06-21 15:52:49,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=798594.0, ans=0.125 2023-06-21 15:53:18,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=798654.0, ans=0.0 2023-06-21 15:53:18,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=798654.0, ans=0.05 2023-06-21 15:53:50,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=798714.0, ans=0.1 2023-06-21 15:53:53,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=798714.0, ans=0.0 2023-06-21 15:53:59,001 INFO [train.py:996] (3/4) Epoch 5, batch 11150, loss[loss=0.2273, simple_loss=0.311, pruned_loss=0.07177, over 21786.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2927, pruned_loss=0.07742, over 4261450.61 frames. ], batch size: 371, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 15:54:41,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=798834.0, ans=0.1 2023-06-21 15:54:41,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=798834.0, ans=0.125 2023-06-21 15:54:52,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.453e+02 2.788e+02 3.224e+02 5.463e+02, threshold=5.576e+02, percent-clipped=0.0 2023-06-21 15:54:58,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.80 vs. 
limit=10.0 2023-06-21 15:55:34,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=798954.0, ans=0.125 2023-06-21 15:55:47,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=798954.0, ans=0.125 2023-06-21 15:56:03,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=799014.0, ans=0.0 2023-06-21 15:56:10,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=799014.0, ans=0.2 2023-06-21 15:56:13,273 INFO [train.py:996] (3/4) Epoch 5, batch 11200, loss[loss=0.2207, simple_loss=0.2894, pruned_loss=0.07599, over 21800.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2913, pruned_loss=0.07667, over 4263587.62 frames. ], batch size: 352, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 15:56:46,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=799134.0, ans=0.125 2023-06-21 15:57:04,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.47 vs. limit=22.5 2023-06-21 15:57:25,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=799194.0, ans=0.0 2023-06-21 15:58:00,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=799254.0, ans=0.2 2023-06-21 15:58:16,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=799314.0, ans=0.07 2023-06-21 15:58:25,662 INFO [train.py:996] (3/4) Epoch 5, batch 11250, loss[loss=0.2441, simple_loss=0.313, pruned_loss=0.08759, over 21438.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2921, pruned_loss=0.07642, over 4267920.24 frames. ], batch size: 473, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 15:58:36,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=799374.0, ans=0.1 2023-06-21 15:59:15,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.425e+02 2.726e+02 3.130e+02 5.032e+02, threshold=5.452e+02, percent-clipped=0.0 2023-06-21 15:59:18,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=799494.0, ans=0.2 2023-06-21 15:59:56,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=799554.0, ans=0.0 2023-06-21 16:00:26,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=799614.0, ans=0.0 2023-06-21 16:00:34,386 INFO [train.py:996] (3/4) Epoch 5, batch 11300, loss[loss=0.2097, simple_loss=0.289, pruned_loss=0.06519, over 21830.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.295, pruned_loss=0.07783, over 4274079.05 frames. ], batch size: 414, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 16:01:04,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=799674.0, ans=0.0 2023-06-21 16:01:41,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. 
limit=15.0 2023-06-21 16:02:50,115 INFO [train.py:996] (3/4) Epoch 5, batch 11350, loss[loss=0.214, simple_loss=0.2845, pruned_loss=0.07174, over 21848.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.297, pruned_loss=0.07796, over 4271104.72 frames. ], batch size: 107, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:03:44,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=800034.0, ans=0.125 2023-06-21 16:03:49,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=800034.0, ans=0.125 2023-06-21 16:03:52,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.490e+02 2.826e+02 3.230e+02 4.921e+02, threshold=5.651e+02, percent-clipped=0.0 2023-06-21 16:04:46,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800154.0, ans=0.1 2023-06-21 16:04:46,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=800154.0, ans=0.125 2023-06-21 16:05:04,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=800214.0, ans=0.2 2023-06-21 16:05:18,314 INFO [train.py:996] (3/4) Epoch 5, batch 11400, loss[loss=0.2682, simple_loss=0.3368, pruned_loss=0.09981, over 21353.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3019, pruned_loss=0.08026, over 4274717.19 frames. ], batch size: 131, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:06:35,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=800394.0, ans=0.0 2023-06-21 16:07:37,422 INFO [train.py:996] (3/4) Epoch 5, batch 11450, loss[loss=0.2491, simple_loss=0.3241, pruned_loss=0.08707, over 21862.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3023, pruned_loss=0.07901, over 4273531.43 frames. ], batch size: 371, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:08:44,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.457e+02 2.800e+02 3.196e+02 5.475e+02, threshold=5.600e+02, percent-clipped=0.0 2023-06-21 16:08:55,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=800694.0, ans=0.2 2023-06-21 16:08:58,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=800694.0, ans=0.0 2023-06-21 16:09:50,527 INFO [train.py:996] (3/4) Epoch 5, batch 11500, loss[loss=0.2061, simple_loss=0.2979, pruned_loss=0.05709, over 21608.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3067, pruned_loss=0.08055, over 4277989.03 frames. ], batch size: 263, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:10:23,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800874.0, ans=0.1 2023-06-21 16:11:57,053 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:12:24,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-21 16:12:26,606 INFO [train.py:996] (3/4) Epoch 5, batch 11550, loss[loss=0.2995, simple_loss=0.3866, pruned_loss=0.1062, over 21644.00 frames. 
], tot_loss[loss=0.2361, simple_loss=0.3118, pruned_loss=0.08019, over 4273252.51 frames. ], batch size: 441, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:13:41,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.641e+02 3.057e+02 3.432e+02 5.620e+02, threshold=6.114e+02, percent-clipped=1.0 2023-06-21 16:13:51,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=801294.0, ans=0.125 2023-06-21 16:14:50,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-21 16:15:01,401 INFO [train.py:996] (3/4) Epoch 5, batch 11600, loss[loss=0.2535, simple_loss=0.3407, pruned_loss=0.0831, over 21360.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3278, pruned_loss=0.08261, over 4266558.15 frames. ], batch size: 194, lr: 6.31e-03, grad_scale: 32.0 2023-06-21 16:16:04,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=801594.0, ans=0.125 2023-06-21 16:16:29,592 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:16:41,288 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-21 16:17:18,047 INFO [train.py:996] (3/4) Epoch 5, batch 11650, loss[loss=0.2268, simple_loss=0.3015, pruned_loss=0.07607, over 21158.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.333, pruned_loss=0.08326, over 4258697.42 frames. ], batch size: 143, lr: 6.31e-03, grad_scale: 32.0 2023-06-21 16:17:43,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=801774.0, ans=0.0 2023-06-21 16:18:02,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=801774.0, ans=0.125 2023-06-21 16:18:09,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0 2023-06-21 16:18:27,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.640e+02 3.062e+02 3.776e+02 6.699e+02, threshold=6.124e+02, percent-clipped=2.0 2023-06-21 16:19:09,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=801954.0, ans=0.125 2023-06-21 16:19:48,623 INFO [train.py:996] (3/4) Epoch 5, batch 11700, loss[loss=0.2447, simple_loss=0.2882, pruned_loss=0.1006, over 21435.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3233, pruned_loss=0.08216, over 4259352.79 frames. 
], batch size: 441, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:20:08,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=802074.0, ans=0.2 2023-06-21 16:20:30,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=802134.0, ans=0.125 2023-06-21 16:21:49,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=802314.0, ans=0.125 2023-06-21 16:21:59,043 INFO [train.py:996] (3/4) Epoch 5, batch 11750, loss[loss=0.2067, simple_loss=0.2705, pruned_loss=0.0714, over 21881.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3147, pruned_loss=0.08175, over 4263527.71 frames. ], batch size: 98, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:22:47,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.493e+02 2.846e+02 3.170e+02 4.478e+02, threshold=5.693e+02, percent-clipped=0.0 2023-06-21 16:23:38,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=802554.0, ans=0.125 2023-06-21 16:24:20,705 INFO [train.py:996] (3/4) Epoch 5, batch 11800, loss[loss=0.2363, simple_loss=0.317, pruned_loss=0.0778, over 21591.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3163, pruned_loss=0.08352, over 4267662.45 frames. ], batch size: 230, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:25:32,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-21 16:26:30,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=802974.0, ans=0.0 2023-06-21 16:26:30,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=802974.0, ans=0.0 2023-06-21 16:26:37,616 INFO [train.py:996] (3/4) Epoch 5, batch 11850, loss[loss=0.2096, simple_loss=0.3101, pruned_loss=0.05449, over 21808.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.318, pruned_loss=0.08234, over 4275582.21 frames. ], batch size: 282, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:26:38,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=802974.0, ans=0.07 2023-06-21 16:26:39,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=802974.0, ans=0.125 2023-06-21 16:27:03,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=803034.0, ans=0.1 2023-06-21 16:27:35,197 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.385e+02 2.718e+02 3.148e+02 5.334e+02, threshold=5.436e+02, percent-clipped=0.0 2023-06-21 16:28:03,733 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-21 16:29:11,174 INFO [train.py:996] (3/4) Epoch 5, batch 11900, loss[loss=0.2286, simple_loss=0.3199, pruned_loss=0.06862, over 21618.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3166, pruned_loss=0.07973, over 4273924.83 frames. 
], batch size: 414, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:29:35,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=803334.0, ans=0.2 2023-06-21 16:29:38,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=803334.0, ans=0.125 2023-06-21 16:29:50,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=803334.0, ans=0.0 2023-06-21 16:29:51,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=803334.0, ans=0.125 2023-06-21 16:30:50,291 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-06-21 16:31:24,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=15.0 2023-06-21 16:31:26,065 INFO [train.py:996] (3/4) Epoch 5, batch 11950, loss[loss=0.2889, simple_loss=0.4298, pruned_loss=0.074, over 20753.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3184, pruned_loss=0.07652, over 4266550.38 frames. ], batch size: 607, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:31:48,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=803574.0, ans=0.125 2023-06-21 16:31:56,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-21 16:32:21,843 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.301e+02 2.718e+02 3.108e+02 3.993e+02, threshold=5.436e+02, percent-clipped=0.0 2023-06-21 16:32:38,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-21 16:32:41,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=803694.0, ans=0.125 2023-06-21 16:32:48,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-21 16:33:24,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=803814.0, ans=0.0 2023-06-21 16:33:39,362 INFO [train.py:996] (3/4) Epoch 5, batch 12000, loss[loss=0.2008, simple_loss=0.2644, pruned_loss=0.0686, over 21497.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3138, pruned_loss=0.07479, over 4265972.75 frames. ], batch size: 195, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:33:39,362 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 16:34:34,985 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2672, simple_loss=0.3583, pruned_loss=0.08803, over 1796401.00 frames. 
2023-06-21 16:34:34,987 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-21 16:34:39,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=803874.0, ans=0.09899494936611666 2023-06-21 16:34:46,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=803874.0, ans=0.125 2023-06-21 16:34:50,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-21 16:35:36,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=804054.0, ans=0.125 2023-06-21 16:36:25,262 INFO [train.py:996] (3/4) Epoch 5, batch 12050, loss[loss=0.2882, simple_loss=0.3331, pruned_loss=0.1216, over 21793.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3103, pruned_loss=0.07776, over 4269696.11 frames. ], batch size: 508, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:36:30,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=804174.0, ans=0.2 2023-06-21 16:36:38,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=804174.0, ans=0.125 2023-06-21 16:37:03,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=804234.0, ans=0.125 2023-06-21 16:37:26,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.690e+02 3.066e+02 3.586e+02 5.948e+02, threshold=6.132e+02, percent-clipped=2.0 2023-06-21 16:37:45,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=804294.0, ans=0.0 2023-06-21 16:38:29,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=804414.0, ans=0.125 2023-06-21 16:38:41,152 INFO [train.py:996] (3/4) Epoch 5, batch 12100, loss[loss=0.2311, simple_loss=0.2913, pruned_loss=0.08543, over 20650.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3144, pruned_loss=0.08257, over 4269959.35 frames. ], batch size: 607, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:40:20,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=804594.0, ans=0.015 2023-06-21 16:40:36,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-21 16:40:39,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=804654.0, ans=0.2 2023-06-21 16:41:07,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=804714.0, ans=0.1 2023-06-21 16:41:30,846 INFO [train.py:996] (3/4) Epoch 5, batch 12150, loss[loss=0.2426, simple_loss=0.3565, pruned_loss=0.06437, over 19828.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3173, pruned_loss=0.08176, over 4266964.44 frames. 
], batch size: 702, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:41:55,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=804834.0, ans=0.1 2023-06-21 16:42:01,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=804834.0, ans=0.125 2023-06-21 16:42:14,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=804834.0, ans=0.0 2023-06-21 16:42:18,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=804894.0, ans=0.0 2023-06-21 16:42:19,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.609e+02 3.178e+02 3.769e+02 6.443e+02, threshold=6.356e+02, percent-clipped=2.0 2023-06-21 16:43:41,192 INFO [train.py:996] (3/4) Epoch 5, batch 12200, loss[loss=0.1978, simple_loss=0.2651, pruned_loss=0.06524, over 21731.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3127, pruned_loss=0.08005, over 4264258.78 frames. ], batch size: 316, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:44:21,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-21 16:45:53,347 INFO [train.py:996] (3/4) Epoch 5, batch 12250, loss[loss=0.2164, simple_loss=0.2908, pruned_loss=0.07094, over 21406.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3039, pruned_loss=0.0763, over 4265358.95 frames. ], batch size: 471, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:45:58,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=805374.0, ans=0.1 2023-06-21 16:46:04,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=805374.0, ans=0.125 2023-06-21 16:46:35,363 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 2.451e+02 2.848e+02 3.373e+02 5.263e+02, threshold=5.696e+02, percent-clipped=0.0 2023-06-21 16:47:35,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=805614.0, ans=0.125 2023-06-21 16:47:51,379 INFO [train.py:996] (3/4) Epoch 5, batch 12300, loss[loss=0.2205, simple_loss=0.3108, pruned_loss=0.0651, over 21738.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2952, pruned_loss=0.07111, over 4246318.57 frames. ], batch size: 247, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:50:30,864 INFO [train.py:996] (3/4) Epoch 5, batch 12350, loss[loss=0.2086, simple_loss=0.2946, pruned_loss=0.06125, over 21379.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2998, pruned_loss=0.07207, over 4254538.30 frames. ], batch size: 211, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:50:39,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.00 vs. 
limit=10.0 2023-06-21 16:50:49,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=805974.0, ans=0.0 2023-06-21 16:50:50,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=806034.0, ans=0.125 2023-06-21 16:51:10,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=806094.0, ans=0.125 2023-06-21 16:51:10,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 2.374e+02 2.755e+02 3.213e+02 5.680e+02, threshold=5.510e+02, percent-clipped=0.0 2023-06-21 16:51:17,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=12.0 2023-06-21 16:52:17,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=806214.0, ans=0.0 2023-06-21 16:52:32,117 INFO [train.py:996] (3/4) Epoch 5, batch 12400, loss[loss=0.2884, simple_loss=0.3326, pruned_loss=0.1221, over 21787.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3038, pruned_loss=0.07645, over 4272467.45 frames. ], batch size: 508, lr: 6.29e-03, grad_scale: 32.0 2023-06-21 16:52:33,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.91 vs. limit=22.5 2023-06-21 16:53:47,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=806394.0, ans=0.125 2023-06-21 16:54:24,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-21 16:54:45,673 INFO [train.py:996] (3/4) Epoch 5, batch 12450, loss[loss=0.2387, simple_loss=0.3545, pruned_loss=0.0615, over 20787.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3077, pruned_loss=0.07937, over 4275527.96 frames. ], batch size: 607, lr: 6.29e-03, grad_scale: 32.0 2023-06-21 16:55:19,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=806634.0, ans=0.2 2023-06-21 16:55:21,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=806634.0, ans=0.1 2023-06-21 16:55:57,447 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.571e+02 2.934e+02 3.546e+02 5.506e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-21 16:56:12,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=806694.0, ans=0.2 2023-06-21 16:56:33,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=806754.0, ans=0.125 2023-06-21 16:56:42,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=806754.0, ans=0.0 2023-06-21 16:56:44,153 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:56:44,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.29 vs. 
limit=22.5 2023-06-21 16:57:17,117 INFO [train.py:996] (3/4) Epoch 5, batch 12500, loss[loss=0.2889, simple_loss=0.3695, pruned_loss=0.1041, over 21796.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3186, pruned_loss=0.08251, over 4276196.22 frames. ], batch size: 118, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 16:59:45,086 INFO [train.py:996] (3/4) Epoch 5, batch 12550, loss[loss=0.2118, simple_loss=0.2521, pruned_loss=0.08575, over 20120.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.323, pruned_loss=0.0856, over 4277936.74 frames. ], batch size: 703, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:00:39,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.629e+02 2.998e+02 3.510e+02 7.002e+02, threshold=5.996e+02, percent-clipped=1.0 2023-06-21 17:00:40,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=807294.0, ans=0.125 2023-06-21 17:01:34,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=807414.0, ans=0.125 2023-06-21 17:01:52,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=807414.0, ans=0.04949747468305833 2023-06-21 17:01:56,445 INFO [train.py:996] (3/4) Epoch 5, batch 12600, loss[loss=0.2121, simple_loss=0.3021, pruned_loss=0.06103, over 21685.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3217, pruned_loss=0.08346, over 4267858.31 frames. ], batch size: 351, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:02:33,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=807534.0, ans=0.0 2023-06-21 17:03:02,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=15.0 2023-06-21 17:03:34,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-21 17:03:36,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=807654.0, ans=0.125 2023-06-21 17:03:47,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. limit=10.0 2023-06-21 17:04:01,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=807714.0, ans=15.0 2023-06-21 17:04:08,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=807714.0, ans=0.125 2023-06-21 17:04:09,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=807774.0, ans=0.1 2023-06-21 17:04:10,528 INFO [train.py:996] (3/4) Epoch 5, batch 12650, loss[loss=0.2347, simple_loss=0.306, pruned_loss=0.08166, over 21892.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3153, pruned_loss=0.0796, over 4272993.64 frames. ], batch size: 124, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:04:26,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. 
limit=6.0 2023-06-21 17:04:28,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=807774.0, ans=0.05 2023-06-21 17:05:09,545 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.417e+02 2.707e+02 3.120e+02 6.136e+02, threshold=5.414e+02, percent-clipped=1.0 2023-06-21 17:05:32,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=807894.0, ans=0.0 2023-06-21 17:06:32,810 INFO [train.py:996] (3/4) Epoch 5, batch 12700, loss[loss=0.2701, simple_loss=0.3365, pruned_loss=0.1019, over 21234.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3136, pruned_loss=0.08184, over 4281674.94 frames. ], batch size: 143, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:07:02,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=808134.0, ans=0.125 2023-06-21 17:07:36,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=808194.0, ans=0.125 2023-06-21 17:08:31,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=808314.0, ans=0.125 2023-06-21 17:08:45,680 INFO [train.py:996] (3/4) Epoch 5, batch 12750, loss[loss=0.2106, simple_loss=0.2887, pruned_loss=0.06626, over 21532.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3151, pruned_loss=0.08207, over 4278298.68 frames. ], batch size: 212, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:08:58,395 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-21 17:09:43,826 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.507e+02 2.930e+02 3.517e+02 6.177e+02, threshold=5.859e+02, percent-clipped=3.0 2023-06-21 17:09:53,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=808494.0, ans=0.125 2023-06-21 17:10:51,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=808614.0, ans=0.0 2023-06-21 17:10:55,726 INFO [train.py:996] (3/4) Epoch 5, batch 12800, loss[loss=0.2624, simple_loss=0.3339, pruned_loss=0.09547, over 21240.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3152, pruned_loss=0.08281, over 4279459.33 frames. ], batch size: 143, lr: 6.29e-03, grad_scale: 32.0 2023-06-21 17:11:03,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=808674.0, ans=0.0 2023-06-21 17:11:36,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-06-21 17:11:39,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-21 17:12:10,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=808794.0, ans=0.0 2023-06-21 17:12:19,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. 
limit=22.5 2023-06-21 17:12:33,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=808854.0, ans=0.04949747468305833 2023-06-21 17:13:19,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=808914.0, ans=0.125 2023-06-21 17:13:22,386 INFO [train.py:996] (3/4) Epoch 5, batch 12850, loss[loss=0.2305, simple_loss=0.3073, pruned_loss=0.07689, over 21103.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3176, pruned_loss=0.08444, over 4285701.44 frames. ], batch size: 143, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:14:34,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 2.342e+02 2.616e+02 2.870e+02 3.698e+02, threshold=5.233e+02, percent-clipped=0.0 2023-06-21 17:14:35,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=809094.0, ans=0.125 2023-06-21 17:14:46,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=809094.0, ans=0.125 2023-06-21 17:15:26,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=809214.0, ans=0.0 2023-06-21 17:15:48,063 INFO [train.py:996] (3/4) Epoch 5, batch 12900, loss[loss=0.1707, simple_loss=0.2505, pruned_loss=0.04546, over 21733.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3157, pruned_loss=0.08145, over 4284083.54 frames. ], batch size: 124, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:16:35,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=809334.0, ans=0.2 2023-06-21 17:16:47,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-21 17:18:10,349 INFO [train.py:996] (3/4) Epoch 5, batch 12950, loss[loss=0.2317, simple_loss=0.3147, pruned_loss=0.07433, over 21608.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.314, pruned_loss=0.07932, over 4280206.12 frames. ], batch size: 263, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:19:10,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.360e+02 2.684e+02 3.163e+02 5.049e+02, threshold=5.368e+02, percent-clipped=0.0 2023-06-21 17:19:16,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=809694.0, ans=0.0 2023-06-21 17:20:17,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=809814.0, ans=0.125 2023-06-21 17:20:25,656 INFO [train.py:996] (3/4) Epoch 5, batch 13000, loss[loss=0.1924, simple_loss=0.2762, pruned_loss=0.05433, over 21705.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3139, pruned_loss=0.07955, over 4282383.42 frames. 
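One relationship can be read directly off these entries: the reported loss is always 0.5 * simple_loss + pruned_loss, i.e. a fixed 0.5 weight on the simple transducer loss added to the full pruned loss. A quick arithmetic check against the batch 13000 totals above:

# Plain arithmetic from the log entry above; the same relation holds for
# every loss line in this section.
simple_loss, pruned_loss = 0.3139, 0.07955
print(round(0.5 * simple_loss + pruned_loss, 4))  # 0.2365 == tot_loss[loss=0.2365, ...]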
], batch size: 298, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:21:21,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=809934.0, ans=0.125 2023-06-21 17:21:21,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=809934.0, ans=0.0 2023-06-21 17:21:35,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=809994.0, ans=0.125 2023-06-21 17:21:38,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=809994.0, ans=0.0 2023-06-21 17:21:38,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=809994.0, ans=0.125 2023-06-21 17:22:15,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=15.0 2023-06-21 17:22:56,637 INFO [train.py:996] (3/4) Epoch 5, batch 13050, loss[loss=0.2212, simple_loss=0.2862, pruned_loss=0.07813, over 21795.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3093, pruned_loss=0.07673, over 4275170.15 frames. ], batch size: 247, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:23:17,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=810174.0, ans=0.125 2023-06-21 17:23:21,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-21 17:23:25,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=810234.0, ans=0.1 2023-06-21 17:23:39,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.464e+02 2.848e+02 3.249e+02 5.080e+02, threshold=5.696e+02, percent-clipped=0.0 2023-06-21 17:24:06,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=810294.0, ans=0.125 2023-06-21 17:24:50,213 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:25:16,442 INFO [train.py:996] (3/4) Epoch 5, batch 13100, loss[loss=0.2389, simple_loss=0.3218, pruned_loss=0.07798, over 21735.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3109, pruned_loss=0.07713, over 4272373.83 frames. ], batch size: 332, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:25:31,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-21 17:26:35,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-21 17:26:52,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=810654.0, ans=0.125 2023-06-21 17:27:24,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.95 vs. limit=15.0 2023-06-21 17:27:37,121 INFO [train.py:996] (3/4) Epoch 5, batch 13150, loss[loss=0.1746, simple_loss=0.255, pruned_loss=0.04706, over 21593.00 frames. 
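The many "ScheduledFloat: name=..., batch_count=..., ans=..." entries report module hyperparameters (skip rates, balancer probabilities, dropout values) whose current value is a function of batch_count. A minimal sketch of such a schedule as a piecewise-linear function of batch_count follows; the breakpoints and values are illustrative assumptions, not the recipe's actual schedules:

import bisect

class PiecewiseLinearSchedule:
    """Value interpolated linearly between (batch_count, value) breakpoints."""
    def __init__(self, *points):
        self.xs = [p[0] for p in points]  # breakpoints, ascending
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count) - 1
        x0, x1 = self.xs[i], self.xs[i + 1]
        y0, y1 = self.ys[i], self.ys[i + 1]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a skip rate that decays from 0.5 to 0.0 over the first 16k batches:
skip_rate = PiecewiseLinearSchedule((0.0, 0.5), (4000.0, 0.25), (16000.0, 0.0))
print(skip_rate(809934.0))  # 0.0 -- fully decayed this deep into training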
], tot_loss[loss=0.2352, simple_loss=0.3116, pruned_loss=0.07943, over 4275179.84 frames. ], batch size: 230, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:27:41,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.10 vs. limit=5.0 2023-06-21 17:27:41,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=810774.0, ans=0.125 2023-06-21 17:28:39,919 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.725e+02 3.119e+02 3.667e+02 5.511e+02, threshold=6.238e+02, percent-clipped=0.0 2023-06-21 17:29:17,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=810954.0, ans=0.2 2023-06-21 17:29:47,003 INFO [train.py:996] (3/4) Epoch 5, batch 13200, loss[loss=0.2341, simple_loss=0.302, pruned_loss=0.0831, over 22010.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3111, pruned_loss=0.07926, over 4275244.62 frames. ], batch size: 317, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:31:09,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=811194.0, ans=0.125 2023-06-21 17:31:39,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-21 17:31:58,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=811314.0, ans=0.125 2023-06-21 17:32:01,966 INFO [train.py:996] (3/4) Epoch 5, batch 13250, loss[loss=0.2528, simple_loss=0.3247, pruned_loss=0.09041, over 21798.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3116, pruned_loss=0.08123, over 4276486.46 frames. ], batch size: 441, lr: 6.27e-03, grad_scale: 32.0 2023-06-21 17:33:07,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=811434.0, ans=0.0 2023-06-21 17:33:16,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.617e+02 2.907e+02 3.504e+02 5.770e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-21 17:33:32,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=811494.0, ans=0.125 2023-06-21 17:33:42,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=811494.0, ans=0.1 2023-06-21 17:33:52,169 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.80 vs. limit=15.0 2023-06-21 17:33:59,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-21 17:34:32,727 INFO [train.py:996] (3/4) Epoch 5, batch 13300, loss[loss=0.2909, simple_loss=0.362, pruned_loss=0.1099, over 21504.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3143, pruned_loss=0.08098, over 4277401.60 frames. 
], batch size: 471, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:34:48,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=811674.0, ans=0.125 2023-06-21 17:34:53,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.09 vs. limit=15.0 2023-06-21 17:36:03,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=811854.0, ans=0.125 2023-06-21 17:36:15,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=811854.0, ans=0.2 2023-06-21 17:36:54,464 INFO [train.py:996] (3/4) Epoch 5, batch 13350, loss[loss=0.2499, simple_loss=0.3335, pruned_loss=0.08316, over 21623.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3182, pruned_loss=0.08316, over 4275111.99 frames. ], batch size: 263, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:37:05,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=811974.0, ans=0.0 2023-06-21 17:38:06,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.647e+02 2.976e+02 3.366e+02 5.108e+02, threshold=5.953e+02, percent-clipped=0.0 2023-06-21 17:38:13,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=812094.0, ans=0.125 2023-06-21 17:38:25,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-21 17:39:15,665 INFO [train.py:996] (3/4) Epoch 5, batch 13400, loss[loss=0.2449, simple_loss=0.2972, pruned_loss=0.09625, over 21865.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3186, pruned_loss=0.08462, over 4275425.28 frames. ], batch size: 98, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:39:31,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=812274.0, ans=0.0 2023-06-21 17:40:16,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=812334.0, ans=0.125 2023-06-21 17:40:33,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=812394.0, ans=0.125 2023-06-21 17:40:58,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=812454.0, ans=0.0 2023-06-21 17:41:42,771 INFO [train.py:996] (3/4) Epoch 5, batch 13450, loss[loss=0.2761, simple_loss=0.3366, pruned_loss=0.1078, over 21372.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3196, pruned_loss=0.08715, over 4265481.30 frames. 
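The periodic "[optim.py:471] Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." entries summarize gradient clipping driven by the recent distribution of gradient norms. A hedged sketch of the idea (buffer recent total grad norms, set the threshold to clipping_scale times their median, clip anything above it); the buffer size and reporting details are assumptions, not a transcription of optim.py:

import torch

class QuartileClipper:
    def __init__(self, clipping_scale=2.0, history=128):
        self.clipping_scale = clipping_scale
        self.history = history
        self.norms = []   # recent total gradient norms
        self.clipped = 0
        self.steps = 0

    def clip_(self, params):
        params = [p for p in params if p.grad is not None]
        norm = torch.sqrt(sum((p.grad.detach() ** 2).sum() for p in params))
        self.norms = (self.norms + [float(norm)])[-self.history:]
        q = torch.quantile(torch.tensor(self.norms),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * float(q[2])   # scale * median
        self.steps += 1
        if float(norm) > threshold > 0:
            self.clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        print(f"grad-norm quartiles {[f'{v:.3e}' for v in q.tolist()]}, "
              f"threshold={threshold:.3e}, "
              f"percent-clipped={100.0 * self.clipped / self.steps:.1f}")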
], batch size: 471, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:42:16,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=812634.0, ans=0.0 2023-06-21 17:42:31,147 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.609e+02 2.992e+02 3.362e+02 4.963e+02, threshold=5.984e+02, percent-clipped=0.0 2023-06-21 17:42:59,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=812754.0, ans=0.0 2023-06-21 17:43:07,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=812754.0, ans=0.125 2023-06-21 17:43:57,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-21 17:43:59,479 INFO [train.py:996] (3/4) Epoch 5, batch 13500, loss[loss=0.1859, simple_loss=0.2467, pruned_loss=0.06259, over 21517.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3111, pruned_loss=0.08434, over 4269294.31 frames. ], batch size: 195, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:44:00,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=812874.0, ans=0.2 2023-06-21 17:46:37,686 INFO [train.py:996] (3/4) Epoch 5, batch 13550, loss[loss=0.2542, simple_loss=0.3493, pruned_loss=0.07961, over 21714.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3133, pruned_loss=0.08268, over 4266686.79 frames. ], batch size: 247, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:47:28,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.549e+02 2.990e+02 3.504e+02 5.055e+02, threshold=5.980e+02, percent-clipped=0.0 2023-06-21 17:47:35,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-21 17:48:45,918 INFO [train.py:996] (3/4) Epoch 5, batch 13600, loss[loss=0.2351, simple_loss=0.2991, pruned_loss=0.08558, over 21643.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.315, pruned_loss=0.08274, over 4270954.53 frames. ], batch size: 263, lr: 6.27e-03, grad_scale: 32.0 2023-06-21 17:49:31,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-21 17:50:50,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-21 17:51:06,766 INFO [train.py:996] (3/4) Epoch 5, batch 13650, loss[loss=0.2165, simple_loss=0.2851, pruned_loss=0.07393, over 21529.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3098, pruned_loss=0.07995, over 4272277.06 frames. 
], batch size: 414, lr: 6.27e-03, grad_scale: 32.0 2023-06-21 17:51:21,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=813834.0, ans=0.2 2023-06-21 17:52:01,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.314e+02 2.698e+02 3.279e+02 5.824e+02, threshold=5.397e+02, percent-clipped=0.0 2023-06-21 17:52:26,559 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:52:50,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=813954.0, ans=0.125 2023-06-21 17:52:52,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=814014.0, ans=0.2 2023-06-21 17:53:02,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=814014.0, ans=0.1 2023-06-21 17:53:03,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=814014.0, ans=0.0 2023-06-21 17:53:30,180 INFO [train.py:996] (3/4) Epoch 5, batch 13700, loss[loss=0.1909, simple_loss=0.2523, pruned_loss=0.06478, over 21281.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3059, pruned_loss=0.07927, over 4269974.98 frames. ], batch size: 144, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 17:53:47,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=814134.0, ans=0.125 2023-06-21 17:53:59,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=814134.0, ans=0.1 2023-06-21 17:54:02,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=814134.0, ans=0.125 2023-06-21 17:55:11,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-21 17:55:24,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=814314.0, ans=0.0 2023-06-21 17:55:39,239 INFO [train.py:996] (3/4) Epoch 5, batch 13750, loss[loss=0.1969, simple_loss=0.2704, pruned_loss=0.06169, over 21322.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3004, pruned_loss=0.0781, over 4267517.56 frames. ], batch size: 176, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 17:56:03,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=814374.0, ans=0.95 2023-06-21 17:56:40,369 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.570e+02 2.907e+02 3.246e+02 5.241e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-21 17:56:58,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=814554.0, ans=0.2 2023-06-21 17:57:01,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=814554.0, ans=0.0 2023-06-21 17:57:56,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.75 vs. 
limit=15.0 2023-06-21 17:58:04,560 INFO [train.py:996] (3/4) Epoch 5, batch 13800, loss[loss=0.3663, simple_loss=0.4453, pruned_loss=0.1437, over 21443.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3077, pruned_loss=0.07838, over 4266082.99 frames. ], batch size: 507, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 17:58:15,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=814674.0, ans=0.2 2023-06-21 17:58:28,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=814674.0, ans=0.05 2023-06-21 17:58:34,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-21 17:59:55,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=814914.0, ans=0.0 2023-06-21 18:00:36,377 INFO [train.py:996] (3/4) Epoch 5, batch 13850, loss[loss=0.259, simple_loss=0.3297, pruned_loss=0.09417, over 21593.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3141, pruned_loss=0.07918, over 4273148.62 frames. ], batch size: 230, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:01:21,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.680e+02 3.025e+02 3.461e+02 6.759e+02, threshold=6.050e+02, percent-clipped=1.0 2023-06-21 18:01:39,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=815094.0, ans=0.0 2023-06-21 18:01:43,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-21 18:02:41,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=22.5 2023-06-21 18:02:45,814 INFO [train.py:996] (3/4) Epoch 5, batch 13900, loss[loss=0.2345, simple_loss=0.3071, pruned_loss=0.08092, over 21854.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3193, pruned_loss=0.08292, over 4276276.11 frames. ], batch size: 332, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:02:46,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=815274.0, ans=0.1 2023-06-21 18:02:51,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=22.5 2023-06-21 18:02:51,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.32 vs. limit=5.0 2023-06-21 18:03:18,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=815334.0, ans=0.125 2023-06-21 18:03:30,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=815394.0, ans=0.0 2023-06-21 18:03:55,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=815394.0, ans=0.125 2023-06-21 18:04:59,424 INFO [train.py:996] (3/4) Epoch 5, batch 13950, loss[loss=0.2059, simple_loss=0.2915, pruned_loss=0.0602, over 20849.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3185, pruned_loss=0.08472, over 4281692.76 frames. 
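The "Whitening: name=..., num_groups=..., num_channels=..., metric=X vs. limit=Y" entries fire when a whitening statistic of a module's activations exceeds its limit. A plausible form for such a statistic is sketched below: it equals 1.0 exactly when each group's feature covariance is proportional to the identity and grows as channels become correlated or unequally scaled. This is an illustrative reconstruction, not a transcription of scaling.py:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    # x: (num_frames, num_channels). By Cauchy-Schwarz the value is >= 1,
    # with equality iff each group's covariance is a multiple of I.
    num_channels = x.shape[-1]
    per_group = num_channels // num_groups
    xg = x.reshape(-1, num_groups, per_group).transpose(0, 1)
    cov = xg.transpose(1, 2) @ xg / xg.shape[1]      # (groups, C, C) covariance
    trace = cov.diagonal(dim1=-2, dim2=-1).sum(-1)   # per-group trace
    frob_sq = (cov ** 2).sum((-2, -1))               # squared Frobenius norm
    return (per_group * frob_sq / trace.clamp(min=1e-20) ** 2).mean().item()

x = torch.randn(4000, 256)
print(whitening_metric(x, num_groups=1))               # ~1.0 for white input
print(whitening_metric(x @ torch.randn(256, 256), 1))  # clearly above 1.0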
], batch size: 608, lr: 6.26e-03, grad_scale: 16.0 2023-06-21 18:05:51,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=815694.0, ans=0.04949747468305833 2023-06-21 18:05:56,662 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.602e+02 2.918e+02 3.271e+02 5.546e+02, threshold=5.836e+02, percent-clipped=0.0 2023-06-21 18:06:41,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=815754.0, ans=0.125 2023-06-21 18:06:58,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=815814.0, ans=0.125 2023-06-21 18:07:06,575 INFO [train.py:996] (3/4) Epoch 5, batch 14000, loss[loss=0.1895, simple_loss=0.2664, pruned_loss=0.05633, over 21176.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3123, pruned_loss=0.08154, over 4273312.49 frames. ], batch size: 143, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:07:20,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=815874.0, ans=0.0 2023-06-21 18:07:48,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=815934.0, ans=0.125 2023-06-21 18:07:54,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=815994.0, ans=0.0 2023-06-21 18:08:36,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-06-21 18:08:56,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=816114.0, ans=0.125 2023-06-21 18:09:05,307 INFO [train.py:996] (3/4) Epoch 5, batch 14050, loss[loss=0.2074, simple_loss=0.2752, pruned_loss=0.06976, over 21671.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3073, pruned_loss=0.07768, over 4272293.04 frames. ], batch size: 282, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:09:43,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=816234.0, ans=0.1 2023-06-21 18:09:43,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=816234.0, ans=0.1 2023-06-21 18:09:43,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=816234.0, ans=0.125 2023-06-21 18:09:50,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=816234.0, ans=0.0 2023-06-21 18:10:10,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 2.220e+02 2.635e+02 3.062e+02 5.472e+02, threshold=5.269e+02, percent-clipped=0.0 2023-06-21 18:10:56,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=816414.0, ans=0.125 2023-06-21 18:11:16,455 INFO [train.py:996] (3/4) Epoch 5, batch 14100, loss[loss=0.2489, simple_loss=0.3166, pruned_loss=0.0906, over 21683.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3027, pruned_loss=0.0774, over 4256153.78 frames. 
], batch size: 332, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:12:51,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=816714.0, ans=0.125 2023-06-21 18:13:18,686 INFO [train.py:996] (3/4) Epoch 5, batch 14150, loss[loss=0.2217, simple_loss=0.3097, pruned_loss=0.06683, over 21584.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3077, pruned_loss=0.07814, over 4251785.08 frames. ], batch size: 230, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:13:35,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=816834.0, ans=0.125 2023-06-21 18:13:46,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-21 18:13:47,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=816834.0, ans=0.125 2023-06-21 18:14:05,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.286e+02 2.751e+02 3.276e+02 5.188e+02, threshold=5.503e+02, percent-clipped=0.0 2023-06-21 18:14:51,754 INFO [train.py:996] (3/4) Epoch 5, batch 14200, loss[loss=0.2042, simple_loss=0.2862, pruned_loss=0.06107, over 21823.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3062, pruned_loss=0.07662, over 4249647.60 frames. ], batch size: 282, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:15:21,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=817074.0, ans=0.025 2023-06-21 18:16:24,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=817314.0, ans=0.1 2023-06-21 18:16:38,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2023-06-21 18:16:58,885 INFO [train.py:996] (3/4) Epoch 5, batch 14250, loss[loss=0.2201, simple_loss=0.2849, pruned_loss=0.07767, over 21840.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3009, pruned_loss=0.07651, over 4252991.88 frames. ], batch size: 107, lr: 6.25e-03, grad_scale: 16.0 2023-06-21 18:17:02,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=817374.0, ans=0.0 2023-06-21 18:17:26,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=817434.0, ans=0.125 2023-06-21 18:17:56,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=817494.0, ans=0.0 2023-06-21 18:17:57,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 2.265e+02 2.747e+02 3.168e+02 5.793e+02, threshold=5.495e+02, percent-clipped=1.0 2023-06-21 18:17:58,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=817494.0, ans=0.125 2023-06-21 18:19:15,225 INFO [train.py:996] (3/4) Epoch 5, batch 14300, loss[loss=0.2939, simple_loss=0.3804, pruned_loss=0.1037, over 21743.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.301, pruned_loss=0.0757, over 4244395.93 frames. 
], batch size: 351, lr: 6.25e-03, grad_scale: 16.0 2023-06-21 18:20:32,054 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-21 18:21:00,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=817854.0, ans=0.2 2023-06-21 18:21:34,198 INFO [train.py:996] (3/4) Epoch 5, batch 14350, loss[loss=0.208, simple_loss=0.2867, pruned_loss=0.06463, over 21824.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3081, pruned_loss=0.07723, over 4236953.67 frames. ], batch size: 247, lr: 6.25e-03, grad_scale: 16.0 2023-06-21 18:21:56,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=817974.0, ans=0.0 2023-06-21 18:22:03,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=818034.0, ans=0.0 2023-06-21 18:22:48,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.439e+02 2.916e+02 4.284e+02 1.022e+03, threshold=5.832e+02, percent-clipped=15.0 2023-06-21 18:22:53,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=818094.0, ans=0.125 2023-06-21 18:22:53,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=818094.0, ans=0.125 2023-06-21 18:23:46,452 INFO [train.py:996] (3/4) Epoch 5, batch 14400, loss[loss=0.2665, simple_loss=0.3253, pruned_loss=0.1039, over 21753.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3068, pruned_loss=0.07858, over 4250041.54 frames. ], batch size: 112, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:25:55,585 INFO [train.py:996] (3/4) Epoch 5, batch 14450, loss[loss=0.2137, simple_loss=0.2784, pruned_loss=0.07447, over 21781.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.301, pruned_loss=0.07875, over 4256792.45 frames. ], batch size: 351, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:25:57,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=818574.0, ans=0.125 2023-06-21 18:26:10,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-21 18:27:01,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.447e+02 2.694e+02 3.271e+02 4.968e+02, threshold=5.388e+02, percent-clipped=0.0 2023-06-21 18:27:49,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=818814.0, ans=0.2 2023-06-21 18:27:53,503 INFO [train.py:996] (3/4) Epoch 5, batch 14500, loss[loss=0.2225, simple_loss=0.3096, pruned_loss=0.06773, over 21777.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2977, pruned_loss=0.07839, over 4251620.98 frames. ], batch size: 371, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:27:56,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=818874.0, ans=0.025 2023-06-21 18:29:30,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.90 vs. 
limit=15.0 2023-06-21 18:30:23,178 INFO [train.py:996] (3/4) Epoch 5, batch 14550, loss[loss=0.2187, simple_loss=0.282, pruned_loss=0.07765, over 20048.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3027, pruned_loss=0.07987, over 4256153.33 frames. ], batch size: 703, lr: 6.24e-03, grad_scale: 32.0 2023-06-21 18:30:42,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=819174.0, ans=0.125 2023-06-21 18:31:02,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=819234.0, ans=0.0 2023-06-21 18:31:33,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.710e+02 3.093e+02 3.463e+02 5.528e+02, threshold=6.187e+02, percent-clipped=2.0 2023-06-21 18:32:03,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=819354.0, ans=0.1 2023-06-21 18:32:06,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=819414.0, ans=0.035 2023-06-21 18:32:30,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=819414.0, ans=0.0 2023-06-21 18:32:38,584 INFO [train.py:996] (3/4) Epoch 5, batch 14600, loss[loss=0.2661, simple_loss=0.3506, pruned_loss=0.09085, over 21413.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3134, pruned_loss=0.08391, over 4266540.50 frames. ], batch size: 211, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:33:37,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=819594.0, ans=0.1 2023-06-21 18:33:37,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=819594.0, ans=0.04949747468305833 2023-06-21 18:33:54,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=819654.0, ans=0.125 2023-06-21 18:34:18,572 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-21 18:34:57,392 INFO [train.py:996] (3/4) Epoch 5, batch 14650, loss[loss=0.159, simple_loss=0.2383, pruned_loss=0.03988, over 21263.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3148, pruned_loss=0.08305, over 4258484.43 frames. ], batch size: 144, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:35:59,438 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 2.281e+02 2.602e+02 3.168e+02 7.024e+02, threshold=5.204e+02, percent-clipped=1.0 2023-06-21 18:36:14,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-21 18:37:03,317 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-06-21 18:37:05,115 INFO [train.py:996] (3/4) Epoch 5, batch 14700, loss[loss=0.2025, simple_loss=0.283, pruned_loss=0.06101, over 21784.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.308, pruned_loss=0.07743, over 4264760.49 frames. 
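Note the two frame counts on each loss line: the per-batch loss is measured over roughly 20k frames, while tot_loss is measured over roughly 4.27M frames, i.e. it aggregates many recent batches weighted by their frame counts. The fractional totals (e.g. 4264760.49 frames) suggest an exponential forgetting factor rather than a plain running sum; the factor in this sketch is an assumption:

# Frames-weighted running average with an assumed forgetting factor.
def update(num, den, batch_loss, frames, forget=0.999):
    return num * forget + batch_loss * frames, den * forget + frames

num = den = 0.0
for batch_loss, frames in [(0.159, 21263.0), (0.2025, 21784.0)]:  # batches 14650, 14700
    num, den = update(num, den, batch_loss, frames)
print(num / den)  # frames-weighted average of the two batches above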
], batch size: 124, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:37:06,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=820074.0, ans=0.125 2023-06-21 18:39:44,739 INFO [train.py:996] (3/4) Epoch 5, batch 14750, loss[loss=0.2916, simple_loss=0.3658, pruned_loss=0.1087, over 21740.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3132, pruned_loss=0.08, over 4271540.42 frames. ], batch size: 247, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:39:55,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-21 18:40:41,790 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 2.601e+02 3.057e+02 3.762e+02 6.456e+02, threshold=6.114e+02, percent-clipped=6.0 2023-06-21 18:41:17,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=820554.0, ans=0.0 2023-06-21 18:41:31,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.18 vs. limit=10.0 2023-06-21 18:41:58,015 INFO [train.py:996] (3/4) Epoch 5, batch 14800, loss[loss=0.2359, simple_loss=0.2946, pruned_loss=0.08858, over 21127.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.325, pruned_loss=0.0864, over 4274064.19 frames. ], batch size: 176, lr: 6.24e-03, grad_scale: 32.0 2023-06-21 18:42:04,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=820674.0, ans=0.125 2023-06-21 18:42:24,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=820734.0, ans=0.125 2023-06-21 18:42:40,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-21 18:44:17,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=820974.0, ans=0.95 2023-06-21 18:44:18,453 INFO [train.py:996] (3/4) Epoch 5, batch 14850, loss[loss=0.2207, simple_loss=0.2875, pruned_loss=0.07698, over 21534.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3204, pruned_loss=0.08595, over 4268640.43 frames. ], batch size: 230, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:45:20,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=821034.0, ans=0.025 2023-06-21 18:45:21,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=821094.0, ans=0.125 2023-06-21 18:45:22,350 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-21 18:45:25,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=15.0 2023-06-21 18:45:45,998 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.574e+02 2.949e+02 3.565e+02 8.325e+02, threshold=5.898e+02, percent-clipped=1.0 2023-06-21 18:46:12,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=821154.0, ans=0.125 2023-06-21 18:46:32,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=821214.0, ans=0.1 2023-06-21 18:46:46,434 INFO [train.py:996] (3/4) Epoch 5, batch 14900, loss[loss=0.3067, simple_loss=0.3639, pruned_loss=0.1247, over 21468.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3227, pruned_loss=0.08769, over 4270237.74 frames. ], batch size: 471, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:48:32,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=821454.0, ans=0.125 2023-06-21 18:49:05,469 INFO [train.py:996] (3/4) Epoch 5, batch 14950, loss[loss=0.2544, simple_loss=0.3335, pruned_loss=0.0877, over 21587.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3218, pruned_loss=0.08672, over 4263785.77 frames. ], batch size: 389, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:49:22,223 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-06-21 18:49:36,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=821634.0, ans=0.0 2023-06-21 18:49:42,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=821634.0, ans=0.125 2023-06-21 18:49:45,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=821634.0, ans=0.125 2023-06-21 18:50:24,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.514e+02 2.835e+02 3.523e+02 6.432e+02, threshold=5.669e+02, percent-clipped=1.0 2023-06-21 18:50:33,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-21 18:50:39,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=821754.0, ans=0.125 2023-06-21 18:50:51,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-21 18:51:22,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=821814.0, ans=0.125 2023-06-21 18:51:26,645 INFO [train.py:996] (3/4) Epoch 5, batch 15000, loss[loss=0.2452, simple_loss=0.319, pruned_loss=0.08567, over 20683.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3225, pruned_loss=0.08752, over 4263754.02 frames. ], batch size: 607, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:51:26,645 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 18:52:15,532 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2599, simple_loss=0.3537, pruned_loss=0.08302, over 1796401.00 frames. 
2023-06-21 18:52:15,534 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-21 18:52:46,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-21 18:53:09,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=821994.0, ans=0.1 2023-06-21 18:53:09,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0 2023-06-21 18:53:29,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=22.5 2023-06-21 18:53:49,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=822054.0, ans=0.125 2023-06-21 18:54:18,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=822114.0, ans=0.0 2023-06-21 18:54:29,570 INFO [train.py:996] (3/4) Epoch 5, batch 15050, loss[loss=0.2283, simple_loss=0.3024, pruned_loss=0.0771, over 21442.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3232, pruned_loss=0.08778, over 4262392.00 frames. ], batch size: 194, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:54:37,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=822174.0, ans=0.125 2023-06-21 18:55:14,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-06-21 18:55:18,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=822234.0, ans=0.015 2023-06-21 18:55:19,908 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:55:21,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=822234.0, ans=0.125 2023-06-21 18:55:41,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.696e+02 3.166e+02 3.893e+02 6.757e+02, threshold=6.331e+02, percent-clipped=4.0 2023-06-21 18:56:56,372 INFO [train.py:996] (3/4) Epoch 5, batch 15100, loss[loss=0.2493, simple_loss=0.3211, pruned_loss=0.08879, over 21833.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3261, pruned_loss=0.08715, over 4263382.15 frames. ], batch size: 247, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:57:44,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=822534.0, ans=0.125 2023-06-21 18:57:52,903 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:58:09,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=822654.0, ans=0.125 2023-06-21 18:59:23,185 INFO [train.py:996] (3/4) Epoch 5, batch 15150, loss[loss=0.2478, simple_loss=0.2984, pruned_loss=0.0986, over 21180.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3219, pruned_loss=0.08772, over 4263480.34 frames. 
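The "Maximum memory allocated so far is ...MB" line in the validation block above can be obtained from PyTorch's standard CUDA allocator statistics; a minimal sketch, assuming this rank's device is cuda:3:

import torch

if torch.cuda.is_available():
    mb = torch.cuda.max_memory_allocated(device="cuda:3") // (1024 * 1024)
    print(f"Maximum memory allocated so far is {mb}MB")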
], batch size: 143, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:59:35,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=822774.0, ans=0.125 2023-06-21 18:59:38,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=822834.0, ans=0.125 2023-06-21 18:59:57,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0 2023-06-21 19:00:22,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.670e+02 3.218e+02 3.613e+02 4.681e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 19:01:35,532 INFO [train.py:996] (3/4) Epoch 5, batch 15200, loss[loss=0.1957, simple_loss=0.2593, pruned_loss=0.06599, over 21824.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3116, pruned_loss=0.08316, over 4263835.22 frames. ], batch size: 118, lr: 6.23e-03, grad_scale: 32.0 2023-06-21 19:02:02,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=823134.0, ans=0.125 2023-06-21 19:02:22,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=823194.0, ans=0.2 2023-06-21 19:02:25,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=823194.0, ans=0.125 2023-06-21 19:03:33,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=823314.0, ans=0.125 2023-06-21 19:03:36,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=823374.0, ans=0.0 2023-06-21 19:03:37,118 INFO [train.py:996] (3/4) Epoch 5, batch 15250, loss[loss=0.2292, simple_loss=0.2995, pruned_loss=0.07944, over 21807.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3047, pruned_loss=0.08116, over 4266586.28 frames. ], batch size: 118, lr: 6.23e-03, grad_scale: 32.0 2023-06-21 19:03:45,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=823374.0, ans=0.05 2023-06-21 19:03:56,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=823374.0, ans=0.1 2023-06-21 19:04:36,145 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.376e+02 2.792e+02 3.272e+02 5.081e+02, threshold=5.584e+02, percent-clipped=0.0 2023-06-21 19:05:02,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=823554.0, ans=0.0 2023-06-21 19:05:24,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=823554.0, ans=0.125 2023-06-21 19:05:28,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-21 19:05:50,662 INFO [train.py:996] (3/4) Epoch 5, batch 15300, loss[loss=0.2537, simple_loss=0.32, pruned_loss=0.09368, over 21368.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3088, pruned_loss=0.08406, over 4269189.29 frames. 
], batch size: 176, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 19:07:25,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.97 vs. limit=6.0 2023-06-21 19:07:51,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=823914.0, ans=0.125 2023-06-21 19:08:09,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=823914.0, ans=0.0 2023-06-21 19:08:12,187 INFO [train.py:996] (3/4) Epoch 5, batch 15350, loss[loss=0.2211, simple_loss=0.3148, pruned_loss=0.06365, over 21467.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3154, pruned_loss=0.08597, over 4268402.91 frames. ], batch size: 194, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 19:08:39,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=824034.0, ans=0.05 2023-06-21 19:08:57,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=824094.0, ans=0.125 2023-06-21 19:09:18,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.632e+02 3.057e+02 3.588e+02 5.490e+02, threshold=6.113e+02, percent-clipped=0.0 2023-06-21 19:09:47,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5 2023-06-21 19:09:47,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-21 19:10:22,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=824214.0, ans=0.1 2023-06-21 19:10:24,825 INFO [train.py:996] (3/4) Epoch 5, batch 15400, loss[loss=0.2265, simple_loss=0.2978, pruned_loss=0.07759, over 21673.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3172, pruned_loss=0.08447, over 4274005.39 frames. ], batch size: 230, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 19:10:25,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=824274.0, ans=0.1 2023-06-21 19:10:37,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=824274.0, ans=0.0 2023-06-21 19:11:19,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=824394.0, ans=0.2 2023-06-21 19:12:18,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=824514.0, ans=0.125 2023-06-21 19:12:33,788 INFO [train.py:996] (3/4) Epoch 5, batch 15450, loss[loss=0.2172, simple_loss=0.2794, pruned_loss=0.07753, over 21596.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3155, pruned_loss=0.08382, over 4266085.93 frames. ], batch size: 212, lr: 6.22e-03, grad_scale: 16.0 2023-06-21 19:12:55,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.54 vs. 
limit=22.5
2023-06-21 19:13:16,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=824634.0, ans=0.0
2023-06-21 19:13:20,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=824694.0, ans=0.125
2023-06-21 19:13:30,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.430e+02 2.746e+02 3.207e+02 5.836e+02, threshold=5.491e+02, percent-clipped=0.0
2023-06-21 19:13:43,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=824754.0, ans=0.125
2023-06-21 19:14:48,606 INFO [train.py:996] (3/4) Epoch 5, batch 15500, loss[loss=0.3016, simple_loss=0.3607, pruned_loss=0.1212, over 21405.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3175, pruned_loss=0.08416, over 4260769.41 frames. ], batch size: 471, lr: 6.22e-03, grad_scale: 16.0
2023-06-21 19:15:20,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=824874.0, ans=0.0
2023-06-21 19:15:38,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.95 vs. limit=22.5
2023-06-21 19:17:11,612 INFO [train.py:996] (3/4) Epoch 5, batch 15550, loss[loss=0.2093, simple_loss=0.2932, pruned_loss=0.06271, over 21720.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3144, pruned_loss=0.08221, over 4259635.79 frames. ], batch size: 298, lr: 6.22e-03, grad_scale: 16.0
2023-06-21 19:17:17,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=825174.0, ans=0.125
2023-06-21 19:17:36,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0
2023-06-21 19:17:59,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=825294.0, ans=0.125
2023-06-21 19:18:22,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.386e+02 2.767e+02 3.218e+02 7.331e+02, threshold=5.534e+02, percent-clipped=1.0
2023-06-21 19:18:51,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=825354.0, ans=0.2
2023-06-21 19:19:22,019 INFO [train.py:996] (3/4) Epoch 5, batch 15600, loss[loss=0.2394, simple_loss=0.3215, pruned_loss=0.07868, over 21492.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3074, pruned_loss=0.08002, over 4267176.41 frames. ], batch size: 389, lr: 6.22e-03, grad_scale: 32.0
2023-06-21 19:19:27,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=825474.0, ans=0.0
2023-06-21 19:21:22,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=825714.0, ans=0.125
2023-06-21 19:21:31,853 INFO [train.py:996] (3/4) Epoch 5, batch 15650, loss[loss=0.2455, simple_loss=0.3055, pruned_loss=0.09277, over 21416.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3052, pruned_loss=0.07899, over 4267287.46 frames. ], batch size: 389, lr: 6.22e-03, grad_scale: 32.0
2023-06-21 19:22:03,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=825834.0, ans=0.1
2023-06-21 19:22:33,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=825894.0, ans=0.125
2023-06-21 19:22:42,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=825894.0, ans=0.0
2023-06-21 19:22:54,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.460e+02 2.777e+02 3.351e+02 5.058e+02, threshold=5.554e+02, percent-clipped=0.0
2023-06-21 19:23:41,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=826014.0, ans=0.125
2023-06-21 19:23:48,492 INFO [train.py:996] (3/4) Epoch 5, batch 15700, loss[loss=0.2134, simple_loss=0.2976, pruned_loss=0.06462, over 21772.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.301, pruned_loss=0.07808, over 4265565.10 frames. ], batch size: 371, lr: 6.22e-03, grad_scale: 32.0
2023-06-21 19:23:52,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=826074.0, ans=0.125
2023-06-21 19:25:52,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=826314.0, ans=0.2
2023-06-21 19:26:05,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=826374.0, ans=0.0
2023-06-21 19:26:06,627 INFO [train.py:996] (3/4) Epoch 5, batch 15750, loss[loss=0.196, simple_loss=0.2555, pruned_loss=0.06823, over 21603.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2973, pruned_loss=0.07825, over 4255221.73 frames. ], batch size: 247, lr: 6.22e-03, grad_scale: 32.0
2023-06-21 19:26:10,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.65 vs. limit=8.0
2023-06-21 19:26:33,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=826434.0, ans=0.125
2023-06-21 19:26:49,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=826434.0, ans=0.0
2023-06-21 19:26:51,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=826494.0, ans=0.2
2023-06-21 19:27:19,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.398e+02 2.705e+02 3.127e+02 4.328e+02, threshold=5.411e+02, percent-clipped=0.0
2023-06-21 19:27:28,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=826554.0, ans=0.2
2023-06-21 19:28:12,453 INFO [train.py:996] (3/4) Epoch 5, batch 15800, loss[loss=0.2151, simple_loss=0.2716, pruned_loss=0.07932, over 21759.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2924, pruned_loss=0.07697, over 4251312.94 frames. ], batch size: 112, lr: 6.22e-03, grad_scale: 32.0
2023-06-21 19:28:15,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=826674.0, ans=0.0
2023-06-21 19:28:39,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=826734.0, ans=0.04949747468305833
2023-06-21 19:29:44,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=12.0
2023-06-21 19:29:45,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=826854.0, ans=0.125
2023-06-21 19:30:18,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=826914.0, ans=0.0
2023-06-21 19:30:25,358 INFO [train.py:996] (3/4) Epoch 5, batch 15850, loss[loss=0.2706, simple_loss=0.326, pruned_loss=0.1077, over 21335.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2976, pruned_loss=0.07966, over 4251700.38 frames. ], batch size: 471, lr: 6.22e-03, grad_scale: 32.0
2023-06-21 19:30:33,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=826974.0, ans=0.0
2023-06-21 19:31:02,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=827034.0, ans=0.125
2023-06-21 19:31:30,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=827094.0, ans=0.1
2023-06-21 19:31:35,839 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.608e+02 3.041e+02 3.663e+02 6.488e+02, threshold=6.081e+02, percent-clipped=4.0
2023-06-21 19:32:05,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=827154.0, ans=0.0
2023-06-21 19:32:24,222 INFO [train.py:996] (3/4) Epoch 5, batch 15900, loss[loss=0.2051, simple_loss=0.263, pruned_loss=0.07359, over 21457.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2966, pruned_loss=0.07971, over 4251957.42 frames. ], batch size: 212, lr: 6.21e-03, grad_scale: 32.0
2023-06-21 19:33:18,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=827394.0, ans=0.1
2023-06-21 19:34:15,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0
2023-06-21 19:34:21,039 INFO [train.py:996] (3/4) Epoch 5, batch 15950, loss[loss=0.2125, simple_loss=0.2966, pruned_loss=0.06425, over 21736.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.296, pruned_loss=0.07759, over 4241403.37 frames. ], batch size: 282, lr: 6.21e-03, grad_scale: 16.0
2023-06-21 19:34:48,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=827634.0, ans=0.125
2023-06-21 19:35:37,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.379e+02 2.684e+02 3.180e+02 4.998e+02, threshold=5.368e+02, percent-clipped=0.0
2023-06-21 19:36:22,746 INFO [train.py:996] (3/4) Epoch 5, batch 16000, loss[loss=0.2065, simple_loss=0.2597, pruned_loss=0.07663, over 20705.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2969, pruned_loss=0.07574, over 4249491.44 frames. ], batch size: 608, lr: 6.21e-03, grad_scale: 32.0
2023-06-21 19:36:25,346 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0
2023-06-21 19:37:16,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=827994.0, ans=0.2
2023-06-21 19:37:24,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=827994.0, ans=0.125
2023-06-21 19:38:41,473 INFO [train.py:996] (3/4) Epoch 5, batch 16050, loss[loss=0.2376, simple_loss=0.3294, pruned_loss=0.0729, over 21298.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2996, pruned_loss=0.07397, over 4254589.58 frames. ], batch size: 159, lr: 6.21e-03, grad_scale: 32.0
2023-06-21 19:39:15,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=828234.0, ans=0.95
2023-06-21 19:39:15,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=828234.0, ans=0.125
2023-06-21 19:39:33,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=828294.0, ans=0.1
2023-06-21 19:39:42,945 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 19:39:54,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.402e+02 2.679e+02 3.498e+02 5.563e+02, threshold=5.357e+02, percent-clipped=1.0
2023-06-21 19:39:55,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0
2023-06-21 19:40:01,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=828354.0, ans=0.125
2023-06-21 19:40:44,449 INFO [train.py:996] (3/4) Epoch 5, batch 16100, loss[loss=0.2523, simple_loss=0.3152, pruned_loss=0.09475, over 21767.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3032, pruned_loss=0.0764, over 4265728.90 frames. ], batch size: 441, lr: 6.21e-03, grad_scale: 32.0
2023-06-21 19:41:28,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=828534.0, ans=0.0
2023-06-21 19:41:45,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=828594.0, ans=0.0
2023-06-21 19:42:00,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=828594.0, ans=0.2
2023-06-21 19:42:30,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=12.0
2023-06-21 19:42:54,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0
2023-06-21 19:43:08,405 INFO [train.py:996] (3/4) Epoch 5, batch 16150, loss[loss=0.226, simple_loss=0.3193, pruned_loss=0.06635, over 21808.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3037, pruned_loss=0.07909, over 4275537.09 frames. ], batch size: 282, lr: 6.21e-03, grad_scale: 32.0
2023-06-21 19:44:17,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0
2023-06-21 19:44:19,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.586e+02 2.871e+02 3.411e+02 6.404e+02, threshold=5.741e+02, percent-clipped=1.0
2023-06-21 19:44:29,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=828954.0, ans=0.025
2023-06-21 19:44:29,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=828954.0, ans=0.025
2023-06-21 19:45:12,445 INFO [train.py:996] (3/4) Epoch 5, batch 16200, loss[loss=0.2967, simple_loss=0.3572, pruned_loss=0.1181, over 21804.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3086, pruned_loss=0.08052, over 4273120.96 frames. ], batch size: 441, lr: 6.21e-03, grad_scale: 32.0
2023-06-21 19:47:23,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.16 vs. limit=10.0
2023-06-21 19:47:35,961 INFO [train.py:996] (3/4) Epoch 5, batch 16250, loss[loss=0.187, simple_loss=0.2721, pruned_loss=0.05096, over 21635.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3068, pruned_loss=0.07854, over 4274111.66 frames. ], batch size: 263, lr: 6.21e-03, grad_scale: 32.0
2023-06-21 19:47:41,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=829374.0, ans=0.0
2023-06-21 19:48:21,545 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0
2023-06-21 19:48:40,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=829494.0, ans=0.125
2023-06-21 19:48:54,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.259e+02 2.680e+02 3.153e+02 6.826e+02, threshold=5.361e+02, percent-clipped=1.0
2023-06-21 19:49:15,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=829554.0, ans=0.0
2023-06-21 19:49:43,582 INFO [train.py:996] (3/4) Epoch 5, batch 16300, loss[loss=0.1947, simple_loss=0.2708, pruned_loss=0.05933, over 21697.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3016, pruned_loss=0.07444, over 4268373.56 frames. ], batch size: 332, lr: 6.21e-03, grad_scale: 32.0
2023-06-21 19:50:15,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=829734.0, ans=0.125
2023-06-21 19:51:15,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=829854.0, ans=0.0
2023-06-21 19:52:05,553 INFO [train.py:996] (3/4) Epoch 5, batch 16350, loss[loss=0.2973, simple_loss=0.366, pruned_loss=0.1143, over 21783.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3032, pruned_loss=0.07598, over 4256892.09 frames. ], batch size: 441, lr: 6.20e-03, grad_scale: 32.0
2023-06-21 19:52:29,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5
2023-06-21 19:53:25,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.400e+02 2.710e+02 3.275e+02 5.510e+02, threshold=5.421e+02, percent-clipped=1.0
2023-06-21 19:53:39,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.41 vs. limit=15.0
2023-06-21 19:53:42,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=830154.0, ans=0.125
2023-06-21 19:53:56,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=830214.0, ans=0.125
2023-06-21 19:54:19,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=830274.0, ans=0.125
2023-06-21 19:54:20,640 INFO [train.py:996] (3/4) Epoch 5, batch 16400, loss[loss=0.2342, simple_loss=0.3014, pruned_loss=0.08345, over 21878.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.307, pruned_loss=0.07771, over 4261238.68 frames. ], batch size: 107, lr: 6.20e-03, grad_scale: 32.0
2023-06-21 19:54:44,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0
2023-06-21 19:55:50,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=830454.0, ans=0.0
2023-06-21 19:56:17,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0
2023-06-21 19:56:39,875 INFO [train.py:996] (3/4) Epoch 5, batch 16450, loss[loss=0.2157, simple_loss=0.2851, pruned_loss=0.07312, over 21777.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3069, pruned_loss=0.0792, over 4272140.21 frames. ], batch size: 247, lr: 6.20e-03, grad_scale: 32.0
2023-06-21 19:57:10,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=830574.0, ans=0.1
2023-06-21 19:57:54,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=830694.0, ans=0.0
2023-06-21 19:57:57,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.625e+02 2.903e+02 3.534e+02 6.213e+02, threshold=5.806e+02, percent-clipped=3.0
2023-06-21 19:58:58,126 INFO [train.py:996] (3/4) Epoch 5, batch 16500, loss[loss=0.1985, simple_loss=0.2601, pruned_loss=0.06849, over 21423.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.306, pruned_loss=0.0794, over 4277216.26 frames. ], batch size: 211, lr: 6.20e-03, grad_scale: 32.0
2023-06-21 19:59:24,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=830874.0, ans=0.125
2023-06-21 19:59:50,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0
2023-06-21 19:59:51,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=830934.0, ans=0.04949747468305833
2023-06-21 20:00:11,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=830994.0, ans=0.125
2023-06-21 20:00:35,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0
2023-06-21 20:01:20,114 INFO [train.py:996] (3/4) Epoch 5, batch 16550, loss[loss=0.2542, simple_loss=0.3164, pruned_loss=0.09596, over 21320.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3064, pruned_loss=0.07766, over 4272400.12 frames. ], batch size: 159, lr: 6.20e-03, grad_scale: 16.0
2023-06-21 20:01:34,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=831174.0, ans=0.0
2023-06-21 20:02:00,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=22.5
2023-06-21 20:02:34,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=831294.0, ans=0.2
2023-06-21 20:02:39,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.770e+02 3.273e+02 4.136e+02 6.995e+02, threshold=6.546e+02, percent-clipped=5.0
2023-06-21 20:02:58,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=831354.0, ans=0.0
2023-06-21 20:03:48,554 INFO [train.py:996] (3/4) Epoch 5, batch 16600, loss[loss=0.2813, simple_loss=0.3823, pruned_loss=0.09021, over 21826.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3139, pruned_loss=0.08108, over 4263357.45 frames. ], batch size: 316, lr: 6.20e-03, grad_scale: 16.0
2023-06-21 20:03:55,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0
2023-06-21 20:05:02,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0
2023-06-21 20:05:36,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=12.0
2023-06-21 20:05:43,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0
2023-06-21 20:06:10,112 INFO [train.py:996] (3/4) Epoch 5, batch 16650, loss[loss=0.2908, simple_loss=0.3564, pruned_loss=0.1126, over 21783.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3217, pruned_loss=0.08378, over 4265720.82 frames. ], batch size: 441, lr: 6.20e-03, grad_scale: 16.0
2023-06-21 20:06:59,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=831894.0, ans=0.125
2023-06-21 20:07:21,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.847e+02 3.289e+02 3.856e+02 6.866e+02, threshold=6.578e+02, percent-clipped=1.0
2023-06-21 20:08:28,191 INFO [train.py:996] (3/4) Epoch 5, batch 16700, loss[loss=0.2581, simple_loss=0.3386, pruned_loss=0.08883, over 21625.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3225, pruned_loss=0.08503, over 4271866.97 frames. ], batch size: 414, lr: 6.20e-03, grad_scale: 16.0
2023-06-21 20:09:48,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0
2023-06-21 20:10:19,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=832254.0, ans=0.0
2023-06-21 20:10:56,357 INFO [train.py:996] (3/4) Epoch 5, batch 16750, loss[loss=0.24, simple_loss=0.3185, pruned_loss=0.08081, over 20738.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3254, pruned_loss=0.08712, over 4259146.61 frames. ], batch size: 607, lr: 6.20e-03, grad_scale: 16.0
2023-06-21 20:12:34,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.676e+02 3.031e+02 3.509e+02 7.132e+02, threshold=6.063e+02, percent-clipped=1.0
2023-06-21 20:12:41,242 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 20:13:03,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=832614.0, ans=0.125
2023-06-21 20:13:23,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=832614.0, ans=0.125
2023-06-21 20:13:37,288 INFO [train.py:996] (3/4) Epoch 5, batch 16800, loss[loss=0.2186, simple_loss=0.2963, pruned_loss=0.07043, over 21501.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3286, pruned_loss=0.08655, over 4255625.04 frames. ], batch size: 131, lr: 6.19e-03, grad_scale: 32.0
2023-06-21 20:14:00,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=832674.0, ans=0.2
2023-06-21 20:14:19,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=832734.0, ans=0.0
2023-06-21 20:14:36,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=832734.0, ans=0.0
2023-06-21 20:14:46,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=832794.0, ans=0.125
2023-06-21 20:15:58,069 INFO [train.py:996] (3/4) Epoch 5, batch 16850, loss[loss=0.2564, simple_loss=0.3136, pruned_loss=0.09958, over 21771.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3235, pruned_loss=0.08586, over 4264990.31 frames. ], batch size: 441, lr: 6.19e-03, grad_scale: 32.0
2023-06-21 20:16:20,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0
2023-06-21 20:16:27,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=832974.0, ans=0.2
2023-06-21 20:16:32,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=833034.0, ans=0.125
2023-06-21 20:16:37,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0
2023-06-21 20:16:42,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=833034.0, ans=0.125
2023-06-21 20:16:58,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=833094.0, ans=0.1
2023-06-21 20:17:05,596 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.685e+02 3.021e+02 3.703e+02 6.356e+02, threshold=6.041e+02, percent-clipped=1.0
2023-06-21 20:17:49,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=833214.0, ans=0.125
2023-06-21 20:18:07,186 INFO [train.py:996] (3/4) Epoch 5, batch 16900, loss[loss=0.1942, simple_loss=0.2652, pruned_loss=0.0616, over 21550.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3181, pruned_loss=0.08395, over 4268864.49 frames. ], batch size: 230, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:18:43,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5
2023-06-21 20:18:48,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=833334.0, ans=0.125
2023-06-21 20:18:50,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0
2023-06-21 20:18:51,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=833334.0, ans=0.1
2023-06-21 20:18:54,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=833334.0, ans=0.125
2023-06-21 20:18:55,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=833334.0, ans=0.04949747468305833
2023-06-21 20:19:28,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=833454.0, ans=0.025
2023-06-21 20:20:22,851 INFO [train.py:996] (3/4) Epoch 5, batch 16950, loss[loss=0.248, simple_loss=0.3124, pruned_loss=0.09184, over 21842.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3107, pruned_loss=0.08159, over 4276418.65 frames. ], batch size: 107, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:20:45,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=833574.0, ans=0.125
2023-06-21 20:21:21,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=833694.0, ans=0.125
2023-06-21 20:21:33,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=833694.0, ans=0.125
2023-06-21 20:21:46,292 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.433e+02 2.745e+02 3.481e+02 5.788e+02, threshold=5.489e+02, percent-clipped=0.0
2023-06-21 20:22:27,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0
2023-06-21 20:22:44,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=833874.0, ans=0.09899494936611666
2023-06-21 20:22:52,011 INFO [train.py:996] (3/4) Epoch 5, batch 17000, loss[loss=0.2566, simple_loss=0.3164, pruned_loss=0.09843, over 21258.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.31, pruned_loss=0.08236, over 4273662.81 frames. ], batch size: 176, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:23:01,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0
2023-06-21 20:23:36,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=833934.0, ans=0.125
2023-06-21 20:25:19,786 INFO [train.py:996] (3/4) Epoch 5, batch 17050, loss[loss=0.257, simple_loss=0.3505, pruned_loss=0.08175, over 21626.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3161, pruned_loss=0.08418, over 4274313.43 frames. ], batch size: 230, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:25:31,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=12.0
2023-06-21 20:25:51,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=834234.0, ans=0.07
2023-06-21 20:25:56,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=834234.0, ans=0.0
2023-06-21 20:25:56,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=834234.0, ans=0.125
2023-06-21 20:25:57,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=834234.0, ans=0.125
2023-06-21 20:26:06,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0
2023-06-21 20:26:25,075 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.807e+02 3.360e+02 4.026e+02 6.543e+02, threshold=6.720e+02, percent-clipped=2.0
2023-06-21 20:26:31,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0
2023-06-21 20:27:33,435 INFO [train.py:996] (3/4) Epoch 5, batch 17100, loss[loss=0.2125, simple_loss=0.2807, pruned_loss=0.07209, over 21863.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3147, pruned_loss=0.0848, over 4279897.91 frames. ], batch size: 247, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:28:23,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.22 vs. limit=6.0
2023-06-21 20:28:34,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=834594.0, ans=6.0
2023-06-21 20:29:35,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=834714.0, ans=0.125
2023-06-21 20:29:47,410 INFO [train.py:996] (3/4) Epoch 5, batch 17150, loss[loss=0.1942, simple_loss=0.2641, pruned_loss=0.06214, over 21456.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3103, pruned_loss=0.08372, over 4291489.51 frames. ], batch size: 131, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:29:58,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=834774.0, ans=0.125
2023-06-21 20:29:58,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=834774.0, ans=0.2
2023-06-21 20:30:09,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0
2023-06-21 20:30:58,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=834954.0, ans=0.0
2023-06-21 20:31:03,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.351e+02 2.648e+02 3.061e+02 4.435e+02, threshold=5.296e+02, percent-clipped=0.0
2023-06-21 20:31:40,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.97 vs. limit=15.0
2023-06-21 20:32:12,192 INFO [train.py:996] (3/4) Epoch 5, batch 17200, loss[loss=0.2404, simple_loss=0.3111, pruned_loss=0.0848, over 21463.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3103, pruned_loss=0.08413, over 4289885.00 frames. ], batch size: 211, lr: 6.19e-03, grad_scale: 32.0
2023-06-21 20:32:55,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=835194.0, ans=0.125
2023-06-21 20:33:38,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=835254.0, ans=0.125
2023-06-21 20:34:04,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=835314.0, ans=0.125
2023-06-21 20:34:21,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=835374.0, ans=0.0
2023-06-21 20:34:22,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0
2023-06-21 20:34:22,668 INFO [train.py:996] (3/4) Epoch 5, batch 17250, loss[loss=0.267, simple_loss=0.3421, pruned_loss=0.09595, over 21685.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3138, pruned_loss=0.08529, over 4282135.51 frames. ], batch size: 298, lr: 6.18e-03, grad_scale: 32.0
2023-06-21 20:34:23,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0
2023-06-21 20:35:36,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=835494.0, ans=0.125
2023-06-21 20:35:51,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.701e+02 3.017e+02 3.644e+02 7.802e+02, threshold=6.033e+02, percent-clipped=3.0
2023-06-21 20:35:52,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=835554.0, ans=0.1
2023-06-21 20:36:32,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=835614.0, ans=0.0
2023-06-21 20:36:42,222 INFO [train.py:996] (3/4) Epoch 5, batch 17300, loss[loss=0.2913, simple_loss=0.3736, pruned_loss=0.1045, over 17450.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3208, pruned_loss=0.0892, over 4276071.85 frames. ], batch size: 60, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:37:51,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5
2023-06-21 20:38:43,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=835854.0, ans=0.125
2023-06-21 20:39:02,695 INFO [train.py:996] (3/4) Epoch 5, batch 17350, loss[loss=0.2327, simple_loss=0.3044, pruned_loss=0.08048, over 20732.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3222, pruned_loss=0.08905, over 4281052.64 frames. ], batch size: 607, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:39:26,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=835974.0, ans=0.125
2023-06-21 20:40:49,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.583e+02 2.886e+02 3.231e+02 4.631e+02, threshold=5.772e+02, percent-clipped=0.0
2023-06-21 20:40:52,452 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5
2023-06-21 20:41:32,766 INFO [train.py:996] (3/4) Epoch 5, batch 17400, loss[loss=0.2227, simple_loss=0.3045, pruned_loss=0.07042, over 21788.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3193, pruned_loss=0.08559, over 4279256.03 frames. ], batch size: 316, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:42:02,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=836274.0, ans=0.125
2023-06-21 20:42:28,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=836334.0, ans=0.125
2023-06-21 20:42:46,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0
2023-06-21 20:43:39,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0
2023-06-21 20:44:15,063 INFO [train.py:996] (3/4) Epoch 5, batch 17450, loss[loss=0.2529, simple_loss=0.3276, pruned_loss=0.08909, over 20635.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3142, pruned_loss=0.08259, over 4268217.64 frames. ], batch size: 607, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:44:40,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=836634.0, ans=0.035
2023-06-21 20:44:45,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=836634.0, ans=0.2
2023-06-21 20:45:16,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=836694.0, ans=0.0
2023-06-21 20:45:30,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.434e+02 2.809e+02 3.515e+02 5.944e+02, threshold=5.617e+02, percent-clipped=1.0
2023-06-21 20:45:54,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=836814.0, ans=0.125
2023-06-21 20:46:17,024 INFO [train.py:996] (3/4) Epoch 5, batch 17500, loss[loss=0.265, simple_loss=0.3372, pruned_loss=0.09641, over 21839.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3103, pruned_loss=0.08026, over 4266271.48 frames. ], batch size: 107, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:46:54,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=836874.0, ans=0.0
2023-06-21 20:47:24,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=836994.0, ans=0.0
2023-06-21 20:47:43,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=837054.0, ans=0.1
2023-06-21 20:48:30,508 INFO [train.py:996] (3/4) Epoch 5, batch 17550, loss[loss=0.2254, simple_loss=0.3098, pruned_loss=0.0705, over 21363.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3104, pruned_loss=0.07949, over 4266474.98 frames. ], batch size: 176, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:48:35,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=837174.0, ans=0.5
2023-06-21 20:49:13,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=837234.0, ans=0.07
2023-06-21 20:49:47,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=837354.0, ans=0.125
2023-06-21 20:49:50,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.518e+02 2.821e+02 3.568e+02 5.525e+02, threshold=5.643e+02, percent-clipped=0.0
2023-06-21 20:50:45,972 INFO [train.py:996] (3/4) Epoch 5, batch 17600, loss[loss=0.2942, simple_loss=0.3566, pruned_loss=0.116, over 21822.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3125, pruned_loss=0.07993, over 4266378.26 frames. ], batch size: 441, lr: 6.18e-03, grad_scale: 32.0
2023-06-21 20:50:48,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=837474.0, ans=0.0
2023-06-21 20:52:38,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=837714.0, ans=0.0
2023-06-21 20:52:41,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.21 vs. limit=10.0
2023-06-21 20:52:42,137 INFO [train.py:996] (3/4) Epoch 5, batch 17650, loss[loss=0.2316, simple_loss=0.3083, pruned_loss=0.0774, over 21584.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3111, pruned_loss=0.07987, over 4263099.92 frames. ], batch size: 441, lr: 6.18e-03, grad_scale: 32.0
2023-06-21 20:53:08,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=837774.0, ans=0.2
2023-06-21 20:53:19,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=837774.0, ans=0.2
2023-06-21 20:53:40,329 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0
2023-06-21 20:54:08,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.478e+02 2.944e+02 3.611e+02 6.334e+02, threshold=5.887e+02, percent-clipped=3.0
2023-06-21 20:54:23,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=837954.0, ans=0.125
2023-06-21 20:54:31,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0
2023-06-21 20:54:56,280 INFO [train.py:996] (3/4) Epoch 5, batch 17700, loss[loss=0.2591, simple_loss=0.3329, pruned_loss=0.09264, over 21435.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3067, pruned_loss=0.07777, over 4259903.73 frames. ], batch size: 131, lr: 6.17e-03, grad_scale: 32.0
2023-06-21 20:55:08,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=838074.0, ans=0.125
2023-06-21 20:55:11,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=838074.0, ans=0.2
2023-06-21 20:55:18,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=838074.0, ans=0.125
2023-06-21 20:56:40,511 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 20:57:22,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=838314.0, ans=0.125
2023-06-21 20:57:22,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=838314.0, ans=0.0
2023-06-21 20:57:27,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=838314.0, ans=0.125
2023-06-21 20:57:30,185 INFO [train.py:996] (3/4) Epoch 5, batch 17750, loss[loss=0.248, simple_loss=0.3216, pruned_loss=0.08719, over 21798.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3134, pruned_loss=0.08046, over 4261083.64 frames. ], batch size: 247, lr: 6.17e-03, grad_scale: 16.0
2023-06-21 20:57:31,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=838374.0, ans=0.125
2023-06-21 20:57:41,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=838374.0, ans=0.125
2023-06-21 20:57:42,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=838374.0, ans=0.125
2023-06-21 20:58:25,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=838494.0, ans=0.125
2023-06-21 20:58:53,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=838554.0, ans=0.0
2023-06-21 20:58:57,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.544e+02 3.029e+02 3.560e+02 6.664e+02, threshold=6.058e+02, percent-clipped=1.0
2023-06-21 20:59:00,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0
2023-06-21 20:59:04,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=838554.0, ans=0.125
2023-06-21 20:59:49,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0
2023-06-21 20:59:49,789 INFO [train.py:996] (3/4) Epoch 5, batch 17800, loss[loss=0.2193, simple_loss=0.2953, pruned_loss=0.0717, over 21593.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3131, pruned_loss=0.08001, over 4264065.22 frames. ], batch size: 230, lr: 6.17e-03, grad_scale: 16.0
2023-06-21 20:59:55,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=838674.0, ans=0.05
2023-06-21 21:00:53,084 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0
2023-06-21 21:01:34,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=838854.0, ans=0.0
2023-06-21 21:01:58,529 INFO [train.py:996] (3/4) Epoch 5, batch 17850, loss[loss=0.3083, simple_loss=0.3701, pruned_loss=0.1232, over 21791.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3124, pruned_loss=0.08029, over 4264223.20 frames. ], batch size: 441, lr: 6.17e-03, grad_scale: 16.0
2023-06-21 21:02:16,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=838974.0, ans=0.125
2023-06-21 21:02:16,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=838974.0, ans=0.125
2023-06-21 21:02:18,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0
2023-06-21 21:03:12,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.38 vs. limit=8.0
2023-06-21 21:03:43,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.616e+02 2.932e+02 3.373e+02 4.756e+02, threshold=5.865e+02, percent-clipped=0.0
2023-06-21 21:04:28,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=839214.0, ans=0.125
2023-06-21 21:04:31,038 INFO [train.py:996] (3/4) Epoch 5, batch 17900, loss[loss=0.3092, simple_loss=0.3929, pruned_loss=0.1127, over 21471.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.317, pruned_loss=0.08176, over 4264379.83 frames. ], batch size: 471, lr: 6.17e-03, grad_scale: 16.0
2023-06-21 21:04:46,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.49 vs. limit=10.0
2023-06-21 21:05:02,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=839274.0, ans=0.125
2023-06-21 21:05:13,279 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 21:06:12,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=839394.0, ans=0.0
2023-06-21 21:07:04,357 INFO [train.py:996] (3/4) Epoch 5, batch 17950, loss[loss=0.2404, simple_loss=0.332, pruned_loss=0.07442, over 21486.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3159, pruned_loss=0.07776, over 4265957.61 frames. ], batch size: 507, lr: 6.17e-03, grad_scale: 16.0
2023-06-21 21:08:02,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=839634.0, ans=0.125
2023-06-21 21:08:27,751 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 21:08:30,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 2.221e+02 2.558e+02 3.024e+02 5.288e+02, threshold=5.115e+02, percent-clipped=0.0
2023-06-21 21:09:18,129 INFO [train.py:996] (3/4) Epoch 5, batch 18000, loss[loss=0.2128, simple_loss=0.2773, pruned_loss=0.07416, over 21597.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3106, pruned_loss=0.07725, over 4264920.62 frames. ], batch size: 298, lr: 6.17e-03, grad_scale: 32.0
2023-06-21 21:09:18,130 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 21:10:07,763 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2683, simple_loss=0.365, pruned_loss=0.08582, over 1796401.00 frames.
2023-06-21 21:10:07,765 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-21 21:11:07,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=839994.0, ans=0.2
2023-06-21 21:12:09,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0
2023-06-21 21:12:14,555 INFO [train.py:996] (3/4) Epoch 5, batch 18050, loss[loss=0.212, simple_loss=0.2824, pruned_loss=0.07075, over 21652.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3052, pruned_loss=0.07634, over 4262131.07 frames. ], batch size: 298, lr: 6.17e-03, grad_scale: 32.0
2023-06-21 21:12:31,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=840174.0, ans=0.5
2023-06-21 21:12:34,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=840174.0, ans=0.125
2023-06-21 21:12:46,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=840234.0, ans=0.0
2023-06-21 21:13:01,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=15.0
2023-06-21 21:13:26,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=840294.0, ans=0.2
2023-06-21 21:13:37,673 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.366e+02 2.744e+02 3.197e+02 4.866e+02, threshold=5.489e+02, percent-clipped=0.0
2023-06-21 21:14:24,255 INFO [train.py:996] (3/4) Epoch 5, batch 18100, loss[loss=0.2344, simple_loss=0.3034, pruned_loss=0.08264, over 21155.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3111, pruned_loss=0.07864, over 4264152.30 frames. ], batch size: 143, lr: 6.17e-03, grad_scale: 32.0
2023-06-21 21:14:34,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=840474.0, ans=0.0
2023-06-21 21:15:01,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=840534.0, ans=0.1
2023-06-21 21:15:14,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=840594.0, ans=0.125
2023-06-21 21:16:32,819 INFO [train.py:996] (3/4) Epoch 5, batch 18150, loss[loss=0.2129, simple_loss=0.292, pruned_loss=0.06692, over 21574.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3097, pruned_loss=0.07795, over 4268819.00 frames. ], batch size: 230, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:16:49,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=840774.0, ans=0.1
2023-06-21 21:17:05,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=840834.0, ans=0.0
2023-06-21 21:17:13,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=840894.0, ans=0.125
2023-06-21 21:17:52,206 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.463e+02 2.724e+02 3.140e+02 4.706e+02, threshold=5.448e+02, percent-clipped=0.0
2023-06-21 21:17:55,604 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 21:17:58,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=840954.0, ans=0.2
2023-06-21 21:18:16,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=841014.0, ans=0.025
2023-06-21 21:18:22,753 INFO [train.py:996] (3/4) Epoch 5, batch 18200, loss[loss=0.1858, simple_loss=0.2634, pruned_loss=0.05416, over 21774.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3027, pruned_loss=0.07748, over 4266540.52 frames. ], batch size: 124, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:19:20,513 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5
2023-06-21 21:19:58,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=841254.0, ans=0.0
2023-06-21 21:20:36,986 INFO [train.py:996] (3/4) Epoch 5, batch 18250, loss[loss=0.2994, simple_loss=0.3352, pruned_loss=0.1318, over 21746.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2979, pruned_loss=0.07603, over 4249221.96 frames. ], batch size: 508, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:22:05,476 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 2.405e+02 2.752e+02 3.303e+02 7.064e+02, threshold=5.504e+02, percent-clipped=4.0
2023-06-21 21:22:08,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=841554.0, ans=0.0
2023-06-21 21:22:15,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=841614.0, ans=0.2
2023-06-21 21:22:45,505 INFO [train.py:996] (3/4) Epoch 5, batch 18300, loss[loss=0.2218, simple_loss=0.2971, pruned_loss=0.07326, over 21872.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2986, pruned_loss=0.07605, over 4240979.20 frames. ], batch size: 118, lr: 6.16e-03, grad_scale: 16.0
2023-06-21 21:22:53,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=841674.0, ans=0.04949747468305833
2023-06-21 21:22:55,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=841674.0, ans=0.0
2023-06-21 21:23:35,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=841794.0, ans=0.125
2023-06-21 21:23:46,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=841794.0, ans=0.125
2023-06-21 21:23:58,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=841854.0, ans=0.125
2023-06-21 21:24:44,378 INFO [train.py:996] (3/4) Epoch 5, batch 18350, loss[loss=0.2606, simple_loss=0.3949, pruned_loss=0.06313, over 19700.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3023, pruned_loss=0.07605, over 4237494.07 frames. ], batch size: 702, lr: 6.16e-03, grad_scale: 16.0
2023-06-21 21:24:47,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=841974.0, ans=0.0
2023-06-21 21:25:23,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=842034.0, ans=0.0
2023-06-21 21:25:33,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=842034.0, ans=0.04949747468305833
2023-06-21 21:26:05,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.529e+02 3.098e+02 3.873e+02 6.507e+02, threshold=6.195e+02, percent-clipped=4.0
2023-06-21 21:26:05,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=842154.0, ans=0.125
2023-06-21 21:26:22,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=842214.0, ans=0.125
2023-06-21 21:27:08,509 INFO [train.py:996] (3/4) Epoch 5, batch 18400, loss[loss=0.2334, simple_loss=0.3181, pruned_loss=0.07439, over 21874.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2985, pruned_loss=0.07475, over 4249060.78 frames. ], batch size: 373, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:27:41,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=842334.0, ans=0.125
2023-06-21 21:27:56,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.53 vs. limit=6.0
2023-06-21 21:28:07,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=842394.0, ans=0.0
2023-06-21 21:28:34,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0
2023-06-21 21:29:21,801 INFO [train.py:996] (3/4) Epoch 5, batch 18450, loss[loss=0.1677, simple_loss=0.2635, pruned_loss=0.03593, over 21728.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2952, pruned_loss=0.07095, over 4251382.53 frames. ], batch size: 298, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:29:47,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0
2023-06-21 21:30:06,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=842694.0, ans=0.125
2023-06-21 21:30:25,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=842754.0, ans=0.125
2023-06-21 21:30:35,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.187e+02 2.513e+02 3.081e+02 4.801e+02, threshold=5.026e+02, percent-clipped=0.0
2023-06-21 21:30:50,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. limit=10.0
2023-06-21 21:30:53,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=842814.0, ans=0.125
2023-06-21 21:31:16,298 INFO [train.py:996] (3/4) Epoch 5, batch 18500, loss[loss=0.1988, simple_loss=0.2623, pruned_loss=0.06769, over 21595.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2903, pruned_loss=0.07041, over 4244002.04 frames. ], batch size: 263, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:31:39,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0
2023-06-21 21:32:36,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=842994.0, ans=0.05
2023-06-21 21:32:40,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=843054.0, ans=0.0
2023-06-21 21:33:32,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=843114.0, ans=0.1
2023-06-21 21:33:38,482 INFO [train.py:996] (3/4) Epoch 5, batch 18550, loss[loss=0.222, simple_loss=0.2869, pruned_loss=0.07856, over 21519.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.287, pruned_loss=0.06989, over 4236920.29 frames. ], batch size: 391, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:34:33,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=843294.0, ans=0.015
2023-06-21 21:35:03,986 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.298e+02 2.531e+02 2.916e+02 4.382e+02, threshold=5.063e+02, percent-clipped=0.0
2023-06-21 21:35:39,102 INFO [train.py:996] (3/4) Epoch 5, batch 18600, loss[loss=0.2766, simple_loss=0.3528, pruned_loss=0.1002, over 21607.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2851, pruned_loss=0.07006, over 4230719.58 frames. ], batch size: 442, lr: 6.15e-03, grad_scale: 16.0
2023-06-21 21:36:03,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=843474.0, ans=0.0
2023-06-21 21:36:16,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=843534.0, ans=0.125
2023-06-21 21:36:18,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=843534.0, ans=0.1
2023-06-21 21:37:31,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.21 vs. limit=10.0
2023-06-21 21:37:38,658 INFO [train.py:996] (3/4) Epoch 5, batch 18650, loss[loss=0.221, simple_loss=0.2802, pruned_loss=0.08086, over 21318.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2862, pruned_loss=0.07117, over 4215893.27 frames. ], batch size: 551, lr: 6.15e-03, grad_scale: 16.0
2023-06-21 21:38:30,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=843834.0, ans=0.125
2023-06-21 21:38:48,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.10 vs. limit=10.0
2023-06-21 21:39:04,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.384e+02 2.688e+02 3.191e+02 3.999e+02, threshold=5.375e+02, percent-clipped=0.0
2023-06-21 21:39:50,089 INFO [train.py:996] (3/4) Epoch 5, batch 18700, loss[loss=0.2178, simple_loss=0.2832, pruned_loss=0.07619, over 22006.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2852, pruned_loss=0.07221, over 4230629.18 frames. ], batch size: 300, lr: 6.15e-03, grad_scale: 16.0
2023-06-21 21:40:45,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=844134.0, ans=0.125
2023-06-21 21:42:03,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=844314.0, ans=0.125
2023-06-21 21:42:07,676 INFO [train.py:996] (3/4) Epoch 5, batch 18750, loss[loss=0.2468, simple_loss=0.3222, pruned_loss=0.08571, over 21755.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.287, pruned_loss=0.07444, over 4250116.43 frames. ], batch size: 332, lr: 6.15e-03, grad_scale: 16.0
2023-06-21 21:42:10,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=844374.0, ans=0.0
2023-06-21 21:42:37,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=844374.0, ans=0.1
2023-06-21 21:42:51,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0
2023-06-21 21:43:02,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0
2023-06-21 21:43:04,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=844434.0, ans=0.125
2023-06-21 21:43:17,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5
2023-06-21 21:43:41,303 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 2.502e+02 2.880e+02 3.663e+02 5.516e+02, threshold=5.761e+02, percent-clipped=2.0
2023-06-21 21:43:43,732 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0
2023-06-21 21:44:30,215 INFO [train.py:996] (3/4) Epoch 5, batch 18800, loss[loss=0.2923, simple_loss=0.3691, pruned_loss=0.1078, over 21540.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2927, pruned_loss=0.07598, over 4246557.75 frames. ], batch size: 471, lr: 6.15e-03, grad_scale: 32.0
2023-06-21 21:44:33,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=844674.0, ans=0.0
2023-06-21 21:46:32,353 INFO [train.py:996] (3/4) Epoch 5, batch 18850, loss[loss=0.1999, simple_loss=0.2609, pruned_loss=0.06947, over 21185.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2893, pruned_loss=0.07162, over 4253265.84 frames. ], batch size: 159, lr: 6.15e-03, grad_scale: 32.0
2023-06-21 21:46:44,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=844974.0, ans=0.5
2023-06-21 21:48:09,539 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 2.173e+02 2.548e+02 3.090e+02 5.403e+02, threshold=5.096e+02, percent-clipped=0.0
2023-06-21 21:48:22,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=845154.0, ans=0.0
2023-06-21 21:48:37,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=845214.0, ans=0.125
2023-06-21 21:48:47,485 INFO [train.py:996] (3/4) Epoch 5, batch 18900, loss[loss=0.2163, simple_loss=0.2753, pruned_loss=0.07862, over 21676.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2868, pruned_loss=0.07153, over 4262042.71 frames. ], batch size: 247, lr: 6.15e-03, grad_scale: 32.0
2023-06-21 21:48:56,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=845274.0, ans=0.0
2023-06-21 21:49:21,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=845334.0, ans=0.0
2023-06-21 21:49:23,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=22.5
2023-06-21 21:50:08,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=845454.0, ans=0.5
2023-06-21 21:50:36,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=845514.0, ans=0.025
2023-06-21 21:50:37,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=845514.0, ans=0.0
2023-06-21 21:50:50,457 INFO [train.py:996] (3/4) Epoch 5, batch 18950, loss[loss=0.2013, simple_loss=0.2464, pruned_loss=0.07808, over 21044.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2875, pruned_loss=0.07386, over 4276653.63 frames. ], batch size: 608, lr: 6.15e-03, grad_scale: 32.0
2023-06-21 21:51:33,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0
2023-06-21 21:51:49,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=845634.0, ans=10.0
2023-06-21 21:52:19,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0
2023-06-21 21:52:52,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.414e+02 2.698e+02 3.115e+02 5.010e+02, threshold=5.395e+02, percent-clipped=0.0
2023-06-21 21:53:21,660 INFO [train.py:996] (3/4) Epoch 5, batch 19000, loss[loss=0.2649, simple_loss=0.342, pruned_loss=0.09385, over 21831.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2983, pruned_loss=0.07554, over 4285503.00 frames.
], batch size: 282, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 21:53:23,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=845874.0, ans=0.125 2023-06-21 21:54:05,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=845934.0, ans=0.0 2023-06-21 21:55:12,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=846114.0, ans=0.0 2023-06-21 21:55:29,409 INFO [train.py:996] (3/4) Epoch 5, batch 19050, loss[loss=0.2621, simple_loss=0.3232, pruned_loss=0.1005, over 21728.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.304, pruned_loss=0.07971, over 4287830.52 frames. ], batch size: 389, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 21:56:39,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=846294.0, ans=0.125 2023-06-21 21:56:58,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2023-06-21 21:57:00,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.693e+02 3.057e+02 3.649e+02 5.381e+02, threshold=6.114e+02, percent-clipped=0.0 2023-06-21 21:57:43,805 INFO [train.py:996] (3/4) Epoch 5, batch 19100, loss[loss=0.2109, simple_loss=0.2729, pruned_loss=0.07442, over 21686.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3018, pruned_loss=0.0802, over 4291094.99 frames. ], batch size: 316, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 21:57:44,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-21 21:58:01,231 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:58:26,563 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-21 22:00:04,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=22.5 2023-06-21 22:00:10,251 INFO [train.py:996] (3/4) Epoch 5, batch 19150, loss[loss=0.2427, simple_loss=0.3436, pruned_loss=0.07089, over 21702.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3047, pruned_loss=0.0811, over 4283418.19 frames. ], batch size: 298, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 22:01:27,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-06-21 22:01:30,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=15.0 2023-06-21 22:01:42,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.638e+02 2.878e+02 3.184e+02 5.125e+02, threshold=5.755e+02, percent-clipped=0.0 2023-06-21 22:01:53,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=847014.0, ans=0.1 2023-06-21 22:01:54,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.68 vs. 
limit=15.0 2023-06-21 22:02:22,330 INFO [train.py:996] (3/4) Epoch 5, batch 19200, loss[loss=0.2609, simple_loss=0.3658, pruned_loss=0.07801, over 21740.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3163, pruned_loss=0.08258, over 4279623.93 frames. ], batch size: 351, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:02:33,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=847074.0, ans=0.2 2023-06-21 22:03:16,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=12.0 2023-06-21 22:03:25,974 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-21 22:03:34,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=847194.0, ans=0.04949747468305833 2023-06-21 22:04:18,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=847254.0, ans=0.125 2023-06-21 22:04:30,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=847314.0, ans=0.125 2023-06-21 22:04:42,047 INFO [train.py:996] (3/4) Epoch 5, batch 19250, loss[loss=0.2025, simple_loss=0.2817, pruned_loss=0.06164, over 21605.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3185, pruned_loss=0.07931, over 4277969.18 frames. ], batch size: 263, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:04:53,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=847374.0, ans=0.0 2023-06-21 22:05:36,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=847494.0, ans=0.125 2023-06-21 22:06:30,476 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.396e+02 2.706e+02 3.299e+02 5.125e+02, threshold=5.412e+02, percent-clipped=0.0 2023-06-21 22:06:36,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=847614.0, ans=0.125 2023-06-21 22:06:47,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=847614.0, ans=0.125 2023-06-21 22:06:51,712 INFO [train.py:996] (3/4) Epoch 5, batch 19300, loss[loss=0.2117, simple_loss=0.2916, pruned_loss=0.06588, over 21279.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3136, pruned_loss=0.077, over 4281782.86 frames. 
], batch size: 176, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:07:54,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=847794.0, ans=0.2 2023-06-21 22:08:18,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=847854.0, ans=0.125 2023-06-21 22:08:52,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=847914.0, ans=0.125 2023-06-21 22:08:59,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=847914.0, ans=0.0 2023-06-21 22:09:03,671 INFO [train.py:996] (3/4) Epoch 5, batch 19350, loss[loss=0.2761, simple_loss=0.3513, pruned_loss=0.1005, over 21555.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3093, pruned_loss=0.07383, over 4272283.82 frames. ], batch size: 509, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:09:33,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=848034.0, ans=0.0 2023-06-21 22:09:58,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=848094.0, ans=0.05 2023-06-21 22:10:43,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.66 vs. limit=8.0 2023-06-21 22:10:49,119 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.180e+02 2.502e+02 2.789e+02 3.930e+02, threshold=5.004e+02, percent-clipped=0.0 2023-06-21 22:10:51,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=848154.0, ans=0.0 2023-06-21 22:11:11,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-21 22:11:16,306 INFO [train.py:996] (3/4) Epoch 5, batch 19400, loss[loss=0.2149, simple_loss=0.2879, pruned_loss=0.07093, over 21078.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3057, pruned_loss=0.07261, over 4276951.55 frames. ], batch size: 608, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:12:06,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=848334.0, ans=0.5 2023-06-21 22:12:10,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-21 22:12:10,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=848394.0, ans=0.0 2023-06-21 22:12:16,770 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:13:09,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=848454.0, ans=0.125 2023-06-21 22:13:20,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=848514.0, ans=0.2 2023-06-21 22:13:26,767 INFO [train.py:996] (3/4) Epoch 5, batch 19450, loss[loss=0.2177, simple_loss=0.2911, pruned_loss=0.0721, over 21802.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3017, pruned_loss=0.07461, over 4275215.83 frames. 
], batch size: 118, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:13:35,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=848574.0, ans=0.04949747468305833 2023-06-21 22:14:24,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=848694.0, ans=0.0 2023-06-21 22:14:25,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=848694.0, ans=0.125 2023-06-21 22:15:03,932 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.755e+02 3.312e+02 4.071e+02 7.217e+02, threshold=6.625e+02, percent-clipped=11.0 2023-06-21 22:15:14,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=848814.0, ans=0.0 2023-06-21 22:15:31,721 INFO [train.py:996] (3/4) Epoch 5, batch 19500, loss[loss=0.2065, simple_loss=0.276, pruned_loss=0.06849, over 21658.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2963, pruned_loss=0.07525, over 4263338.13 frames. ], batch size: 298, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:16:05,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-21 22:16:24,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-21 22:16:49,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=848994.0, ans=0.015 2023-06-21 22:17:17,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-21 22:17:39,114 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:17:47,554 INFO [train.py:996] (3/4) Epoch 5, batch 19550, loss[loss=0.2089, simple_loss=0.3058, pruned_loss=0.05601, over 21832.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2924, pruned_loss=0.07347, over 4266152.61 frames. ], batch size: 371, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 22:17:55,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=849174.0, ans=0.1 2023-06-21 22:17:59,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=849174.0, ans=0.125 2023-06-21 22:19:29,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.481e+02 2.782e+02 3.315e+02 6.873e+02, threshold=5.565e+02, percent-clipped=1.0 2023-06-21 22:19:51,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=849414.0, ans=0.1 2023-06-21 22:19:59,642 INFO [train.py:996] (3/4) Epoch 5, batch 19600, loss[loss=0.2465, simple_loss=0.3075, pruned_loss=0.09275, over 21910.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2955, pruned_loss=0.07522, over 4275250.63 frames. 
], batch size: 316, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:20:31,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=849534.0, ans=0.125 2023-06-21 22:21:09,495 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0 2023-06-21 22:21:22,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=849594.0, ans=0.125 2023-06-21 22:22:29,147 INFO [train.py:996] (3/4) Epoch 5, batch 19650, loss[loss=0.2626, simple_loss=0.3305, pruned_loss=0.09732, over 20014.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3004, pruned_loss=0.07887, over 4273543.68 frames. ], batch size: 702, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:22:49,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=849834.0, ans=0.125 2023-06-21 22:24:12,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.670e+02 2.999e+02 3.349e+02 5.989e+02, threshold=5.997e+02, percent-clipped=1.0 2023-06-21 22:24:29,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=850014.0, ans=0.125 2023-06-21 22:24:57,805 INFO [train.py:996] (3/4) Epoch 5, batch 19700, loss[loss=0.2145, simple_loss=0.3081, pruned_loss=0.06041, over 21742.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3019, pruned_loss=0.07885, over 4273657.69 frames. ], batch size: 332, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:25:01,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=850074.0, ans=0.0 2023-06-21 22:26:02,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-21 22:26:13,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=850194.0, ans=0.0 2023-06-21 22:26:37,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=850194.0, ans=0.125 2023-06-21 22:26:37,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=850194.0, ans=0.2 2023-06-21 22:26:58,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=850254.0, ans=0.125 2023-06-21 22:27:13,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=850314.0, ans=0.1 2023-06-21 22:27:22,336 INFO [train.py:996] (3/4) Epoch 5, batch 19750, loss[loss=0.2507, simple_loss=0.3381, pruned_loss=0.08163, over 21833.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3112, pruned_loss=0.08014, over 4277588.53 frames. 
], batch size: 298, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:27:44,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=850374.0, ans=0.0 2023-06-21 22:29:19,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.869e+02 3.746e+02 5.027e+02 9.459e+02, threshold=7.491e+02, percent-clipped=12.0 2023-06-21 22:29:39,219 INFO [train.py:996] (3/4) Epoch 5, batch 19800, loss[loss=0.2395, simple_loss=0.3051, pruned_loss=0.08689, over 21934.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3123, pruned_loss=0.08087, over 4279193.19 frames. ], batch size: 351, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:30:59,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=850794.0, ans=0.5 2023-06-21 22:32:09,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=850914.0, ans=0.125 2023-06-21 22:32:11,690 INFO [train.py:996] (3/4) Epoch 5, batch 19850, loss[loss=0.1794, simple_loss=0.2608, pruned_loss=0.04906, over 21605.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3041, pruned_loss=0.07605, over 4278925.48 frames. ], batch size: 263, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:33:18,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-21 22:33:35,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.74 vs. limit=10.0 2023-06-21 22:33:37,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.738e+02 2.104e+02 2.344e+02 2.838e+02 4.436e+02, threshold=4.687e+02, percent-clipped=0.0 2023-06-21 22:33:37,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=851154.0, ans=0.07 2023-06-21 22:33:40,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=851214.0, ans=0.035 2023-06-21 22:34:25,313 INFO [train.py:996] (3/4) Epoch 5, batch 19900, loss[loss=0.201, simple_loss=0.2972, pruned_loss=0.0524, over 21768.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3043, pruned_loss=0.07305, over 4276046.44 frames. ], batch size: 351, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 22:34:53,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=851334.0, ans=0.125 2023-06-21 22:35:07,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=851394.0, ans=0.125 2023-06-21 22:36:10,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-21 22:36:29,312 INFO [train.py:996] (3/4) Epoch 5, batch 19950, loss[loss=0.1883, simple_loss=0.2529, pruned_loss=0.06181, over 21630.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2981, pruned_loss=0.0724, over 4271670.56 frames. ], batch size: 282, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 22:36:32,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.17 vs. 
limit=22.5 2023-06-21 22:36:41,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=851574.0, ans=0.0 2023-06-21 22:37:08,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=851634.0, ans=0.125 2023-06-21 22:37:14,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=851634.0, ans=0.125 2023-06-21 22:37:56,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=851754.0, ans=0.125 2023-06-21 22:37:59,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=851754.0, ans=0.2 2023-06-21 22:38:22,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.543e+02 2.962e+02 3.700e+02 5.630e+02, threshold=5.923e+02, percent-clipped=5.0 2023-06-21 22:38:52,202 INFO [train.py:996] (3/4) Epoch 5, batch 20000, loss[loss=0.2449, simple_loss=0.312, pruned_loss=0.08891, over 21418.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2998, pruned_loss=0.07276, over 4259045.94 frames. ], batch size: 159, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 22:38:52,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=851874.0, ans=0.125 2023-06-21 22:38:59,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=851874.0, ans=0.1 2023-06-21 22:39:13,261 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-21 22:39:15,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-21 22:39:21,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-21 22:39:21,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=22.5 2023-06-21 22:39:37,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=851994.0, ans=0.1 2023-06-21 22:40:42,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=852114.0, ans=0.125 2023-06-21 22:40:53,538 INFO [train.py:996] (3/4) Epoch 5, batch 20050, loss[loss=0.2417, simple_loss=0.3163, pruned_loss=0.08357, over 21851.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3029, pruned_loss=0.07606, over 4272279.40 frames. 
], batch size: 414, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 22:41:27,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=852234.0, ans=0.125 2023-06-21 22:41:43,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=852294.0, ans=0.125 2023-06-21 22:42:14,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=852354.0, ans=0.025 2023-06-21 22:42:24,537 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.589e+02 2.869e+02 3.352e+02 4.558e+02, threshold=5.737e+02, percent-clipped=0.0 2023-06-21 22:42:47,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=852414.0, ans=0.125 2023-06-21 22:43:07,072 INFO [train.py:996] (3/4) Epoch 5, batch 20100, loss[loss=0.2861, simple_loss=0.35, pruned_loss=0.1111, over 21603.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3044, pruned_loss=0.07829, over 4276050.46 frames. ], batch size: 471, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:44:19,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=852594.0, ans=0.0 2023-06-21 22:45:17,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.13 vs. limit=5.0 2023-06-21 22:45:34,425 INFO [train.py:996] (3/4) Epoch 5, batch 20150, loss[loss=0.2642, simple_loss=0.3335, pruned_loss=0.09744, over 21945.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3145, pruned_loss=0.08215, over 4271112.91 frames. ], batch size: 316, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:47:06,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-21 22:47:18,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.899e+02 3.546e+02 4.462e+02 7.672e+02, threshold=7.092e+02, percent-clipped=5.0 2023-06-21 22:47:19,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-21 22:47:31,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-21 22:47:53,458 INFO [train.py:996] (3/4) Epoch 5, batch 20200, loss[loss=0.2921, simple_loss=0.3908, pruned_loss=0.09668, over 21245.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3197, pruned_loss=0.08516, over 4268715.53 frames. 
], batch size: 548, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:47:53,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=853074.0, ans=0.2 2023-06-21 22:48:12,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=853074.0, ans=0.125 2023-06-21 22:49:13,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=853194.0, ans=0.07 2023-06-21 22:49:27,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=853254.0, ans=0.125 2023-06-21 22:49:45,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=853314.0, ans=0.07 2023-06-21 22:50:22,423 INFO [train.py:996] (3/4) Epoch 5, batch 20250, loss[loss=0.2207, simple_loss=0.2901, pruned_loss=0.0757, over 21256.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3208, pruned_loss=0.08399, over 4272227.64 frames. ], batch size: 143, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:50:29,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=853374.0, ans=0.0 2023-06-21 22:50:30,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=853374.0, ans=0.0 2023-06-21 22:50:32,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=853374.0, ans=0.125 2023-06-21 22:50:47,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=853434.0, ans=0.125 2023-06-21 22:50:51,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=853434.0, ans=0.2 2023-06-21 22:51:51,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=853554.0, ans=0.025 2023-06-21 22:51:56,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.435e+02 2.804e+02 3.299e+02 5.136e+02, threshold=5.609e+02, percent-clipped=0.0 2023-06-21 22:52:26,421 INFO [train.py:996] (3/4) Epoch 5, batch 20300, loss[loss=0.243, simple_loss=0.3128, pruned_loss=0.08663, over 21078.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3179, pruned_loss=0.08115, over 4268846.65 frames. ], batch size: 143, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:54:00,812 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-21 22:54:05,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=22.5 2023-06-21 22:54:20,955 INFO [train.py:996] (3/4) Epoch 5, batch 20350, loss[loss=0.2954, simple_loss=0.3514, pruned_loss=0.1197, over 21548.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3181, pruned_loss=0.08167, over 4260006.10 frames. 
], batch size: 507, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:54:43,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=853974.0, ans=0.2 2023-06-21 22:54:47,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=853974.0, ans=0.1 2023-06-21 22:56:00,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=854154.0, ans=0.0 2023-06-21 22:56:09,297 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.362e+02 2.783e+02 3.388e+02 6.347e+02, threshold=5.566e+02, percent-clipped=2.0 2023-06-21 22:56:09,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=854214.0, ans=0.0 2023-06-21 22:56:15,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=854214.0, ans=0.015 2023-06-21 22:56:15,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=854214.0, ans=0.125 2023-06-21 22:56:42,087 INFO [train.py:996] (3/4) Epoch 5, batch 20400, loss[loss=0.2565, simple_loss=0.3239, pruned_loss=0.09452, over 21227.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3199, pruned_loss=0.08359, over 4259583.40 frames. ], batch size: 176, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 22:56:47,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-21 22:57:00,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-21 22:57:15,367 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:57:25,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-21 22:57:32,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=854334.0, ans=0.0 2023-06-21 22:57:48,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=854394.0, ans=0.125 2023-06-21 22:58:14,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=854454.0, ans=0.125 2023-06-21 22:58:58,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.81 vs. limit=10.0 2023-06-21 22:59:03,939 INFO [train.py:996] (3/4) Epoch 5, batch 20450, loss[loss=0.2429, simple_loss=0.3015, pruned_loss=0.09213, over 21433.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3204, pruned_loss=0.08581, over 4256263.46 frames. ], batch size: 194, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 22:59:19,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.83 vs. 
limit=22.5 2023-06-21 23:00:35,306 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.525e+02 2.860e+02 3.466e+02 5.175e+02, threshold=5.721e+02, percent-clipped=0.0 2023-06-21 23:00:35,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=854814.0, ans=0.0 2023-06-21 23:00:42,780 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:00:55,760 INFO [train.py:996] (3/4) Epoch 5, batch 20500, loss[loss=0.2098, simple_loss=0.2781, pruned_loss=0.07081, over 21761.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3156, pruned_loss=0.08574, over 4259780.91 frames. ], batch size: 316, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:01:28,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=854934.0, ans=0.125 2023-06-21 23:01:55,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.92 vs. limit=15.0 2023-06-21 23:02:38,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=855054.0, ans=0.125 2023-06-21 23:02:41,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=855114.0, ans=0.125 2023-06-21 23:03:05,511 INFO [train.py:996] (3/4) Epoch 5, batch 20550, loss[loss=0.2818, simple_loss=0.3576, pruned_loss=0.103, over 21461.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3079, pruned_loss=0.08363, over 4254217.43 frames. ], batch size: 473, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:04:45,346 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.439e+02 2.815e+02 3.369e+02 5.545e+02, threshold=5.629e+02, percent-clipped=0.0 2023-06-21 23:05:12,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=855414.0, ans=0.125 2023-06-21 23:05:16,246 INFO [train.py:996] (3/4) Epoch 5, batch 20600, loss[loss=0.2266, simple_loss=0.3022, pruned_loss=0.07551, over 21818.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3116, pruned_loss=0.08238, over 4265267.95 frames. ], batch size: 298, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:05:41,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=855534.0, ans=0.1 2023-06-21 23:06:41,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=855654.0, ans=0.0 2023-06-21 23:06:41,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=855654.0, ans=0.125 2023-06-21 23:07:21,865 INFO [train.py:996] (3/4) Epoch 5, batch 20650, loss[loss=0.2145, simple_loss=0.2818, pruned_loss=0.07365, over 21798.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3073, pruned_loss=0.08269, over 4257023.69 frames. 
], batch size: 351, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:07:25,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=855774.0, ans=0.2 2023-06-21 23:07:49,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=855834.0, ans=0.2 2023-06-21 23:07:53,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=855834.0, ans=0.0 2023-06-21 23:08:45,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=855954.0, ans=0.0 2023-06-21 23:08:51,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-21 23:08:52,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=855954.0, ans=0.125 2023-06-21 23:09:13,974 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.383e+02 2.744e+02 3.352e+02 4.969e+02, threshold=5.489e+02, percent-clipped=0.0 2023-06-21 23:09:17,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=856014.0, ans=0.125 2023-06-21 23:09:44,573 INFO [train.py:996] (3/4) Epoch 5, batch 20700, loss[loss=0.1703, simple_loss=0.2444, pruned_loss=0.04803, over 21754.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3004, pruned_loss=0.079, over 4264651.12 frames. ], batch size: 124, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:10:30,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-21 23:11:09,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=856254.0, ans=0.0 2023-06-21 23:11:21,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=856314.0, ans=0.125 2023-06-21 23:11:51,750 INFO [train.py:996] (3/4) Epoch 5, batch 20750, loss[loss=0.2729, simple_loss=0.3759, pruned_loss=0.08499, over 21681.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2998, pruned_loss=0.07664, over 4256831.01 frames. ], batch size: 389, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:12:26,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=856374.0, ans=0.1 2023-06-21 23:12:31,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=856434.0, ans=0.2 2023-06-21 23:12:31,984 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-21 23:13:11,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=856494.0, ans=0.125 2023-06-21 23:13:20,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. 
limit=10.0 2023-06-21 23:13:40,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=856554.0, ans=0.125 2023-06-21 23:13:45,294 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:13:45,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=856614.0, ans=0.2 2023-06-21 23:13:46,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.828e+02 3.457e+02 4.836e+02 8.710e+02, threshold=6.913e+02, percent-clipped=21.0 2023-06-21 23:14:13,082 INFO [train.py:996] (3/4) Epoch 5, batch 20800, loss[loss=0.2984, simple_loss=0.3397, pruned_loss=0.1286, over 21340.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3048, pruned_loss=0.0786, over 4254844.01 frames. ], batch size: 507, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 23:14:19,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=856674.0, ans=0.1 2023-06-21 23:15:14,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=856794.0, ans=0.2 2023-06-21 23:15:51,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=856914.0, ans=0.1 2023-06-21 23:15:55,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=856914.0, ans=0.04949747468305833 2023-06-21 23:16:01,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=856914.0, ans=0.0 2023-06-21 23:16:24,393 INFO [train.py:996] (3/4) Epoch 5, batch 20850, loss[loss=0.1969, simple_loss=0.2722, pruned_loss=0.0608, over 21859.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2965, pruned_loss=0.07611, over 4251417.36 frames. ], batch size: 333, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 23:16:31,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=856974.0, ans=0.125 2023-06-21 23:16:32,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=856974.0, ans=0.125 2023-06-21 23:16:35,874 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:17:46,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=857154.0, ans=0.125 2023-06-21 23:18:13,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.467e+02 2.845e+02 3.460e+02 6.189e+02, threshold=5.691e+02, percent-clipped=0.0 2023-06-21 23:18:40,143 INFO [train.py:996] (3/4) Epoch 5, batch 20900, loss[loss=0.2413, simple_loss=0.3193, pruned_loss=0.08165, over 21767.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2989, pruned_loss=0.07736, over 4255838.00 frames. ], batch size: 351, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:19:19,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=857394.0, ans=0.125 2023-06-21 23:19:19,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.95 vs. 
limit=22.5 2023-06-21 23:20:14,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=857514.0, ans=0.125 2023-06-21 23:20:25,864 INFO [train.py:996] (3/4) Epoch 5, batch 20950, loss[loss=0.2054, simple_loss=0.2777, pruned_loss=0.06655, over 21725.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2961, pruned_loss=0.0738, over 4264547.70 frames. ], batch size: 282, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:20:36,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=857574.0, ans=0.125 2023-06-21 23:21:08,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-21 23:21:17,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-21 23:21:39,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=857694.0, ans=0.125 2023-06-21 23:22:08,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.456e+02 2.757e+02 3.195e+02 6.346e+02, threshold=5.513e+02, percent-clipped=1.0 2023-06-21 23:22:17,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-21 23:22:39,144 INFO [train.py:996] (3/4) Epoch 5, batch 21000, loss[loss=0.2593, simple_loss=0.3171, pruned_loss=0.1007, over 21759.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2949, pruned_loss=0.0742, over 4264122.02 frames. ], batch size: 389, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:22:39,144 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 23:23:36,536 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2652, simple_loss=0.3651, pruned_loss=0.08266, over 1796401.00 frames. 2023-06-21 23:23:36,537 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-21 23:23:48,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=857874.0, ans=0.5 2023-06-21 23:23:52,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=857934.0, ans=0.0 2023-06-21 23:23:53,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=857934.0, ans=0.0 2023-06-21 23:24:22,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=857994.0, ans=0.04949747468305833 2023-06-21 23:24:30,594 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-21 23:25:09,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-21 23:25:24,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2023-06-21 23:25:24,733 INFO [train.py:996] (3/4) Epoch 5, batch 21050, loss[loss=0.1975, simple_loss=0.2646, pruned_loss=0.06524, over 21282.00 frames. 
], tot_loss[loss=0.2212, simple_loss=0.2928, pruned_loss=0.0748, over 4264896.44 frames. ], batch size: 159, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:25:41,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=858174.0, ans=0.125 2023-06-21 23:25:51,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=858234.0, ans=0.025 2023-06-21 23:25:56,405 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-21 23:26:16,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=858294.0, ans=0.125 2023-06-21 23:26:16,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.48 vs. limit=15.0 2023-06-21 23:26:59,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.489e+02 2.963e+02 3.543e+02 5.648e+02, threshold=5.927e+02, percent-clipped=1.0 2023-06-21 23:27:10,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=858414.0, ans=0.125 2023-06-21 23:27:12,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=858414.0, ans=0.125 2023-06-21 23:27:32,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=858414.0, ans=0.95 2023-06-21 23:27:34,812 INFO [train.py:996] (3/4) Epoch 5, batch 21100, loss[loss=0.2237, simple_loss=0.2747, pruned_loss=0.08641, over 21316.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2889, pruned_loss=0.07402, over 4262791.26 frames. ], batch size: 160, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:27:56,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-21 23:28:07,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-21 23:28:16,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.23 vs. limit=22.5 2023-06-21 23:29:35,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=858714.0, ans=0.125 2023-06-21 23:29:44,530 INFO [train.py:996] (3/4) Epoch 5, batch 21150, loss[loss=0.1791, simple_loss=0.2221, pruned_loss=0.06803, over 20816.00 frames. ], tot_loss[loss=0.218, simple_loss=0.286, pruned_loss=0.07494, over 4259633.13 frames. 
], batch size: 609, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:30:04,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=858774.0, ans=0.125 2023-06-21 23:30:14,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=858834.0, ans=0.0 2023-06-21 23:30:31,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=858894.0, ans=0.2 2023-06-21 23:31:29,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.447e+02 2.797e+02 3.334e+02 4.948e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-21 23:31:55,145 INFO [train.py:996] (3/4) Epoch 5, batch 21200, loss[loss=0.2013, simple_loss=0.2717, pruned_loss=0.06546, over 21765.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2822, pruned_loss=0.07407, over 4265934.84 frames. ], batch size: 371, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 23:31:57,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=859074.0, ans=0.0 2023-06-21 23:33:53,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=859314.0, ans=0.035 2023-06-21 23:34:07,351 INFO [train.py:996] (3/4) Epoch 5, batch 21250, loss[loss=0.1963, simple_loss=0.2746, pruned_loss=0.05904, over 21326.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2799, pruned_loss=0.07383, over 4267192.65 frames. ], batch size: 131, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:35:04,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=859494.0, ans=0.125 2023-06-21 23:35:08,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=859554.0, ans=0.125 2023-06-21 23:35:38,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=859614.0, ans=0.125 2023-06-21 23:35:41,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.492e+02 2.834e+02 3.212e+02 4.793e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-21 23:36:01,187 INFO [train.py:996] (3/4) Epoch 5, batch 21300, loss[loss=0.2536, simple_loss=0.3193, pruned_loss=0.09393, over 21949.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2863, pruned_loss=0.07591, over 4264824.48 frames. 
], batch size: 113, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:36:24,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=859674.0, ans=0.0 2023-06-21 23:36:31,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=859674.0, ans=0.025 2023-06-21 23:36:36,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=859734.0, ans=0.1 2023-06-21 23:36:45,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=859734.0, ans=0.0 2023-06-21 23:37:11,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=859794.0, ans=0.125 2023-06-21 23:37:22,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=859854.0, ans=22.5 2023-06-21 23:37:30,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=12.0 2023-06-21 23:38:27,901 INFO [train.py:996] (3/4) Epoch 5, batch 21350, loss[loss=0.1903, simple_loss=0.2809, pruned_loss=0.04991, over 21756.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2903, pruned_loss=0.07602, over 4265757.97 frames. ], batch size: 282, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:38:31,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=859974.0, ans=0.0 2023-06-21 23:38:47,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=860034.0, ans=0.125 2023-06-21 23:39:21,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=860154.0, ans=0.0 2023-06-21 23:39:53,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=860154.0, ans=0.2 2023-06-21 23:39:58,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.438e+02 2.773e+02 3.263e+02 5.694e+02, threshold=5.547e+02, percent-clipped=1.0 2023-06-21 23:40:26,302 INFO [train.py:996] (3/4) Epoch 5, batch 21400, loss[loss=0.227, simple_loss=0.3145, pruned_loss=0.06972, over 21617.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2945, pruned_loss=0.07614, over 4265089.82 frames. 
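Each Whitening line compares a measured statistic of a module's activations against a limit; the intent is to keep the channel covariance close to white (a multiple of the identity). One plausible way to express such a metric, shown for illustration only and not taken from the code that produced these messages:

    import torch

    def whitening_metric(x):
        """x: (num_frames, num_channels). Returns E[lam^2] / E[lam]^2 over the
        eigenvalues of the channel covariance: 1.0 for perfectly white
        features, larger as the eigenvalue spread grows."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        return (eigs ** 2).mean() / eigs.mean() ** 2

    x = torch.randn(1000, 256) * torch.linspace(0.5, 2.0, 256)  # non-white
    print(whitening_metric(x))  # clearly above 1.0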
], batch size: 414, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:40:46,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=860334.0, ans=0.0 2023-06-21 23:41:32,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=860394.0, ans=0.125 2023-06-21 23:41:52,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=860454.0, ans=0.0 2023-06-21 23:42:01,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=860454.0, ans=0.0 2023-06-21 23:42:13,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=860514.0, ans=0.05 2023-06-21 23:42:28,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=860514.0, ans=0.125 2023-06-21 23:42:50,929 INFO [train.py:996] (3/4) Epoch 5, batch 21450, loss[loss=0.2494, simple_loss=0.3163, pruned_loss=0.09127, over 21878.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2981, pruned_loss=0.07762, over 4275830.51 frames. ], batch size: 414, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:43:02,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=860574.0, ans=0.125 2023-06-21 23:43:20,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=860634.0, ans=0.0 2023-06-21 23:43:20,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=860634.0, ans=0.0 2023-06-21 23:43:58,689 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:44:20,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=860754.0, ans=0.0 2023-06-21 23:44:40,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.786e+02 3.330e+02 3.846e+02 5.703e+02, threshold=6.661e+02, percent-clipped=1.0 2023-06-21 23:44:41,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=860814.0, ans=0.125 2023-06-21 23:45:04,746 INFO [train.py:996] (3/4) Epoch 5, batch 21500, loss[loss=0.2052, simple_loss=0.2684, pruned_loss=0.07094, over 21630.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2964, pruned_loss=0.07872, over 4271896.30 frames. ], batch size: 298, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:45:44,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=860934.0, ans=0.2 2023-06-21 23:46:00,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-21 23:46:05,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.24 vs. 
limit=22.5 2023-06-21 23:46:06,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=860994.0, ans=0.125 2023-06-21 23:47:17,750 INFO [train.py:996] (3/4) Epoch 5, batch 21550, loss[loss=0.209, simple_loss=0.2851, pruned_loss=0.06645, over 21838.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2899, pruned_loss=0.07556, over 4275054.57 frames. ], batch size: 107, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:48:15,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=861294.0, ans=0.0 2023-06-21 23:49:12,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.449e+02 2.734e+02 3.185e+02 5.518e+02, threshold=5.467e+02, percent-clipped=0.0 2023-06-21 23:49:24,735 INFO [train.py:996] (3/4) Epoch 5, batch 21600, loss[loss=0.1878, simple_loss=0.2473, pruned_loss=0.06416, over 20713.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2872, pruned_loss=0.0742, over 4272768.75 frames. ], batch size: 607, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:49:28,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=861474.0, ans=0.0 2023-06-21 23:50:32,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=861594.0, ans=0.0 2023-06-21 23:50:32,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=861594.0, ans=0.125 2023-06-21 23:51:30,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=861714.0, ans=0.1 2023-06-21 23:51:39,801 INFO [train.py:996] (3/4) Epoch 5, batch 21650, loss[loss=0.1692, simple_loss=0.2363, pruned_loss=0.05111, over 21466.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2907, pruned_loss=0.07306, over 4275480.72 frames. 
], batch size: 212, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:52:18,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=861834.0, ans=0.0 2023-06-21 23:52:18,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=861834.0, ans=0.2 2023-06-21 23:52:21,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=861834.0, ans=0.0 2023-06-21 23:52:28,824 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:52:36,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=861894.0, ans=0.2 2023-06-21 23:53:35,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=861954.0, ans=0.125 2023-06-21 23:53:43,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.320e+02 2.612e+02 3.014e+02 5.606e+02, threshold=5.225e+02, percent-clipped=2.0 2023-06-21 23:53:48,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=862014.0, ans=0.1 2023-06-21 23:53:49,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=862014.0, ans=0.0 2023-06-21 23:53:55,261 INFO [train.py:996] (3/4) Epoch 5, batch 21700, loss[loss=0.1921, simple_loss=0.2489, pruned_loss=0.06764, over 20742.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2906, pruned_loss=0.07198, over 4271473.85 frames. ], batch size: 608, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:53:57,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=862074.0, ans=0.1 2023-06-21 23:54:01,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=862074.0, ans=0.2 2023-06-21 23:54:44,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=862194.0, ans=0.125 2023-06-21 23:55:10,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=862254.0, ans=0.09899494936611666 2023-06-21 23:55:39,595 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:55:48,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=862314.0, ans=0.125 2023-06-21 23:55:56,728 INFO [train.py:996] (3/4) Epoch 5, batch 21750, loss[loss=0.2154, simple_loss=0.2829, pruned_loss=0.07392, over 21845.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2867, pruned_loss=0.07193, over 4264459.91 frames. ], batch size: 107, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:57:05,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=862494.0, ans=10.0 2023-06-21 23:57:58,712 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.550e+02 2.899e+02 3.518e+02 5.551e+02, threshold=5.797e+02, percent-clipped=2.0 2023-06-21 23:58:11,059 INFO [train.py:996] (3/4) Epoch 5, batch 21800, loss[loss=0.2016, simple_loss=0.2576, pruned_loss=0.07281, over 21581.00 frames. 
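A pattern worth noting in the Clipping_scale lines: the threshold is, to within rounding, 2.0 times the middle of the five reported grad-norm quartiles (just above, 2.0 * 2.612e+02 = 5.224e+02 against a reported threshold of 5.225e+02). That suggests gradients are clipped against a multiple of a running median norm, with percent-clipped reporting how often the clip bites. A hedged sketch of such a policy; the class name and buffer length are assumptions:

    import torch
    from collections import deque

    class MedianGradClipper:
        def __init__(self, clipping_scale=2.0, history=200):
            self.scale = clipping_scale
            self.norms = deque(maxlen=history)

        def clip_(self, parameters):
            params = [p for p in parameters if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
            self.norms.append(norm)
            threshold = self.scale * sorted(self.norms)[len(self.norms) // 2]
            if norm > threshold:
                for p in params:
                    p.grad.mul_(threshold / norm)
                return True  # averaged over a window -> "percent-clipped"
            return False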
], tot_loss[loss=0.2156, simple_loss=0.2852, pruned_loss=0.07297, over 4254881.77 frames. ], batch size: 230, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:59:18,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=862794.0, ans=0.0 2023-06-21 23:59:19,567 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:59:37,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=862854.0, ans=10.0 2023-06-22 00:00:17,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-22 00:00:25,046 INFO [train.py:996] (3/4) Epoch 5, batch 21850, loss[loss=0.2594, simple_loss=0.3449, pruned_loss=0.08697, over 19744.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.29, pruned_loss=0.07334, over 4229575.89 frames. ], batch size: 702, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:00:27,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=862974.0, ans=0.0 2023-06-22 00:00:36,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-22 00:00:47,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=863034.0, ans=0.2 2023-06-22 00:01:33,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=863094.0, ans=0.1 2023-06-22 00:01:48,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=863094.0, ans=0.2 2023-06-22 00:02:10,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=863154.0, ans=0.0 2023-06-22 00:02:32,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.444e+02 2.856e+02 3.585e+02 5.005e+02, threshold=5.712e+02, percent-clipped=0.0 2023-06-22 00:02:43,748 INFO [train.py:996] (3/4) Epoch 5, batch 21900, loss[loss=0.2152, simple_loss=0.271, pruned_loss=0.0797, over 21227.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2898, pruned_loss=0.07438, over 4242687.82 frames. ], batch size: 176, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:03:10,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=863334.0, ans=0.125 2023-06-22 00:03:56,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=863394.0, ans=0.125 2023-06-22 00:03:58,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=863394.0, ans=0.125 2023-06-22 00:04:05,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=12.0 2023-06-22 00:04:43,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=863514.0, ans=0.125 2023-06-22 00:04:56,136 INFO [train.py:996] (3/4) Epoch 5, batch 21950, loss[loss=0.1943, simple_loss=0.2714, pruned_loss=0.05854, over 21510.00 frames. 
], tot_loss[loss=0.2163, simple_loss=0.2856, pruned_loss=0.07353, over 4255287.68 frames. ], batch size: 441, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:05:13,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=863574.0, ans=15.0 2023-06-22 00:05:52,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=863694.0, ans=0.125 2023-06-22 00:06:38,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=863754.0, ans=0.125 2023-06-22 00:06:45,425 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.222e+02 2.484e+02 2.757e+02 5.064e+02, threshold=4.969e+02, percent-clipped=0.0 2023-06-22 00:07:03,567 INFO [train.py:996] (3/4) Epoch 5, batch 22000, loss[loss=0.2068, simple_loss=0.2726, pruned_loss=0.07048, over 21544.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2807, pruned_loss=0.07093, over 4249889.72 frames. ], batch size: 212, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:07:12,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=863874.0, ans=0.2 2023-06-22 00:07:55,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=863934.0, ans=0.07 2023-06-22 00:08:42,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-22 00:08:56,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=12.0 2023-06-22 00:09:07,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=864114.0, ans=0.2 2023-06-22 00:09:13,105 INFO [train.py:996] (3/4) Epoch 5, batch 22050, loss[loss=0.2165, simple_loss=0.2933, pruned_loss=0.06986, over 21381.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2854, pruned_loss=0.07176, over 4244264.38 frames. ], batch size: 176, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:09:15,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=864174.0, ans=0.125 2023-06-22 00:11:05,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=864354.0, ans=10.0 2023-06-22 00:11:10,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.782e+02 3.105e+02 3.760e+02 6.556e+02, threshold=6.210e+02, percent-clipped=5.0 2023-06-22 00:11:27,968 INFO [train.py:996] (3/4) Epoch 5, batch 22100, loss[loss=0.283, simple_loss=0.361, pruned_loss=0.1026, over 21279.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2972, pruned_loss=0.0769, over 4242713.26 frames. ], batch size: 549, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:11:40,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=864474.0, ans=0.0 2023-06-22 00:13:13,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.01 vs. limit=12.0 2023-06-22 00:13:55,540 INFO [train.py:996] (3/4) Epoch 5, batch 22150, loss[loss=0.2346, simple_loss=0.3084, pruned_loss=0.08036, over 21892.00 frames. 
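grad_scale in the batch summaries flips between 16.0 and 32.0 throughout this stretch, the signature of dynamic loss scaling in fp16 training: the scale grows after a run of overflow-free steps and is cut back when gradients overflow. The standard PyTorch pattern looks like the sketch below; torch.cuda.amp is the stock API, while the surrounding step function is a generic placeholder rather than this run's training loop:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_interval=2000)

    def train_step(model, optimizer, batch, criterion):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = criterion(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales grads, skips step on inf/nan
        scaler.update()                # doubles or halves the scale as needed
        return loss.detach()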
], tot_loss[loss=0.2301, simple_loss=0.3018, pruned_loss=0.07918, over 4254250.14 frames. ], batch size: 124, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:14:42,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=864834.0, ans=10.0 2023-06-22 00:15:40,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=864954.0, ans=0.5 2023-06-22 00:15:54,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.802e+02 3.197e+02 3.821e+02 5.658e+02, threshold=6.394e+02, percent-clipped=0.0 2023-06-22 00:16:08,468 INFO [train.py:996] (3/4) Epoch 5, batch 22200, loss[loss=0.223, simple_loss=0.302, pruned_loss=0.07196, over 21890.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3048, pruned_loss=0.08011, over 4260206.56 frames. ], batch size: 351, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:16:16,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=865074.0, ans=0.125 2023-06-22 00:17:42,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=865194.0, ans=0.0 2023-06-22 00:17:50,211 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:18:03,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=865254.0, ans=0.125 2023-06-22 00:18:32,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=12.0 2023-06-22 00:18:35,914 INFO [train.py:996] (3/4) Epoch 5, batch 22250, loss[loss=0.2656, simple_loss=0.3423, pruned_loss=0.09444, over 21841.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3096, pruned_loss=0.08151, over 4265657.51 frames. ], batch size: 107, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:19:49,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=865494.0, ans=0.125 2023-06-22 00:20:06,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=865554.0, ans=0.04949747468305833 2023-06-22 00:20:29,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=865614.0, ans=0.0 2023-06-22 00:20:30,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.644e+02 2.877e+02 3.306e+02 4.671e+02, threshold=5.753e+02, percent-clipped=0.0 2023-06-22 00:20:42,609 INFO [train.py:996] (3/4) Epoch 5, batch 22300, loss[loss=0.2738, simple_loss=0.3282, pruned_loss=0.1097, over 21762.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3115, pruned_loss=0.08339, over 4277654.37 frames. ], batch size: 441, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:22:38,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=865854.0, ans=0.0 2023-06-22 00:23:07,775 INFO [train.py:996] (3/4) Epoch 5, batch 22350, loss[loss=0.2363, simple_loss=0.3206, pruned_loss=0.07604, over 16859.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3098, pruned_loss=0.08413, over 4280489.14 frames. 
], batch size: 60, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:23:22,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=865974.0, ans=0.125 2023-06-22 00:23:35,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=865974.0, ans=0.2 2023-06-22 00:24:32,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=866094.0, ans=0.125 2023-06-22 00:24:36,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=866154.0, ans=0.0 2023-06-22 00:24:37,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-22 00:25:03,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-22 00:25:08,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.517e+02 2.753e+02 3.146e+02 5.107e+02, threshold=5.506e+02, percent-clipped=0.0 2023-06-22 00:25:13,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=866214.0, ans=0.125 2023-06-22 00:25:39,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-22 00:25:40,369 INFO [train.py:996] (3/4) Epoch 5, batch 22400, loss[loss=0.1761, simple_loss=0.2656, pruned_loss=0.04328, over 21737.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.306, pruned_loss=0.08123, over 4283189.36 frames. ], batch size: 282, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:25:47,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=866274.0, ans=0.0 2023-06-22 00:25:53,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=866334.0, ans=0.125 2023-06-22 00:27:02,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=866454.0, ans=0.025 2023-06-22 00:27:16,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-22 00:27:40,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=866514.0, ans=0.2 2023-06-22 00:27:44,989 INFO [train.py:996] (3/4) Epoch 5, batch 22450, loss[loss=0.2215, simple_loss=0.2852, pruned_loss=0.07895, over 21794.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2996, pruned_loss=0.07942, over 4279984.73 frames. ], batch size: 112, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:28:04,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=866574.0, ans=0.125 2023-06-22 00:28:41,102 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:28:52,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.35 vs. 
limit=22.5 2023-06-22 00:28:58,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=866754.0, ans=0.125 2023-06-22 00:29:33,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.555e+02 2.922e+02 4.023e+02 6.050e+02, threshold=5.844e+02, percent-clipped=3.0 2023-06-22 00:29:50,100 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-22 00:29:55,202 INFO [train.py:996] (3/4) Epoch 5, batch 22500, loss[loss=0.2128, simple_loss=0.2941, pruned_loss=0.06568, over 21380.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2939, pruned_loss=0.07848, over 4278604.09 frames. ], batch size: 194, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:30:43,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=866934.0, ans=0.125 2023-06-22 00:31:40,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=867054.0, ans=0.125 2023-06-22 00:32:10,856 INFO [train.py:996] (3/4) Epoch 5, batch 22550, loss[loss=0.2067, simple_loss=0.2813, pruned_loss=0.066, over 21855.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2995, pruned_loss=0.07918, over 4287906.54 frames. ], batch size: 282, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:32:38,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.81 vs. limit=15.0 2023-06-22 00:34:08,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=867354.0, ans=0.125 2023-06-22 00:34:15,152 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.802e+02 3.271e+02 3.660e+02 5.759e+02, threshold=6.543e+02, percent-clipped=0.0 2023-06-22 00:34:25,676 INFO [train.py:996] (3/4) Epoch 5, batch 22600, loss[loss=0.2405, simple_loss=0.3195, pruned_loss=0.08073, over 21647.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.302, pruned_loss=0.07936, over 4291429.44 frames. ], batch size: 389, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:35:47,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=867654.0, ans=10.0 2023-06-22 00:35:57,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=22.5 2023-06-22 00:36:35,207 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2023-06-22 00:36:38,638 INFO [train.py:996] (3/4) Epoch 5, batch 22650, loss[loss=0.2589, simple_loss=0.2943, pruned_loss=0.1118, over 21324.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.299, pruned_loss=0.07893, over 4288729.86 frames. 
], batch size: 507, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:36:39,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=867774.0, ans=0.125 2023-06-22 00:37:28,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=867834.0, ans=0.2 2023-06-22 00:37:45,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=867894.0, ans=0.2 2023-06-22 00:38:05,272 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.32 vs. limit=15.0 2023-06-22 00:38:33,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.499e+02 2.940e+02 3.814e+02 6.401e+02, threshold=5.879e+02, percent-clipped=0.0 2023-06-22 00:38:58,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-22 00:38:58,679 INFO [train.py:996] (3/4) Epoch 5, batch 22700, loss[loss=0.2027, simple_loss=0.2702, pruned_loss=0.06758, over 21898.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2939, pruned_loss=0.07863, over 4283945.59 frames. ], batch size: 107, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:39:26,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=868134.0, ans=0.1 2023-06-22 00:39:27,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=868134.0, ans=0.125 2023-06-22 00:39:59,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=868194.0, ans=0.125 2023-06-22 00:40:07,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-22 00:40:41,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=868314.0, ans=0.125 2023-06-22 00:40:52,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=868374.0, ans=0.0 2023-06-22 00:40:59,347 INFO [train.py:996] (3/4) Epoch 5, batch 22750, loss[loss=0.2599, simple_loss=0.3262, pruned_loss=0.09681, over 21173.00 frames. ], tot_loss[loss=0.228, simple_loss=0.295, pruned_loss=0.08055, over 4270367.32 frames. ], batch size: 143, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:41:34,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=868434.0, ans=0.2 2023-06-22 00:42:58,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.771e+02 3.219e+02 3.779e+02 6.245e+02, threshold=6.438e+02, percent-clipped=2.0 2023-06-22 00:43:15,168 INFO [train.py:996] (3/4) Epoch 5, batch 22800, loss[loss=0.222, simple_loss=0.2941, pruned_loss=0.07495, over 21361.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2987, pruned_loss=0.08284, over 4272794.84 frames. 
], batch size: 143, lr: 6.06e-03, grad_scale: 32.0 2023-06-22 00:43:36,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=868734.0, ans=0.0 2023-06-22 00:44:05,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-22 00:44:31,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=868854.0, ans=0.125 2023-06-22 00:44:32,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-22 00:45:16,020 INFO [train.py:996] (3/4) Epoch 5, batch 22850, loss[loss=0.2107, simple_loss=0.2676, pruned_loss=0.07689, over 21559.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2954, pruned_loss=0.08157, over 4279619.87 frames. ], batch size: 263, lr: 6.06e-03, grad_scale: 32.0 2023-06-22 00:46:14,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=869034.0, ans=0.2 2023-06-22 00:46:32,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=869154.0, ans=0.125 2023-06-22 00:46:51,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=869154.0, ans=0.125 2023-06-22 00:46:51,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=869154.0, ans=0.2 2023-06-22 00:47:36,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.707e+02 3.130e+02 3.568e+02 4.939e+02, threshold=6.260e+02, percent-clipped=0.0 2023-06-22 00:47:59,196 INFO [train.py:996] (3/4) Epoch 5, batch 22900, loss[loss=0.2361, simple_loss=0.3138, pruned_loss=0.07916, over 21747.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2986, pruned_loss=0.08095, over 4279014.37 frames. ], batch size: 351, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:48:49,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=869394.0, ans=0.125 2023-06-22 00:49:24,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=869454.0, ans=0.125 2023-06-22 00:49:38,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=869454.0, ans=0.125 2023-06-22 00:50:33,830 INFO [train.py:996] (3/4) Epoch 5, batch 22950, loss[loss=0.2116, simple_loss=0.2639, pruned_loss=0.07965, over 19910.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3135, pruned_loss=0.07933, over 4278019.15 frames. 
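tot_loss is always reported "over N frames": a frame-weighted aggregate of recent batch losses rather than an unweighted mean over batches, which matters when batch sizes swing between tens and hundreds of cuts. A minimal sketch of such a tracker; the exponential decay factor is an assumption made for illustration:

    class FrameWeightedLoss:
        """Aggregate per-batch losses weighted by their frame counts."""

        def __init__(self, decay=0.995):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss, num_frames):
            # decay old statistics so the aggregate tracks recent batches
            self.loss_sum = self.loss_sum * self.decay + loss * num_frames
            self.frames = self.frames * self.decay + num_frames

        @property
        def value(self):
            return self.loss_sum / max(self.frames, 1.0)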
], batch size: 702, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:51:26,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=869694.0, ans=0.1 2023-06-22 00:51:26,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=869694.0, ans=0.0 2023-06-22 00:52:18,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=869814.0, ans=0.125 2023-06-22 00:52:23,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.567e+02 2.952e+02 3.530e+02 5.726e+02, threshold=5.904e+02, percent-clipped=0.0 2023-06-22 00:52:44,562 INFO [train.py:996] (3/4) Epoch 5, batch 23000, loss[loss=0.2199, simple_loss=0.2827, pruned_loss=0.0785, over 21536.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3117, pruned_loss=0.0777, over 4278645.60 frames. ], batch size: 195, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:52:52,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=869874.0, ans=0.0 2023-06-22 00:53:58,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=869994.0, ans=0.0 2023-06-22 00:54:01,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-22 00:55:02,670 INFO [train.py:996] (3/4) Epoch 5, batch 23050, loss[loss=0.252, simple_loss=0.3282, pruned_loss=0.08795, over 21906.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3127, pruned_loss=0.07939, over 4286152.83 frames. ], batch size: 316, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:56:18,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=870294.0, ans=0.1 2023-06-22 00:56:52,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=870354.0, ans=0.125 2023-06-22 00:56:53,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=870354.0, ans=0.015 2023-06-22 00:57:09,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 2.630e+02 3.010e+02 3.445e+02 5.620e+02, threshold=6.019e+02, percent-clipped=0.0 2023-06-22 00:57:18,354 INFO [train.py:996] (3/4) Epoch 5, batch 23100, loss[loss=0.1906, simple_loss=0.2555, pruned_loss=0.06288, over 21844.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3075, pruned_loss=0.07961, over 4280384.46 frames. ], batch size: 317, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:57:25,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=22.5 2023-06-22 00:58:02,470 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:58:50,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=870654.0, ans=0.04949747468305833 2023-06-22 00:59:29,227 INFO [train.py:996] (3/4) Epoch 5, batch 23150, loss[loss=0.2161, simple_loss=0.2837, pruned_loss=0.07427, over 21932.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3005, pruned_loss=0.07821, over 4280298.41 frames. 
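The occasional WithLoss lines report an auxiliary penalty attached to the attention weights; a loss-sum of 0.000e+00 means the penalty contributed nothing on those batches. One generic way to attach a side loss to a tensor without changing its forward value is a custom autograd function. The toy quadratic penalty below only demonstrates the mechanism and is not the module behind these messages:

    import torch

    class IdentityWithAuxLoss(torch.autograd.Function):
        """Pass x through unchanged, but add the gradient of an aux penalty."""

        @staticmethod
        def forward(ctx, x, aux_scale):
            ctx.save_for_backward(x)
            ctx.aux_scale = aux_scale
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            with torch.enable_grad():
                xd = x.detach().requires_grad_(True)
                aux = ctx.aux_scale * (xd ** 2).mean()  # toy penalty
                (aux_grad,) = torch.autograd.grad(aux, xd)
            return grad_out + aux_grad, None

    attn = torch.softmax(torch.randn(8, 16, 16, requires_grad=True), dim=-1)
    attn = IdentityWithAuxLoss.apply(attn, 0.01)  # forward value unchanged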
], batch size: 316, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:59:50,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=12.0 2023-06-22 00:59:50,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=870834.0, ans=0.125 2023-06-22 01:00:50,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-22 01:01:30,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.556e+02 2.836e+02 3.479e+02 5.811e+02, threshold=5.672e+02, percent-clipped=0.0 2023-06-22 01:01:38,836 INFO [train.py:996] (3/4) Epoch 5, batch 23200, loss[loss=0.2818, simple_loss=0.3253, pruned_loss=0.1191, over 21803.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2988, pruned_loss=0.07882, over 4290599.16 frames. ], batch size: 508, lr: 6.06e-03, grad_scale: 32.0 2023-06-22 01:01:42,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=871074.0, ans=0.2 2023-06-22 01:01:59,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=871134.0, ans=0.125 2023-06-22 01:03:48,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=871314.0, ans=0.1 2023-06-22 01:03:54,143 INFO [train.py:996] (3/4) Epoch 5, batch 23250, loss[loss=0.2294, simple_loss=0.2943, pruned_loss=0.08225, over 21475.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2993, pruned_loss=0.08011, over 4293028.09 frames. ], batch size: 211, lr: 6.06e-03, grad_scale: 32.0 2023-06-22 01:03:57,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=871374.0, ans=0.1 2023-06-22 01:03:59,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=871374.0, ans=0.1 2023-06-22 01:04:11,894 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-22 01:04:41,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=871494.0, ans=0.2 2023-06-22 01:04:43,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=22.5 2023-06-22 01:04:50,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2023-06-22 01:05:23,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=871554.0, ans=0.125 2023-06-22 01:05:46,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=871614.0, ans=0.0 2023-06-22 01:05:53,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.691e+02 3.046e+02 3.824e+02 6.255e+02, threshold=6.093e+02, percent-clipped=3.0 2023-06-22 01:05:57,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=871614.0, ans=0.1 2023-06-22 01:06:02,687 INFO [train.py:996] (3/4) Epoch 5, batch 23300, loss[loss=0.2328, simple_loss=0.3278, pruned_loss=0.0689, over 20719.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3086, pruned_loss=0.08284, over 4290558.00 frames. ], batch size: 607, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:06:06,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=871674.0, ans=0.125 2023-06-22 01:07:21,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-22 01:07:24,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=871794.0, ans=0.125 2023-06-22 01:08:37,160 INFO [train.py:996] (3/4) Epoch 5, batch 23350, loss[loss=0.176, simple_loss=0.247, pruned_loss=0.05248, over 21077.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3115, pruned_loss=0.08127, over 4290720.02 frames. ], batch size: 143, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:08:45,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-22 01:09:58,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=872094.0, ans=0.2 2023-06-22 01:10:10,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=872154.0, ans=0.125 2023-06-22 01:10:20,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=872214.0, ans=0.0 2023-06-22 01:10:31,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.322e+02 2.505e+02 2.999e+02 5.635e+02, threshold=5.010e+02, percent-clipped=0.0 2023-06-22 01:10:48,127 INFO [train.py:996] (3/4) Epoch 5, batch 23400, loss[loss=0.2209, simple_loss=0.2855, pruned_loss=0.0782, over 21447.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3047, pruned_loss=0.0779, over 4288260.32 frames. ], batch size: 176, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:11:56,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=872394.0, ans=0.2 2023-06-22 01:12:04,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. 
limit=6.0 2023-06-22 01:12:36,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=872514.0, ans=0.125 2023-06-22 01:13:11,277 INFO [train.py:996] (3/4) Epoch 5, batch 23450, loss[loss=0.2597, simple_loss=0.3283, pruned_loss=0.09554, over 21934.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3054, pruned_loss=0.08063, over 4275191.01 frames. ], batch size: 372, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:13:13,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=872574.0, ans=0.1 2023-06-22 01:15:13,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.562e+02 2.957e+02 3.488e+02 7.125e+02, threshold=5.914e+02, percent-clipped=6.0 2023-06-22 01:15:35,303 INFO [train.py:996] (3/4) Epoch 5, batch 23500, loss[loss=0.2191, simple_loss=0.2734, pruned_loss=0.08243, over 21137.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3064, pruned_loss=0.08307, over 4282157.81 frames. ], batch size: 608, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:16:20,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=872934.0, ans=0.125 2023-06-22 01:17:16,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=873114.0, ans=0.1 2023-06-22 01:17:18,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=873114.0, ans=0.1 2023-06-22 01:17:32,719 INFO [train.py:996] (3/4) Epoch 5, batch 23550, loss[loss=0.235, simple_loss=0.2973, pruned_loss=0.08635, over 21429.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3029, pruned_loss=0.08226, over 4271387.45 frames. ], batch size: 389, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:17:35,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-22 01:18:40,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=22.5 2023-06-22 01:18:52,436 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=22.5 2023-06-22 01:18:55,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0 2023-06-22 01:19:01,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=873354.0, ans=0.0 2023-06-22 01:19:04,668 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-22 01:19:27,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.576e+02 2.870e+02 3.585e+02 5.605e+02, threshold=5.739e+02, percent-clipped=0.0 2023-06-22 01:19:28,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-06-22 01:19:45,033 INFO [train.py:996] (3/4) Epoch 5, batch 23600, loss[loss=0.2068, simple_loss=0.2455, pruned_loss=0.08406, over 20024.00 frames. 
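The "batch size" figure in these summaries is the number of cuts per batch, and it varies by an order of magnitude (from 60 to around 700 within this section) because batches are assembled up to a duration budget rather than a fixed count: long utterances pack few per batch, short ones pack many. A toy duration-capped batcher illustrating the idea; real samplers also bucket by length and shuffle, and every name below is an assumption:

    from collections import namedtuple

    Cut = namedtuple("Cut", "id duration")  # stand-in for a real audio cut

    def duration_batches(cuts, max_duration_s):
        """Group cuts so each batch's total duration stays under the budget."""
        batch, total = [], 0.0
        for cut in sorted(cuts, key=lambda c: c.duration):  # crude bucketing
            if batch and total + cut.duration > max_duration_s:
                yield batch
                batch, total = [], 0.0
            batch.append(cut)
            total += cut.duration
        if batch:
            yield batch

    cuts = [Cut(i, d) for i, d in enumerate([1.5, 2.0, 2.2, 14.0, 15.5, 16.0])]
    for b in duration_batches(cuts, max_duration_s=30.0):
        print(len(b), round(sum(c.duration for c in b), 1))  # 4, 1, 1 cuts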
], tot_loss[loss=0.233, simple_loss=0.3016, pruned_loss=0.08219, over 4263624.16 frames. ], batch size: 702, lr: 6.05e-03, grad_scale: 32.0 2023-06-22 01:20:00,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=873474.0, ans=0.125 2023-06-22 01:20:12,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=873474.0, ans=0.1 2023-06-22 01:20:13,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=12.0 2023-06-22 01:20:39,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=873534.0, ans=0.5 2023-06-22 01:20:48,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=873594.0, ans=0.07 2023-06-22 01:21:21,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=873654.0, ans=0.125 2023-06-22 01:22:05,184 INFO [train.py:996] (3/4) Epoch 5, batch 23650, loss[loss=0.3194, simple_loss=0.3728, pruned_loss=0.133, over 21407.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3032, pruned_loss=0.08147, over 4270226.02 frames. ], batch size: 507, lr: 6.05e-03, grad_scale: 32.0 2023-06-22 01:24:19,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=874014.0, ans=0.125 2023-06-22 01:24:28,596 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.418e+02 2.647e+02 3.193e+02 5.088e+02, threshold=5.293e+02, percent-clipped=0.0 2023-06-22 01:24:41,716 INFO [train.py:996] (3/4) Epoch 5, batch 23700, loss[loss=0.2404, simple_loss=0.3285, pruned_loss=0.07611, over 21284.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3059, pruned_loss=0.08015, over 4271689.74 frames. ], batch size: 549, lr: 6.05e-03, grad_scale: 32.0 2023-06-22 01:24:43,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=874074.0, ans=0.125 2023-06-22 01:24:45,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=874074.0, ans=0.125 2023-06-22 01:25:22,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=874194.0, ans=0.125 2023-06-22 01:25:44,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=874194.0, ans=0.09899494936611666 2023-06-22 01:25:49,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=874254.0, ans=0.04949747468305833 2023-06-22 01:26:56,477 INFO [train.py:996] (3/4) Epoch 5, batch 23750, loss[loss=0.2055, simple_loss=0.2946, pruned_loss=0.05822, over 21444.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3088, pruned_loss=0.0812, over 4269199.37 frames. 
], batch size: 194, lr: 6.05e-03, grad_scale: 32.0 2023-06-22 01:27:13,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=874434.0, ans=0.1 2023-06-22 01:29:02,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.409e+02 2.705e+02 3.135e+02 4.675e+02, threshold=5.410e+02, percent-clipped=0.0 2023-06-22 01:29:08,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=874674.0, ans=0.125 2023-06-22 01:29:10,007 INFO [train.py:996] (3/4) Epoch 5, batch 23800, loss[loss=0.2784, simple_loss=0.3638, pruned_loss=0.0965, over 21654.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3056, pruned_loss=0.07817, over 4270199.56 frames. ], batch size: 389, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:29:57,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=874734.0, ans=0.125 2023-06-22 01:30:00,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=874794.0, ans=0.035 2023-06-22 01:30:55,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-22 01:31:28,763 INFO [train.py:996] (3/4) Epoch 5, batch 23850, loss[loss=0.1702, simple_loss=0.2205, pruned_loss=0.05991, over 16192.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3144, pruned_loss=0.08076, over 4267786.75 frames. ], batch size: 61, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:32:01,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=875034.0, ans=0.125 2023-06-22 01:32:07,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=875034.0, ans=0.0 2023-06-22 01:33:37,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=875214.0, ans=0.125 2023-06-22 01:33:37,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-22 01:33:39,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.789e+02 3.123e+02 3.786e+02 8.749e+02, threshold=6.247e+02, percent-clipped=6.0 2023-06-22 01:34:00,441 INFO [train.py:996] (3/4) Epoch 5, batch 23900, loss[loss=0.2163, simple_loss=0.2791, pruned_loss=0.07677, over 21088.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.322, pruned_loss=0.08278, over 4258642.51 frames. ], batch size: 143, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:35:14,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=875454.0, ans=0.2 2023-06-22 01:35:15,047 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:35:24,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-22 01:35:53,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.51 vs. 
limit=15.0 2023-06-22 01:35:54,015 INFO [train.py:996] (3/4) Epoch 5, batch 23950, loss[loss=0.2625, simple_loss=0.3197, pruned_loss=0.1026, over 21623.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3141, pruned_loss=0.08166, over 4262470.56 frames. ], batch size: 441, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:36:50,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=875634.0, ans=0.125 2023-06-22 01:37:27,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=875754.0, ans=0.1 2023-06-22 01:37:55,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.517e+02 2.934e+02 3.377e+02 5.286e+02, threshold=5.869e+02, percent-clipped=0.0 2023-06-22 01:37:58,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=875814.0, ans=0.0 2023-06-22 01:38:06,338 INFO [train.py:996] (3/4) Epoch 5, batch 24000, loss[loss=0.2571, simple_loss=0.3274, pruned_loss=0.09345, over 21473.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3149, pruned_loss=0.08395, over 4270378.57 frames. ], batch size: 211, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:38:06,339 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 01:38:46,517 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2672, simple_loss=0.3621, pruned_loss=0.08617, over 1796401.00 frames. 2023-06-22 01:38:46,518 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-22 01:39:27,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=875934.0, ans=0.125 2023-06-22 01:39:27,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=875934.0, ans=0.125 2023-06-22 01:41:23,103 INFO [train.py:996] (3/4) Epoch 5, batch 24050, loss[loss=0.2019, simple_loss=0.3053, pruned_loss=0.04923, over 20868.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3172, pruned_loss=0.0844, over 4273786.21 frames. ], batch size: 608, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:42:10,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=876294.0, ans=0.125 2023-06-22 01:42:14,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=876294.0, ans=0.125 2023-06-22 01:43:07,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=876354.0, ans=0.1 2023-06-22 01:43:24,680 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.535e+02 2.835e+02 3.239e+02 5.691e+02, threshold=5.670e+02, percent-clipped=0.0 2023-06-22 01:43:36,732 INFO [train.py:996] (3/4) Epoch 5, batch 24100, loss[loss=0.2447, simple_loss=0.3181, pruned_loss=0.08567, over 21168.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3175, pruned_loss=0.08319, over 4268107.63 frames. ], batch size: 143, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:44:00,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. 
limit=15.0 2023-06-22 01:45:08,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=876654.0, ans=0.2 2023-06-22 01:45:14,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=876714.0, ans=0.0 2023-06-22 01:45:38,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=876714.0, ans=0.125 2023-06-22 01:45:49,885 INFO [train.py:996] (3/4) Epoch 5, batch 24150, loss[loss=0.2575, simple_loss=0.3303, pruned_loss=0.09232, over 21815.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3169, pruned_loss=0.085, over 4278927.44 frames. ], batch size: 124, lr: 6.04e-03, grad_scale: 16.0 2023-06-22 01:45:51,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=876774.0, ans=0.125 2023-06-22 01:48:02,125 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.649e+02 3.023e+02 3.593e+02 4.486e+02, threshold=6.046e+02, percent-clipped=0.0 2023-06-22 01:48:06,739 INFO [train.py:996] (3/4) Epoch 5, batch 24200, loss[loss=0.2339, simple_loss=0.3159, pruned_loss=0.07595, over 21651.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3196, pruned_loss=0.08697, over 4285457.88 frames. ], batch size: 263, lr: 6.04e-03, grad_scale: 16.0 2023-06-22 01:48:55,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=877194.0, ans=0.125 2023-06-22 01:49:04,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=877194.0, ans=0.125 2023-06-22 01:50:13,916 INFO [train.py:996] (3/4) Epoch 5, batch 24250, loss[loss=0.1867, simple_loss=0.2937, pruned_loss=0.03988, over 21656.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3169, pruned_loss=0.08056, over 4288031.45 frames. ], batch size: 389, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 01:50:31,065 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-22 01:50:54,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=877434.0, ans=0.0 2023-06-22 01:52:29,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=877614.0, ans=0.2 2023-06-22 01:52:32,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.979e+02 2.289e+02 2.942e+02 4.339e+02, threshold=4.579e+02, percent-clipped=0.0 2023-06-22 01:52:36,838 INFO [train.py:996] (3/4) Epoch 5, batch 24300, loss[loss=0.1791, simple_loss=0.261, pruned_loss=0.04863, over 21822.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3085, pruned_loss=0.075, over 4287976.67 frames. ], batch size: 316, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 01:52:43,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.42 vs. 
limit=12.0 2023-06-22 01:53:00,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=877674.0, ans=0.125 2023-06-22 01:53:26,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=877734.0, ans=0.0 2023-06-22 01:54:44,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=877914.0, ans=0.125 2023-06-22 01:54:50,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=877914.0, ans=0.0 2023-06-22 01:54:56,187 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.90 vs. limit=15.0 2023-06-22 01:54:56,430 INFO [train.py:996] (3/4) Epoch 5, batch 24350, loss[loss=0.2039, simple_loss=0.2756, pruned_loss=0.06612, over 21673.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3044, pruned_loss=0.07452, over 4292589.17 frames. ], batch size: 263, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 01:54:59,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=877974.0, ans=0.125 2023-06-22 01:56:00,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=878094.0, ans=0.0 2023-06-22 01:57:07,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.557e+02 2.877e+02 3.455e+02 4.976e+02, threshold=5.754e+02, percent-clipped=3.0 2023-06-22 01:57:18,448 INFO [train.py:996] (3/4) Epoch 5, batch 24400, loss[loss=0.2155, simple_loss=0.2922, pruned_loss=0.06934, over 20736.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3093, pruned_loss=0.07792, over 4288603.86 frames. ], batch size: 608, lr: 6.03e-03, grad_scale: 32.0 2023-06-22 01:57:24,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=878274.0, ans=0.1 2023-06-22 01:57:54,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=22.5 2023-06-22 01:59:08,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.63 vs. limit=10.0 2023-06-22 01:59:40,609 INFO [train.py:996] (3/4) Epoch 5, batch 24450, loss[loss=0.2496, simple_loss=0.3403, pruned_loss=0.07941, over 21676.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3121, pruned_loss=0.07978, over 4286013.95 frames. 
], batch size: 389, lr: 6.03e-03, grad_scale: 32.0 2023-06-22 02:00:57,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=878694.0, ans=0.125 2023-06-22 02:01:20,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=878814.0, ans=0.2 2023-06-22 02:01:40,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=878814.0, ans=0.04949747468305833 2023-06-22 02:01:44,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=878814.0, ans=0.125 2023-06-22 02:01:46,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.690e+02 3.074e+02 4.098e+02 6.316e+02, threshold=6.149e+02, percent-clipped=3.0 2023-06-22 02:02:00,020 INFO [train.py:996] (3/4) Epoch 5, batch 24500, loss[loss=0.1886, simple_loss=0.2599, pruned_loss=0.05864, over 17232.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3124, pruned_loss=0.07938, over 4288335.60 frames. ], batch size: 65, lr: 6.03e-03, grad_scale: 32.0 2023-06-22 02:02:00,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=878874.0, ans=0.2 2023-06-22 02:03:31,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=879054.0, ans=0.0 2023-06-22 02:04:01,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=879114.0, ans=0.125 2023-06-22 02:04:19,099 INFO [train.py:996] (3/4) Epoch 5, batch 24550, loss[loss=0.2994, simple_loss=0.3715, pruned_loss=0.1136, over 21558.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3149, pruned_loss=0.08188, over 4288694.48 frames. 
], batch size: 414, lr: 6.03e-03, grad_scale: 32.0 2023-06-22 02:04:21,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=879174.0, ans=0.125 2023-06-22 02:04:36,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=879174.0, ans=0.0 2023-06-22 02:04:39,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=879174.0, ans=0.125 2023-06-22 02:04:49,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=879234.0, ans=0.125 2023-06-22 02:04:50,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=879234.0, ans=0.2 2023-06-22 02:05:17,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=879294.0, ans=0.125 2023-06-22 02:05:26,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=879294.0, ans=0.0 2023-06-22 02:05:34,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=879354.0, ans=0.0 2023-06-22 02:06:24,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=879414.0, ans=0.125 2023-06-22 02:06:28,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.618e+02 2.906e+02 3.324e+02 4.617e+02, threshold=5.812e+02, percent-clipped=0.0 2023-06-22 02:06:31,241 INFO [train.py:996] (3/4) Epoch 5, batch 24600, loss[loss=0.2142, simple_loss=0.285, pruned_loss=0.07166, over 21737.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3101, pruned_loss=0.08222, over 4279903.75 frames. ], batch size: 316, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 02:06:57,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-22 02:07:10,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=879534.0, ans=0.0 2023-06-22 02:08:09,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=879654.0, ans=0.125 2023-06-22 02:08:40,861 INFO [train.py:996] (3/4) Epoch 5, batch 24650, loss[loss=0.1883, simple_loss=0.2491, pruned_loss=0.06378, over 21452.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3027, pruned_loss=0.08042, over 4267841.96 frames. ], batch size: 212, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 02:10:25,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-22 02:10:49,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=880014.0, ans=0.0 2023-06-22 02:10:55,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.542e+02 2.905e+02 3.356e+02 5.377e+02, threshold=5.810e+02, percent-clipped=0.0 2023-06-22 02:10:58,330 INFO [train.py:996] (3/4) Epoch 5, batch 24700, loss[loss=0.2045, simple_loss=0.2796, pruned_loss=0.06469, over 15325.00 frames. 
], tot_loss[loss=0.2299, simple_loss=0.3027, pruned_loss=0.07854, over 4258123.32 frames. ], batch size: 61, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 02:10:58,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=880074.0, ans=0.0 2023-06-22 02:11:54,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=880194.0, ans=0.1 2023-06-22 02:12:38,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=22.5 2023-06-22 02:12:54,973 INFO [train.py:996] (3/4) Epoch 5, batch 24750, loss[loss=0.1861, simple_loss=0.265, pruned_loss=0.05366, over 21496.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2953, pruned_loss=0.07581, over 4259970.66 frames. ], batch size: 389, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:13:52,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=880494.0, ans=0.0 2023-06-22 02:14:21,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=880554.0, ans=0.05 2023-06-22 02:14:22,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-22 02:14:31,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=880554.0, ans=0.1 2023-06-22 02:15:01,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=880614.0, ans=0.125 2023-06-22 02:15:04,034 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.468e+02 2.817e+02 3.299e+02 5.341e+02, threshold=5.634e+02, percent-clipped=0.0 2023-06-22 02:15:04,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=880614.0, ans=0.0 2023-06-22 02:15:06,737 INFO [train.py:996] (3/4) Epoch 5, batch 24800, loss[loss=0.2192, simple_loss=0.28, pruned_loss=0.07924, over 21549.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2899, pruned_loss=0.07591, over 4263778.87 frames. ], batch size: 391, lr: 6.02e-03, grad_scale: 32.0 2023-06-22 02:15:26,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=880674.0, ans=0.0 2023-06-22 02:16:09,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=880794.0, ans=0.1 2023-06-22 02:16:11,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=880794.0, ans=0.125 2023-06-22 02:17:09,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=880914.0, ans=0.1 2023-06-22 02:17:22,833 INFO [train.py:996] (3/4) Epoch 5, batch 24850, loss[loss=0.2113, simple_loss=0.2718, pruned_loss=0.07544, over 21192.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2904, pruned_loss=0.07744, over 4275370.05 frames. 
], batch size: 608, lr: 6.02e-03, grad_scale: 32.0 2023-06-22 02:18:15,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=881034.0, ans=10.0 2023-06-22 02:19:34,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=881214.0, ans=0.125 2023-06-22 02:19:35,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.607e+02 3.027e+02 3.615e+02 6.896e+02, threshold=6.053e+02, percent-clipped=2.0 2023-06-22 02:19:38,480 INFO [train.py:996] (3/4) Epoch 5, batch 24900, loss[loss=0.2756, simple_loss=0.3576, pruned_loss=0.09682, over 21286.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2947, pruned_loss=0.07868, over 4276966.73 frames. ], batch size: 548, lr: 6.02e-03, grad_scale: 32.0 2023-06-22 02:20:28,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=881334.0, ans=0.1 2023-06-22 02:22:16,553 INFO [train.py:996] (3/4) Epoch 5, batch 24950, loss[loss=0.2674, simple_loss=0.3549, pruned_loss=0.08998, over 21760.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3021, pruned_loss=0.08188, over 4276100.62 frames. ], batch size: 124, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:22:26,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=881574.0, ans=0.0 2023-06-22 02:22:34,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=881634.0, ans=0.125 2023-06-22 02:22:54,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=22.5 2023-06-22 02:24:00,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=881754.0, ans=0.125 2023-06-22 02:24:10,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.32 vs. limit=10.0 2023-06-22 02:24:31,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=881814.0, ans=0.0 2023-06-22 02:24:34,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.886e+02 3.310e+02 3.841e+02 8.420e+02, threshold=6.620e+02, percent-clipped=6.0 2023-06-22 02:24:35,701 INFO [train.py:996] (3/4) Epoch 5, batch 25000, loss[loss=0.214, simple_loss=0.293, pruned_loss=0.06747, over 21797.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3092, pruned_loss=0.08373, over 4283403.85 frames. ], batch size: 107, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:24:42,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=881874.0, ans=0.0 2023-06-22 02:24:55,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=881934.0, ans=0.1 2023-06-22 02:26:10,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=882114.0, ans=0.0 2023-06-22 02:26:44,208 INFO [train.py:996] (3/4) Epoch 5, batch 25050, loss[loss=0.1948, simple_loss=0.2598, pruned_loss=0.06493, over 21488.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3014, pruned_loss=0.08189, over 4277606.48 frames. 
], batch size: 212, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:27:09,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=882234.0, ans=0.125 2023-06-22 02:27:48,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=882294.0, ans=0.1 2023-06-22 02:27:58,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-22 02:28:57,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.524e+02 2.786e+02 3.350e+02 6.215e+02, threshold=5.572e+02, percent-clipped=0.0 2023-06-22 02:28:59,709 INFO [train.py:996] (3/4) Epoch 5, batch 25100, loss[loss=0.2222, simple_loss=0.2865, pruned_loss=0.07895, over 21328.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2954, pruned_loss=0.08058, over 4283633.03 frames. ], batch size: 144, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:29:07,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=882474.0, ans=0.125 2023-06-22 02:29:29,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-06-22 02:30:09,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=882594.0, ans=0.0 2023-06-22 02:31:06,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-22 02:31:07,268 INFO [train.py:996] (3/4) Epoch 5, batch 25150, loss[loss=0.22, simple_loss=0.3069, pruned_loss=0.06654, over 21829.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2994, pruned_loss=0.07847, over 4265650.60 frames. ], batch size: 351, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:31:10,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=882774.0, ans=0.0 2023-06-22 02:31:21,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=882834.0, ans=0.2 2023-06-22 02:31:29,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-22 02:32:20,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=882954.0, ans=0.04949747468305833 2023-06-22 02:32:33,368 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:32:45,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=883014.0, ans=0.5 2023-06-22 02:33:00,565 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.215e+02 2.479e+02 2.821e+02 4.157e+02, threshold=4.958e+02, percent-clipped=0.0 2023-06-22 02:33:02,329 INFO [train.py:996] (3/4) Epoch 5, batch 25200, loss[loss=0.1989, simple_loss=0.2869, pruned_loss=0.05544, over 21702.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2993, pruned_loss=0.07691, over 4263811.49 frames. 
], batch size: 247, lr: 6.02e-03, grad_scale: 32.0 2023-06-22 02:33:12,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=883074.0, ans=0.05 2023-06-22 02:33:46,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=883134.0, ans=0.0 2023-06-22 02:34:27,264 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:35:17,777 INFO [train.py:996] (3/4) Epoch 5, batch 25250, loss[loss=0.268, simple_loss=0.307, pruned_loss=0.1145, over 21363.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2962, pruned_loss=0.07533, over 4267607.33 frames. ], batch size: 508, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:35:47,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=883434.0, ans=0.0 2023-06-22 02:35:49,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=883434.0, ans=0.5 2023-06-22 02:36:24,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=883494.0, ans=0.0 2023-06-22 02:37:20,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.367e+02 2.631e+02 2.996e+02 4.981e+02, threshold=5.263e+02, percent-clipped=1.0 2023-06-22 02:37:27,877 INFO [train.py:996] (3/4) Epoch 5, batch 25300, loss[loss=0.217, simple_loss=0.2953, pruned_loss=0.06937, over 21707.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2941, pruned_loss=0.07557, over 4271767.52 frames. ], batch size: 298, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:38:15,206 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:38:17,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-22 02:39:44,738 INFO [train.py:996] (3/4) Epoch 5, batch 25350, loss[loss=0.1696, simple_loss=0.2594, pruned_loss=0.03985, over 21612.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2951, pruned_loss=0.07471, over 4264441.90 frames. ], batch size: 263, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:39:54,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=883974.0, ans=0.125 2023-06-22 02:40:23,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=884094.0, ans=0.125 2023-06-22 02:40:28,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=884094.0, ans=0.2 2023-06-22 02:41:04,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=884154.0, ans=0.1 2023-06-22 02:41:52,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=884214.0, ans=10.0 2023-06-22 02:41:59,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 2.361e+02 2.646e+02 3.091e+02 5.117e+02, threshold=5.293e+02, percent-clipped=0.0 2023-06-22 02:41:59,357 INFO [train.py:996] (3/4) Epoch 5, batch 25400, loss[loss=0.2011, simple_loss=0.2616, pruned_loss=0.07027, over 21377.00 frames. 
], tot_loss[loss=0.2197, simple_loss=0.2907, pruned_loss=0.07432, over 4259571.78 frames. ], batch size: 194, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:42:09,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=884274.0, ans=0.05 2023-06-22 02:42:34,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2023-06-22 02:44:14,162 INFO [train.py:996] (3/4) Epoch 5, batch 25450, loss[loss=0.2297, simple_loss=0.3264, pruned_loss=0.06651, over 21631.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2919, pruned_loss=0.07546, over 4260872.72 frames. ], batch size: 263, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:46:30,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.236e+02 2.528e+02 3.001e+02 4.958e+02, threshold=5.055e+02, percent-clipped=0.0 2023-06-22 02:46:30,139 INFO [train.py:996] (3/4) Epoch 5, batch 25500, loss[loss=0.2629, simple_loss=0.344, pruned_loss=0.09087, over 21576.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2918, pruned_loss=0.07189, over 4246085.92 frames. ], batch size: 414, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:46:54,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=884934.0, ans=0.125 2023-06-22 02:48:47,572 INFO [train.py:996] (3/4) Epoch 5, batch 25550, loss[loss=0.3144, simple_loss=0.3926, pruned_loss=0.1181, over 21426.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.299, pruned_loss=0.07217, over 4250375.23 frames. ], batch size: 507, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:49:11,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-22 02:49:39,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=885234.0, ans=0.2 2023-06-22 02:50:06,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=885294.0, ans=0.0 2023-06-22 02:50:06,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=885294.0, ans=0.1 2023-06-22 02:51:01,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.415e+02 2.727e+02 3.146e+02 6.002e+02, threshold=5.455e+02, percent-clipped=2.0 2023-06-22 02:51:01,425 INFO [train.py:996] (3/4) Epoch 5, batch 25600, loss[loss=0.2598, simple_loss=0.3284, pruned_loss=0.09565, over 21859.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.303, pruned_loss=0.07341, over 4254917.74 frames. 
], batch size: 371, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:51:28,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=885474.0, ans=0.125 2023-06-22 02:51:29,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=885474.0, ans=0.0 2023-06-22 02:51:59,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=885534.0, ans=0.125 2023-06-22 02:52:01,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=885534.0, ans=0.125 2023-06-22 02:52:03,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-22 02:52:37,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=885654.0, ans=0.125 2023-06-22 02:52:48,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=885654.0, ans=0.2 2023-06-22 02:53:01,542 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:53:11,095 INFO [train.py:996] (3/4) Epoch 5, batch 25650, loss[loss=0.2134, simple_loss=0.2775, pruned_loss=0.07463, over 21637.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3037, pruned_loss=0.07643, over 4252315.62 frames. ], batch size: 298, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:55:24,468 INFO [train.py:996] (3/4) Epoch 5, batch 25700, loss[loss=0.2048, simple_loss=0.2909, pruned_loss=0.05939, over 21634.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3005, pruned_loss=0.07806, over 4250595.69 frames. ], batch size: 263, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:55:40,581 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.593e+02 2.983e+02 3.503e+02 5.289e+02, threshold=5.966e+02, percent-clipped=0.0 2023-06-22 02:57:17,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=886314.0, ans=0.025 2023-06-22 02:57:25,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=886314.0, ans=0.0 2023-06-22 02:57:32,849 INFO [train.py:996] (3/4) Epoch 5, batch 25750, loss[loss=0.2544, simple_loss=0.3194, pruned_loss=0.0947, over 21198.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3066, pruned_loss=0.08069, over 4249276.90 frames. ], batch size: 143, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 02:59:13,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-22 02:59:39,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=886614.0, ans=0.025 2023-06-22 03:00:20,170 INFO [train.py:996] (3/4) Epoch 5, batch 25800, loss[loss=0.277, simple_loss=0.3478, pruned_loss=0.1031, over 21451.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3211, pruned_loss=0.0864, over 4252866.95 frames. 
], batch size: 194, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:00:21,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.816e+02 3.413e+02 4.279e+02 8.490e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-22 03:00:29,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=886674.0, ans=0.2 2023-06-22 03:00:43,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=886674.0, ans=0.125 2023-06-22 03:02:03,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=886854.0, ans=0.1 2023-06-22 03:02:24,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=886914.0, ans=0.0 2023-06-22 03:02:26,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=886914.0, ans=0.1 2023-06-22 03:02:42,722 INFO [train.py:996] (3/4) Epoch 5, batch 25850, loss[loss=0.2523, simple_loss=0.3189, pruned_loss=0.0929, over 21838.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3209, pruned_loss=0.08483, over 4250347.15 frames. ], batch size: 124, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:03:51,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=887094.0, ans=0.0 2023-06-22 03:03:52,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=887094.0, ans=0.125 2023-06-22 03:04:09,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=887154.0, ans=0.125 2023-06-22 03:04:58,311 INFO [train.py:996] (3/4) Epoch 5, batch 25900, loss[loss=0.2912, simple_loss=0.4001, pruned_loss=0.09116, over 20894.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3226, pruned_loss=0.08563, over 4259938.29 frames. ], batch size: 607, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:04:59,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.721e+02 3.071e+02 3.513e+02 5.338e+02, threshold=6.142e+02, percent-clipped=0.0 2023-06-22 03:05:32,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=887334.0, ans=0.0 2023-06-22 03:05:46,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=887334.0, ans=0.125 2023-06-22 03:06:50,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=887454.0, ans=0.2 2023-06-22 03:07:13,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=887514.0, ans=0.125 2023-06-22 03:07:18,015 INFO [train.py:996] (3/4) Epoch 5, batch 25950, loss[loss=0.2435, simple_loss=0.3209, pruned_loss=0.08308, over 21394.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3287, pruned_loss=0.08828, over 4260959.02 frames. 
], batch size: 194, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:07:42,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=887574.0, ans=0.0 2023-06-22 03:08:53,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=887754.0, ans=0.2 2023-06-22 03:09:40,573 INFO [train.py:996] (3/4) Epoch 5, batch 26000, loss[loss=0.2234, simple_loss=0.3107, pruned_loss=0.06809, over 21701.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3287, pruned_loss=0.08812, over 4266964.98 frames. ], batch size: 298, lr: 6.00e-03, grad_scale: 32.0 2023-06-22 03:09:42,002 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.564e+02 2.907e+02 3.379e+02 5.318e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-22 03:09:48,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=887874.0, ans=0.125 2023-06-22 03:11:43,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=888114.0, ans=0.0 2023-06-22 03:11:47,521 INFO [train.py:996] (3/4) Epoch 5, batch 26050, loss[loss=0.235, simple_loss=0.2967, pruned_loss=0.08669, over 21600.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3283, pruned_loss=0.08919, over 4277795.60 frames. ], batch size: 548, lr: 6.00e-03, grad_scale: 32.0 2023-06-22 03:12:27,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=888234.0, ans=0.125 2023-06-22 03:12:28,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=888234.0, ans=0.125 2023-06-22 03:13:04,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=888294.0, ans=0.2 2023-06-22 03:13:45,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-06-22 03:13:58,835 INFO [train.py:996] (3/4) Epoch 5, batch 26100, loss[loss=0.2227, simple_loss=0.284, pruned_loss=0.08069, over 21466.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3213, pruned_loss=0.08787, over 4278018.62 frames. ], batch size: 194, lr: 6.00e-03, grad_scale: 32.0 2023-06-22 03:14:00,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.634e+02 2.953e+02 3.214e+02 4.969e+02, threshold=5.905e+02, percent-clipped=0.0 2023-06-22 03:14:21,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-22 03:16:09,560 INFO [train.py:996] (3/4) Epoch 5, batch 26150, loss[loss=0.2472, simple_loss=0.3179, pruned_loss=0.08819, over 21625.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3198, pruned_loss=0.08755, over 4282488.43 frames. ], batch size: 230, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:17:09,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=888834.0, ans=0.0 2023-06-22 03:17:22,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.70 vs. 
limit=22.5 2023-06-22 03:17:30,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=888894.0, ans=0.125 2023-06-22 03:17:55,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=888954.0, ans=0.07 2023-06-22 03:17:58,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=888954.0, ans=0.0 2023-06-22 03:18:04,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=888954.0, ans=0.125 2023-06-22 03:18:09,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-22 03:18:41,270 INFO [train.py:996] (3/4) Epoch 5, batch 26200, loss[loss=0.2465, simple_loss=0.3469, pruned_loss=0.07301, over 21719.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3203, pruned_loss=0.0853, over 4283437.45 frames. ], batch size: 351, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:18:48,976 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.441e+02 2.915e+02 3.551e+02 5.779e+02, threshold=5.831e+02, percent-clipped=0.0 2023-06-22 03:19:41,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=889134.0, ans=0.125 2023-06-22 03:19:44,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-22 03:21:09,476 INFO [train.py:996] (3/4) Epoch 5, batch 26250, loss[loss=0.2201, simple_loss=0.292, pruned_loss=0.07413, over 21179.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3235, pruned_loss=0.08415, over 4278670.53 frames. ], batch size: 608, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:21:24,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=889374.0, ans=0.1 2023-06-22 03:21:48,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.90 vs. limit=5.0 2023-06-22 03:23:11,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=889614.0, ans=0.0 2023-06-22 03:23:19,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=889614.0, ans=0.025 2023-06-22 03:23:26,716 INFO [train.py:996] (3/4) Epoch 5, batch 26300, loss[loss=0.2369, simple_loss=0.3057, pruned_loss=0.08405, over 21290.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3192, pruned_loss=0.08408, over 4282401.11 frames. 
], batch size: 176, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:23:37,830 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.568e+02 2.838e+02 3.219e+02 5.714e+02, threshold=5.676e+02, percent-clipped=0.0 2023-06-22 03:23:50,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=889674.0, ans=0.2 2023-06-22 03:24:00,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=889674.0, ans=0.035 2023-06-22 03:24:09,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=889734.0, ans=0.125 2023-06-22 03:24:25,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=889734.0, ans=0.125 2023-06-22 03:24:27,044 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:25:53,849 INFO [train.py:996] (3/4) Epoch 5, batch 26350, loss[loss=0.2742, simple_loss=0.3449, pruned_loss=0.1017, over 21869.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3179, pruned_loss=0.08484, over 4286337.42 frames. ], batch size: 118, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:26:01,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=889974.0, ans=0.04949747468305833 2023-06-22 03:26:16,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=889974.0, ans=0.2 2023-06-22 03:26:17,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=889974.0, ans=0.125 2023-06-22 03:26:59,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=890094.0, ans=0.1 2023-06-22 03:27:19,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-22 03:27:53,344 INFO [train.py:996] (3/4) Epoch 5, batch 26400, loss[loss=0.2358, simple_loss=0.2846, pruned_loss=0.09351, over 21538.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.312, pruned_loss=0.0848, over 4283479.73 frames. ], batch size: 441, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:27:56,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.457e+02 2.775e+02 3.263e+02 6.072e+02, threshold=5.551e+02, percent-clipped=1.0 2023-06-22 03:28:03,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=890274.0, ans=0.025 2023-06-22 03:28:22,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=890274.0, ans=10.0 2023-06-22 03:29:01,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=890394.0, ans=0.125 2023-06-22 03:29:13,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=890454.0, ans=0.125 2023-06-22 03:29:34,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. 
limit=15.0 2023-06-22 03:29:42,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=890514.0, ans=0.0 2023-06-22 03:30:28,626 INFO [train.py:996] (3/4) Epoch 5, batch 26450, loss[loss=0.2816, simple_loss=0.3782, pruned_loss=0.09247, over 21651.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3125, pruned_loss=0.08513, over 4257400.81 frames. ], batch size: 389, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:32:37,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=890814.0, ans=0.0 2023-06-22 03:32:41,196 INFO [train.py:996] (3/4) Epoch 5, batch 26500, loss[loss=0.202, simple_loss=0.2708, pruned_loss=0.06658, over 21442.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3135, pruned_loss=0.08325, over 4260880.62 frames. ], batch size: 194, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:32:44,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 2.731e+02 3.364e+02 4.078e+02 7.843e+02, threshold=6.728e+02, percent-clipped=9.0 2023-06-22 03:32:44,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=890874.0, ans=0.015 2023-06-22 03:33:21,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=890934.0, ans=0.125 2023-06-22 03:34:13,969 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-22 03:35:11,484 INFO [train.py:996] (3/4) Epoch 5, batch 26550, loss[loss=0.2515, simple_loss=0.3528, pruned_loss=0.07511, over 19707.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3118, pruned_loss=0.08114, over 4260359.96 frames. ], batch size: 703, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:36:33,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=891294.0, ans=0.125 2023-06-22 03:36:50,553 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.11 vs. limit=10.0 2023-06-22 03:37:03,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=891354.0, ans=8.0 2023-06-22 03:37:43,559 INFO [train.py:996] (3/4) Epoch 5, batch 26600, loss[loss=0.1995, simple_loss=0.2774, pruned_loss=0.06082, over 21821.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.312, pruned_loss=0.07863, over 4256103.06 frames. ], batch size: 118, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:37:46,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 2.463e+02 2.685e+02 3.046e+02 4.735e+02, threshold=5.371e+02, percent-clipped=0.0 2023-06-22 03:37:49,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-22 03:38:21,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=891534.0, ans=0.0 2023-06-22 03:39:14,061 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:39:16,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. 
limit=15.0 2023-06-22 03:39:43,218 INFO [train.py:996] (3/4) Epoch 5, batch 26650, loss[loss=0.2046, simple_loss=0.265, pruned_loss=0.07207, over 21624.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.304, pruned_loss=0.07722, over 4257067.89 frames. ], batch size: 247, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:39:57,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=891774.0, ans=0.0 2023-06-22 03:39:59,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=891774.0, ans=0.0 2023-06-22 03:40:00,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=891774.0, ans=0.0 2023-06-22 03:40:32,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=891834.0, ans=0.1 2023-06-22 03:40:57,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=891954.0, ans=6.0 2023-06-22 03:41:10,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=891954.0, ans=0.0 2023-06-22 03:41:53,004 INFO [train.py:996] (3/4) Epoch 5, batch 26700, loss[loss=0.2057, simple_loss=0.2779, pruned_loss=0.06678, over 21923.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2963, pruned_loss=0.07364, over 4267129.68 frames. ], batch size: 333, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:42:06,440 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 2.008e+02 2.325e+02 2.767e+02 5.375e+02, threshold=4.650e+02, percent-clipped=1.0 2023-06-22 03:42:19,944 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-22 03:42:34,451 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:44:15,915 INFO [train.py:996] (3/4) Epoch 5, batch 26750, loss[loss=0.2393, simple_loss=0.324, pruned_loss=0.07728, over 21658.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2965, pruned_loss=0.07244, over 4275506.45 frames. ], batch size: 389, lr: 5.98e-03, grad_scale: 8.0 2023-06-22 03:44:16,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=892374.0, ans=0.0 2023-06-22 03:44:17,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=892374.0, ans=0.1 2023-06-22 03:44:48,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=892434.0, ans=0.0 2023-06-22 03:45:48,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=892554.0, ans=0.04949747468305833 2023-06-22 03:46:28,833 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:46:46,163 INFO [train.py:996] (3/4) Epoch 5, batch 26800, loss[loss=0.3387, simple_loss=0.3855, pruned_loss=0.1459, over 21438.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3049, pruned_loss=0.0771, over 4272925.27 frames. 
], batch size: 510, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:46:52,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.542e+02 2.962e+02 3.442e+02 4.612e+02, threshold=5.925e+02, percent-clipped=0.0 2023-06-22 03:47:14,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=892734.0, ans=0.025 2023-06-22 03:47:34,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.59 vs. limit=10.0 2023-06-22 03:47:42,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=892794.0, ans=0.125 2023-06-22 03:48:55,007 INFO [train.py:996] (3/4) Epoch 5, batch 26850, loss[loss=0.2127, simple_loss=0.274, pruned_loss=0.07571, over 21438.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3058, pruned_loss=0.07946, over 4271188.95 frames. ], batch size: 131, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:49:27,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=893034.0, ans=0.07 2023-06-22 03:49:27,929 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:50:30,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=893154.0, ans=0.125 2023-06-22 03:51:06,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=893214.0, ans=0.125 2023-06-22 03:51:10,811 INFO [train.py:996] (3/4) Epoch 5, batch 26900, loss[loss=0.1911, simple_loss=0.2497, pruned_loss=0.06628, over 21442.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2973, pruned_loss=0.0785, over 4267374.49 frames. ], batch size: 212, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:51:22,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.626e+02 2.882e+02 3.361e+02 7.434e+02, threshold=5.764e+02, percent-clipped=1.0 2023-06-22 03:51:40,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=893334.0, ans=0.125 2023-06-22 03:52:10,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=893394.0, ans=0.125 2023-06-22 03:52:32,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=893454.0, ans=0.125 2023-06-22 03:52:41,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=893454.0, ans=0.0 2023-06-22 03:53:11,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=893514.0, ans=0.0 2023-06-22 03:53:16,928 INFO [train.py:996] (3/4) Epoch 5, batch 26950, loss[loss=0.2469, simple_loss=0.3326, pruned_loss=0.08056, over 21676.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2968, pruned_loss=0.07853, over 4264540.53 frames. 
], batch size: 247, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:54:43,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=893754.0, ans=0.0 2023-06-22 03:55:38,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=893874.0, ans=0.2 2023-06-22 03:55:45,147 INFO [train.py:996] (3/4) Epoch 5, batch 27000, loss[loss=0.2074, simple_loss=0.299, pruned_loss=0.05791, over 21637.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2977, pruned_loss=0.07725, over 4259385.58 frames. ], batch size: 263, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:55:45,147 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 03:56:32,877 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2499, simple_loss=0.3437, pruned_loss=0.07804, over 1796401.00 frames. 2023-06-22 03:56:32,878 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-22 03:56:45,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.322e+02 2.675e+02 3.569e+02 6.901e+02, threshold=5.350e+02, percent-clipped=2.0 2023-06-22 03:57:33,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=893994.0, ans=0.125 2023-06-22 03:57:37,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=893994.0, ans=0.125 2023-06-22 03:57:41,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=894054.0, ans=0.1 2023-06-22 03:58:01,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=894054.0, ans=0.1 2023-06-22 03:58:30,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=894114.0, ans=0.2 2023-06-22 03:58:32,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=894114.0, ans=0.2 2023-06-22 03:58:47,686 INFO [train.py:996] (3/4) Epoch 5, batch 27050, loss[loss=0.2225, simple_loss=0.3047, pruned_loss=0.07014, over 21710.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2994, pruned_loss=0.07396, over 4263887.71 frames. ], batch size: 247, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:58:48,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=894174.0, ans=0.0 2023-06-22 03:59:45,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=894294.0, ans=0.1 2023-06-22 04:00:32,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=894414.0, ans=0.0 2023-06-22 04:00:48,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=894414.0, ans=0.2 2023-06-22 04:00:52,705 INFO [train.py:996] (3/4) Epoch 5, batch 27100, loss[loss=0.2178, simple_loss=0.2839, pruned_loss=0.07584, over 21729.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3023, pruned_loss=0.07492, over 4272172.23 frames. 
], batch size: 264, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 04:00:58,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 2.373e+02 2.647e+02 3.210e+02 5.471e+02, threshold=5.294e+02, percent-clipped=1.0 2023-06-22 04:01:50,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=894534.0, ans=0.1 2023-06-22 04:02:06,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=894594.0, ans=0.0 2023-06-22 04:02:55,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=894714.0, ans=0.125 2023-06-22 04:03:12,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=894714.0, ans=0.125 2023-06-22 04:03:16,616 INFO [train.py:996] (3/4) Epoch 5, batch 27150, loss[loss=0.2598, simple_loss=0.3437, pruned_loss=0.08792, over 21695.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3131, pruned_loss=0.07821, over 4272933.00 frames. ], batch size: 247, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 04:03:31,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2023-06-22 04:04:02,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=894834.0, ans=0.125 2023-06-22 04:04:27,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=894894.0, ans=0.125 2023-06-22 04:05:39,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-22 04:05:41,569 INFO [train.py:996] (3/4) Epoch 5, batch 27200, loss[loss=0.2783, simple_loss=0.3538, pruned_loss=0.1014, over 21589.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3209, pruned_loss=0.08146, over 4272876.18 frames. ], batch size: 389, lr: 5.98e-03, grad_scale: 32.0 2023-06-22 04:05:47,605 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.848e+02 3.372e+02 3.981e+02 6.685e+02, threshold=6.744e+02, percent-clipped=3.0 2023-06-22 04:05:59,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=895074.0, ans=0.0 2023-06-22 04:06:12,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=895134.0, ans=0.125 2023-06-22 04:07:06,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=895254.0, ans=0.125 2023-06-22 04:07:13,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=895254.0, ans=0.0 2023-06-22 04:07:31,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=895314.0, ans=0.125 2023-06-22 04:07:58,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=895314.0, ans=0.0 2023-06-22 04:08:00,932 INFO [train.py:996] (3/4) Epoch 5, batch 27250, loss[loss=0.2836, simple_loss=0.3506, pruned_loss=0.1084, over 21331.00 frames. 
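
In the optim.py lines, the five grad-norm numbers read as quantiles (min, quartiles, max), and the threshold matches Clipping_scale (2.0) times the middle value up to rounding (here 2 x 3.372e+02 = 6.744e+02); percent-clipped is the share of recent batches whose norm exceeded the threshold. A plausible mechanism is sketched below, assuming the median is tracked over a window of recent gradient norms; the class name and window size are illustrative, not the actual optim.py implementation.

    from collections import deque
    import torch

    class MedianGradClipper:
        """Clip the global grad norm to clipping_scale x the median of
        recently observed grad norms (a sketch, under assumptions)."""

        def __init__(self, clipping_scale=2.0, window=128):
            self.scale = clipping_scale
            self.norms = deque(maxlen=window)

        def clip_(self, parameters):
            params = [p for p in parameters if p.grad is not None]
            # overall L2 norm = norm of the vector of per-parameter norms
            norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
            self.norms.append(norm.item())
            threshold = self.scale * sorted(self.norms)[len(self.norms) // 2]
            if norm > threshold:
                for p in params:
                    p.grad.mul_(threshold / norm)
            return norm.item(), threshold
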
], tot_loss[loss=0.2491, simple_loss=0.3257, pruned_loss=0.08624, over 4275006.77 frames. ], batch size: 176, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:08:30,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=895374.0, ans=0.0 2023-06-22 04:08:32,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=895434.0, ans=0.1 2023-06-22 04:10:32,063 INFO [train.py:996] (3/4) Epoch 5, batch 27300, loss[loss=0.272, simple_loss=0.3489, pruned_loss=0.09756, over 21570.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3269, pruned_loss=0.08729, over 4266153.19 frames. ], batch size: 131, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:10:52,686 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.750e+02 3.166e+02 3.741e+02 6.824e+02, threshold=6.331e+02, percent-clipped=1.0 2023-06-22 04:11:11,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-22 04:13:20,060 INFO [train.py:996] (3/4) Epoch 5, batch 27350, loss[loss=0.247, simple_loss=0.3277, pruned_loss=0.08315, over 21856.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3302, pruned_loss=0.08797, over 4260759.84 frames. ], batch size: 124, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:13:29,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=895974.0, ans=0.04949747468305833 2023-06-22 04:14:51,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=896154.0, ans=0.125 2023-06-22 04:15:08,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=896214.0, ans=0.125 2023-06-22 04:15:27,416 INFO [train.py:996] (3/4) Epoch 5, batch 27400, loss[loss=0.2123, simple_loss=0.277, pruned_loss=0.07375, over 21658.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3245, pruned_loss=0.08665, over 4268843.75 frames. ], batch size: 247, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:15:28,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.35 vs. limit=8.0 2023-06-22 04:15:45,253 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.543e+02 2.846e+02 3.223e+02 4.913e+02, threshold=5.692e+02, percent-clipped=0.0 2023-06-22 04:16:12,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-22 04:16:35,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=896394.0, ans=0.2 2023-06-22 04:17:49,328 INFO [train.py:996] (3/4) Epoch 5, batch 27450, loss[loss=0.2359, simple_loss=0.3249, pruned_loss=0.07341, over 21591.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3184, pruned_loss=0.08511, over 4272021.04 frames. 
], batch size: 389, lr: 5.97e-03, grad_scale: 16.0 2023-06-22 04:19:23,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=896754.0, ans=0.2 2023-06-22 04:19:39,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=896814.0, ans=0.1 2023-06-22 04:20:02,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=896814.0, ans=0.125 2023-06-22 04:20:11,251 INFO [train.py:996] (3/4) Epoch 5, batch 27500, loss[loss=0.2418, simple_loss=0.3152, pruned_loss=0.08425, over 21238.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.317, pruned_loss=0.08555, over 4279294.50 frames. ], batch size: 143, lr: 5.97e-03, grad_scale: 16.0 2023-06-22 04:20:18,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.586e+02 3.048e+02 3.349e+02 5.042e+02, threshold=6.096e+02, percent-clipped=0.0 2023-06-22 04:20:23,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=896874.0, ans=0.1 2023-06-22 04:20:25,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-22 04:21:37,487 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-22 04:21:52,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=897114.0, ans=0.125 2023-06-22 04:21:55,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=897114.0, ans=0.09899494936611666 2023-06-22 04:22:18,496 INFO [train.py:996] (3/4) Epoch 5, batch 27550, loss[loss=0.1885, simple_loss=0.2602, pruned_loss=0.05838, over 21637.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.312, pruned_loss=0.08218, over 4280124.41 frames. ], batch size: 247, lr: 5.97e-03, grad_scale: 16.0 2023-06-22 04:22:20,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=897174.0, ans=0.1 2023-06-22 04:22:27,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=897174.0, ans=0.1 2023-06-22 04:23:40,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=897294.0, ans=0.0 2023-06-22 04:23:42,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=897294.0, ans=0.0 2023-06-22 04:23:43,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=897294.0, ans=0.2 2023-06-22 04:24:22,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=897414.0, ans=0.2 2023-06-22 04:24:27,636 INFO [train.py:996] (3/4) Epoch 5, batch 27600, loss[loss=0.2286, simple_loss=0.2914, pruned_loss=0.08291, over 22016.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3056, pruned_loss=0.08096, over 4277305.17 frames. 
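
The scaling.py:182 lines trace module constants (dropout p, skip rates, balancer probabilities, bypass scale_min) whose current value ans depends on the global batch_count. A piecewise-linear schedule over batch_count reproduces that behaviour; the class below is a sketch, and its name and interface are not the actual scaling.py API.

    # Illustrative piecewise-linear schedule over the global batch count,
    # in the spirit of the ScheduledFloat values logged above.
    class PiecewiseLinearSchedule:
        def __init__(self, *points):
            # points: (batch_count, value) pairs, sorted by batch_count
            self.points = list(points)

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    # e.g. a dropout rate annealed from 0.3 to 0.1 over the first 20k batches:
    dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
    assert abs(dropout_p(10000.0) - 0.2) < 1e-9
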
], batch size: 103, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:24:46,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.414e+02 2.661e+02 3.209e+02 4.551e+02, threshold=5.321e+02, percent-clipped=0.0 2023-06-22 04:25:44,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.85 vs. limit=15.0 2023-06-22 04:26:15,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.44 vs. limit=15.0 2023-06-22 04:26:34,746 INFO [train.py:996] (3/4) Epoch 5, batch 27650, loss[loss=0.2365, simple_loss=0.3094, pruned_loss=0.08181, over 16656.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3005, pruned_loss=0.08057, over 4261850.05 frames. ], batch size: 62, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:28:50,436 INFO [train.py:996] (3/4) Epoch 5, batch 27700, loss[loss=0.2492, simple_loss=0.3424, pruned_loss=0.07801, over 21748.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2999, pruned_loss=0.07882, over 4258927.87 frames. ], batch size: 332, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:29:15,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.480e+02 2.768e+02 3.290e+02 5.245e+02, threshold=5.535e+02, percent-clipped=0.0 2023-06-22 04:30:39,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.14 vs. limit=22.5 2023-06-22 04:31:04,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.28 vs. limit=22.5 2023-06-22 04:31:17,212 INFO [train.py:996] (3/4) Epoch 5, batch 27750, loss[loss=0.2345, simple_loss=0.3097, pruned_loss=0.07968, over 21802.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3023, pruned_loss=0.07812, over 4264340.28 frames. ], batch size: 414, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:31:43,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-22 04:32:15,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=898494.0, ans=10.0 2023-06-22 04:33:11,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=898554.0, ans=0.125 2023-06-22 04:33:11,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=898554.0, ans=0.0 2023-06-22 04:33:32,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=898674.0, ans=0.125 2023-06-22 04:33:33,551 INFO [train.py:996] (3/4) Epoch 5, batch 27800, loss[loss=0.2402, simple_loss=0.3125, pruned_loss=0.08399, over 21886.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3025, pruned_loss=0.07854, over 4277213.31 frames. ], batch size: 118, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:33:40,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.58 vs. 
limit=12.0 2023-06-22 04:33:41,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.543e+02 3.042e+02 3.729e+02 6.528e+02, threshold=6.084e+02, percent-clipped=2.0 2023-06-22 04:35:43,848 INFO [train.py:996] (3/4) Epoch 5, batch 27850, loss[loss=0.2095, simple_loss=0.2682, pruned_loss=0.07544, over 21200.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3022, pruned_loss=0.08, over 4288428.38 frames. ], batch size: 608, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:35:47,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=898974.0, ans=0.0 2023-06-22 04:35:57,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=898974.0, ans=0.125 2023-06-22 04:35:57,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=898974.0, ans=0.125 2023-06-22 04:36:35,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-22 04:37:06,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=899094.0, ans=0.1 2023-06-22 04:37:14,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-22 04:37:34,272 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:38:21,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-22 04:38:27,010 INFO [train.py:996] (3/4) Epoch 5, batch 27900, loss[loss=0.2167, simple_loss=0.3043, pruned_loss=0.06456, over 21657.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3112, pruned_loss=0.08096, over 4292196.89 frames. ], batch size: 263, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:38:45,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-22 04:38:46,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.475e+02 2.737e+02 3.192e+02 5.724e+02, threshold=5.474e+02, percent-clipped=0.0 2023-06-22 04:38:48,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=899274.0, ans=0.2 2023-06-22 04:39:59,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=899454.0, ans=0.125 2023-06-22 04:40:13,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=899454.0, ans=0.0 2023-06-22 04:40:14,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.98 vs. limit=15.0 2023-06-22 04:40:19,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=899514.0, ans=0.125 2023-06-22 04:40:40,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. 
limit=6.0 2023-06-22 04:40:51,509 INFO [train.py:996] (3/4) Epoch 5, batch 27950, loss[loss=0.2696, simple_loss=0.3549, pruned_loss=0.09217, over 21717.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3113, pruned_loss=0.07769, over 4286908.21 frames. ], batch size: 441, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:41:58,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=899694.0, ans=0.125 2023-06-22 04:43:06,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=899874.0, ans=0.125 2023-06-22 04:43:07,333 INFO [train.py:996] (3/4) Epoch 5, batch 28000, loss[loss=0.2308, simple_loss=0.2955, pruned_loss=0.08302, over 21422.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3085, pruned_loss=0.07504, over 4289116.93 frames. ], batch size: 144, lr: 5.96e-03, grad_scale: 32.0 2023-06-22 04:43:29,727 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.18 vs. limit=5.0 2023-06-22 04:43:31,381 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.728e+02 2.277e+02 2.720e+02 3.265e+02 5.503e+02, threshold=5.441e+02, percent-clipped=1.0 2023-06-22 04:44:41,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=900054.0, ans=0.125 2023-06-22 04:45:36,022 INFO [train.py:996] (3/4) Epoch 5, batch 28050, loss[loss=0.2501, simple_loss=0.3238, pruned_loss=0.08819, over 21730.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3077, pruned_loss=0.07746, over 4293911.34 frames. ], batch size: 441, lr: 5.96e-03, grad_scale: 32.0 2023-06-22 04:45:37,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=900174.0, ans=0.125 2023-06-22 04:45:49,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=900174.0, ans=0.125 2023-06-22 04:47:07,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=900294.0, ans=0.125 2023-06-22 04:47:19,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=900354.0, ans=0.125 2023-06-22 04:47:29,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=900414.0, ans=0.1 2023-06-22 04:47:39,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=900414.0, ans=0.125 2023-06-22 04:47:57,166 INFO [train.py:996] (3/4) Epoch 5, batch 28100, loss[loss=0.2266, simple_loss=0.2962, pruned_loss=0.07851, over 21938.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.305, pruned_loss=0.07759, over 4293336.70 frames. ], batch size: 103, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:47:59,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.98 vs. 
limit=15.0 2023-06-22 04:48:05,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=900474.0, ans=0.125 2023-06-22 04:48:16,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.684e+02 3.220e+02 3.866e+02 7.727e+02, threshold=6.440e+02, percent-clipped=6.0 2023-06-22 04:49:01,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=900594.0, ans=0.125 2023-06-22 04:50:04,223 INFO [train.py:996] (3/4) Epoch 5, batch 28150, loss[loss=0.2245, simple_loss=0.2825, pruned_loss=0.08327, over 14824.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2976, pruned_loss=0.0768, over 4276742.26 frames. ], batch size: 62, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:50:33,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=900834.0, ans=0.0 2023-06-22 04:50:58,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=900834.0, ans=15.0 2023-06-22 04:51:25,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=900954.0, ans=0.1 2023-06-22 04:52:25,418 INFO [train.py:996] (3/4) Epoch 5, batch 28200, loss[loss=0.2514, simple_loss=0.3114, pruned_loss=0.09573, over 21692.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2956, pruned_loss=0.07783, over 4277200.25 frames. ], batch size: 351, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:52:35,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.760e+02 3.168e+02 3.960e+02 6.976e+02, threshold=6.335e+02, percent-clipped=2.0 2023-06-22 04:53:16,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.55 vs. limit=22.5 2023-06-22 04:53:57,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=901254.0, ans=0.0 2023-06-22 04:54:10,246 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:54:34,810 INFO [train.py:996] (3/4) Epoch 5, batch 28250, loss[loss=0.2278, simple_loss=0.3128, pruned_loss=0.07138, over 16164.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2994, pruned_loss=0.08012, over 4264213.31 frames. ], batch size: 60, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 04:54:56,557 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:55:39,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=901434.0, ans=0.0 2023-06-22 04:55:47,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=901494.0, ans=0.125 2023-06-22 04:56:40,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.36 vs. limit=15.0 2023-06-22 04:56:53,967 INFO [train.py:996] (3/4) Epoch 5, batch 28300, loss[loss=0.1791, simple_loss=0.2687, pruned_loss=0.04477, over 21759.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2966, pruned_loss=0.07734, over 4257109.58 frames. 
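
The Whitening lines compare a per-module metric against a limit, and the logged metrics are always at least 1. One plausible definition, stated as an assumption rather than the scaling.py formula, is mean(lambda^2) / mean(lambda)^2 over the eigenvalues of the channel covariance: it equals 1.0 exactly when the covariance is a multiple of the identity (fully white) and grows as the eigenvalues spread. It needs no eigendecomposition, since sum(lambda) = tr(C) and sum(lambda^2) = ||C||_F^2.

    import torch

    # Hedged sketch of a whitening metric like the "metric=... vs. limit=..."
    # values above; the exact definition in scaling.py may differ.
    def whitening_metric(x: torch.Tensor) -> float:
        # x: (num_frames, num_channels) activations
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]              # (C, C) channel covariance
        c = cov.shape[0]
        mean_eig = torch.diagonal(cov).sum() / c  # E[lambda] = tr(C) / C
        mean_eig_sq = (cov * cov).sum() / c       # E[lambda^2] = ||C||_F^2 / C
        return (mean_eig_sq / (mean_eig ** 2)).item()
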
], batch size: 351, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 04:57:26,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.516e+02 2.928e+02 3.510e+02 5.631e+02, threshold=5.856e+02, percent-clipped=0.0 2023-06-22 04:57:44,556 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=6.001e-03 2023-06-22 04:57:45,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=901794.0, ans=0.0 2023-06-22 04:58:18,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=901794.0, ans=0.0 2023-06-22 04:58:20,389 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-22 04:59:03,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=901914.0, ans=0.125 2023-06-22 04:59:26,094 INFO [train.py:996] (3/4) Epoch 5, batch 28350, loss[loss=0.1995, simple_loss=0.2664, pruned_loss=0.06635, over 21283.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2926, pruned_loss=0.07252, over 4254939.63 frames. ], batch size: 160, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:00:08,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=902034.0, ans=0.09899494936611666 2023-06-22 05:00:49,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=902154.0, ans=0.125 2023-06-22 05:00:54,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-06-22 05:01:31,921 INFO [train.py:996] (3/4) Epoch 5, batch 28400, loss[loss=0.2557, simple_loss=0.3237, pruned_loss=0.09386, over 21711.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2897, pruned_loss=0.07255, over 4260812.22 frames. ], batch size: 351, lr: 5.95e-03, grad_scale: 32.0 2023-06-22 05:02:02,170 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.345e+02 2.616e+02 3.191e+02 6.472e+02, threshold=5.233e+02, percent-clipped=2.0 2023-06-22 05:02:56,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=902394.0, ans=0.125 2023-06-22 05:03:03,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=902454.0, ans=0.2 2023-06-22 05:03:54,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=902514.0, ans=0.1 2023-06-22 05:03:56,948 INFO [train.py:996] (3/4) Epoch 5, batch 28450, loss[loss=0.2636, simple_loss=0.3269, pruned_loss=0.1002, over 21771.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2969, pruned_loss=0.07716, over 4268619.37 frames. 
], batch size: 441, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:04:11,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=902574.0, ans=0.04949747468305833 2023-06-22 05:05:20,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=902754.0, ans=0.0 2023-06-22 05:05:56,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=902814.0, ans=0.1 2023-06-22 05:05:56,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=902814.0, ans=0.125 2023-06-22 05:05:58,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=902814.0, ans=0.0 2023-06-22 05:06:04,439 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-06-22 05:06:25,824 INFO [train.py:996] (3/4) Epoch 5, batch 28500, loss[loss=0.2549, simple_loss=0.3212, pruned_loss=0.09435, over 21554.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2989, pruned_loss=0.07915, over 4277634.61 frames. ], batch size: 230, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:06:38,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.555e+02 2.878e+02 3.269e+02 4.287e+02, threshold=5.756e+02, percent-clipped=0.0 2023-06-22 05:06:40,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=902934.0, ans=0.125 2023-06-22 05:07:01,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=902934.0, ans=0.0 2023-06-22 05:07:05,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=902934.0, ans=0.125 2023-06-22 05:07:07,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=902994.0, ans=0.1 2023-06-22 05:07:27,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-22 05:08:46,877 INFO [train.py:996] (3/4) Epoch 5, batch 28550, loss[loss=0.1998, simple_loss=0.2823, pruned_loss=0.05867, over 21864.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3065, pruned_loss=0.08157, over 4282846.55 frames. 
], batch size: 98, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:09:09,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=903174.0, ans=0.0 2023-06-22 05:09:13,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=903234.0, ans=0.2 2023-06-22 05:09:34,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=903294.0, ans=0.125 2023-06-22 05:09:35,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=903294.0, ans=0.0 2023-06-22 05:10:36,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=903414.0, ans=0.125 2023-06-22 05:11:12,590 INFO [train.py:996] (3/4) Epoch 5, batch 28600, loss[loss=0.2581, simple_loss=0.3339, pruned_loss=0.09109, over 21667.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3128, pruned_loss=0.08331, over 4278069.33 frames. ], batch size: 351, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:11:22,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=903474.0, ans=0.125 2023-06-22 05:11:24,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.734e+02 3.060e+02 3.584e+02 6.352e+02, threshold=6.121e+02, percent-clipped=1.0 2023-06-22 05:12:07,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=903594.0, ans=10.0 2023-06-22 05:12:42,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=903654.0, ans=0.0 2023-06-22 05:13:22,531 INFO [train.py:996] (3/4) Epoch 5, batch 28650, loss[loss=0.1958, simple_loss=0.2589, pruned_loss=0.06639, over 21760.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.307, pruned_loss=0.08277, over 4269698.62 frames. ], batch size: 317, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:13:30,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=903774.0, ans=0.05 2023-06-22 05:15:43,977 INFO [train.py:996] (3/4) Epoch 5, batch 28700, loss[loss=0.2314, simple_loss=0.3, pruned_loss=0.08138, over 21700.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3047, pruned_loss=0.08326, over 4264778.51 frames. ], batch size: 298, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:16:01,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.516e+02 2.747e+02 3.102e+02 4.785e+02, threshold=5.493e+02, percent-clipped=0.0 2023-06-22 05:16:56,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=904254.0, ans=0.0 2023-06-22 05:17:16,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=904254.0, ans=0.025 2023-06-22 05:17:17,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-22 05:18:03,003 INFO [train.py:996] (3/4) Epoch 5, batch 28750, loss[loss=0.2103, simple_loss=0.2961, pruned_loss=0.06224, over 21751.00 frames. 
], tot_loss[loss=0.2366, simple_loss=0.3051, pruned_loss=0.0841, over 4272704.84 frames. ], batch size: 247, lr: 5.94e-03, grad_scale: 16.0 2023-06-22 05:18:36,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=904434.0, ans=0.04949747468305833 2023-06-22 05:19:48,887 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:19:58,286 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=22.5 2023-06-22 05:20:32,365 INFO [train.py:996] (3/4) Epoch 5, batch 28800, loss[loss=0.3194, simple_loss=0.3803, pruned_loss=0.1292, over 21205.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3089, pruned_loss=0.08428, over 4280147.27 frames. ], batch size: 143, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:20:50,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.736e+02 3.061e+02 3.498e+02 5.651e+02, threshold=6.121e+02, percent-clipped=1.0 2023-06-22 05:21:49,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-22 05:23:02,430 INFO [train.py:996] (3/4) Epoch 5, batch 28850, loss[loss=0.2276, simple_loss=0.2978, pruned_loss=0.07875, over 21838.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3103, pruned_loss=0.08563, over 4281813.05 frames. ], batch size: 124, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:23:14,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=904974.0, ans=0.0 2023-06-22 05:24:46,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=905154.0, ans=0.1 2023-06-22 05:24:53,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=905154.0, ans=0.125 2023-06-22 05:25:42,883 INFO [train.py:996] (3/4) Epoch 5, batch 28900, loss[loss=0.2245, simple_loss=0.296, pruned_loss=0.07651, over 21787.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3135, pruned_loss=0.08724, over 4284783.76 frames. ], batch size: 247, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:25:55,353 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.574e+02 2.990e+02 3.468e+02 6.193e+02, threshold=5.980e+02, percent-clipped=0.0 2023-06-22 05:26:07,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=905334.0, ans=0.125 2023-06-22 05:26:17,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=905334.0, ans=0.0 2023-06-22 05:26:31,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=905394.0, ans=0.125 2023-06-22 05:27:36,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=905454.0, ans=0.0 2023-06-22 05:27:43,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=905514.0, ans=0.125 2023-06-22 05:27:54,369 INFO [train.py:996] (3/4) Epoch 5, batch 28950, loss[loss=0.2635, simple_loss=0.3487, pruned_loss=0.08921, over 21567.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3133, pruned_loss=0.08633, over 4276990.81 frames. 
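
The grad_scale field steps between 16.0 and 32.0 in this stretch (it doubles at batch 28800 above), which is the standard dynamic loss-scaling pattern for fp16 training: halve the scale when scaled gradients overflow, grow it back after a run of clean steps. A minimal sketch with the stock torch.cuda.amp tools follows; the actual loop likely wraps a customized scaler, so this only shows the mechanism.

    import torch

    scaler = torch.cuda.amp.GradScaler()    # dynamic loss scaling for fp16

    def fp16_step(model, optimizer, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():     # forward in mixed precision
            loss = model(batch)             # assumes the model returns a loss
        scaler.scale(loss).backward()       # backward on the scaled loss
        scaler.step(optimizer)              # skips the update on inf/nan grads
        scaler.update()                     # halve on overflow, else slowly grow
        return loss.detach(), scaler.get_scale()
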
], batch size: 471, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:28:12,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=905574.0, ans=0.07 2023-06-22 05:29:09,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-22 05:29:39,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=905754.0, ans=0.125 2023-06-22 05:30:19,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=905814.0, ans=0.125 2023-06-22 05:30:22,095 INFO [train.py:996] (3/4) Epoch 5, batch 29000, loss[loss=0.2509, simple_loss=0.3227, pruned_loss=0.0896, over 21792.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3169, pruned_loss=0.08577, over 4274423.87 frames. ], batch size: 247, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:30:40,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=905874.0, ans=0.125 2023-06-22 05:30:46,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.550e+02 2.882e+02 3.345e+02 5.949e+02, threshold=5.765e+02, percent-clipped=1.0 2023-06-22 05:30:54,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=905934.0, ans=0.2 2023-06-22 05:31:17,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=905934.0, ans=0.2 2023-06-22 05:31:29,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=905994.0, ans=0.1 2023-06-22 05:31:34,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-22 05:32:07,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=906054.0, ans=0.125 2023-06-22 05:32:44,715 INFO [train.py:996] (3/4) Epoch 5, batch 29050, loss[loss=0.2329, simple_loss=0.2912, pruned_loss=0.08726, over 21589.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3151, pruned_loss=0.08639, over 4281078.98 frames. ], batch size: 548, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:33:33,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=906234.0, ans=0.125 2023-06-22 05:34:27,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-22 05:34:59,591 INFO [train.py:996] (3/4) Epoch 5, batch 29100, loss[loss=0.2299, simple_loss=0.3388, pruned_loss=0.06043, over 19983.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3063, pruned_loss=0.08392, over 4287790.52 frames. 
], batch size: 702, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:35:37,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.541e+02 2.908e+02 3.289e+02 5.672e+02, threshold=5.815e+02, percent-clipped=0.0 2023-06-22 05:35:54,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=906534.0, ans=0.035 2023-06-22 05:36:51,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=906714.0, ans=0.125 2023-06-22 05:37:19,649 INFO [train.py:996] (3/4) Epoch 5, batch 29150, loss[loss=0.239, simple_loss=0.3027, pruned_loss=0.08763, over 20065.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3039, pruned_loss=0.08125, over 4290027.95 frames. ], batch size: 707, lr: 5.94e-03, grad_scale: 16.0 2023-06-22 05:37:24,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.42 vs. limit=6.0 2023-06-22 05:37:42,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=906774.0, ans=0.0 2023-06-22 05:37:53,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-22 05:38:09,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=906834.0, ans=0.125 2023-06-22 05:38:18,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-22 05:39:10,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=907014.0, ans=0.125 2023-06-22 05:39:23,824 INFO [train.py:996] (3/4) Epoch 5, batch 29200, loss[loss=0.1935, simple_loss=0.2559, pruned_loss=0.06554, over 21549.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3005, pruned_loss=0.08089, over 4285444.07 frames. ], batch size: 263, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:40:04,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.662e+02 3.109e+02 4.025e+02 6.896e+02, threshold=6.219e+02, percent-clipped=4.0 2023-06-22 05:40:21,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-22 05:41:49,350 INFO [train.py:996] (3/4) Epoch 5, batch 29250, loss[loss=0.2107, simple_loss=0.2923, pruned_loss=0.06453, over 21602.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2998, pruned_loss=0.07922, over 4284616.94 frames. 
], batch size: 230, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:42:03,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=907374.0, ans=0.0 2023-06-22 05:42:28,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=907434.0, ans=0.0 2023-06-22 05:42:36,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=907494.0, ans=0.125 2023-06-22 05:43:14,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=907554.0, ans=0.125 2023-06-22 05:43:15,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=907554.0, ans=0.0 2023-06-22 05:43:26,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=907554.0, ans=0.1 2023-06-22 05:44:03,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=907614.0, ans=0.125 2023-06-22 05:44:06,197 INFO [train.py:996] (3/4) Epoch 5, batch 29300, loss[loss=0.2138, simple_loss=0.279, pruned_loss=0.07434, over 21694.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3011, pruned_loss=0.07808, over 4280925.08 frames. ], batch size: 282, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:44:30,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=907674.0, ans=0.1 2023-06-22 05:44:37,885 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.436e+02 2.713e+02 3.173e+02 4.858e+02, threshold=5.427e+02, percent-clipped=0.0 2023-06-22 05:44:58,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-22 05:45:11,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=907854.0, ans=0.125 2023-06-22 05:46:04,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=907914.0, ans=0.2 2023-06-22 05:46:07,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=907914.0, ans=0.09899494936611666 2023-06-22 05:46:13,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=907974.0, ans=0.0 2023-06-22 05:46:14,515 INFO [train.py:996] (3/4) Epoch 5, batch 29350, loss[loss=0.2205, simple_loss=0.3139, pruned_loss=0.06352, over 21825.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2979, pruned_loss=0.07726, over 4279605.62 frames. 
], batch size: 317, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:46:58,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=908034.0, ans=0.1 2023-06-22 05:47:03,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=908034.0, ans=0.1 2023-06-22 05:47:29,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=908094.0, ans=0.125 2023-06-22 05:48:47,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=908274.0, ans=0.125 2023-06-22 05:48:48,310 INFO [train.py:996] (3/4) Epoch 5, batch 29400, loss[loss=0.1707, simple_loss=0.2367, pruned_loss=0.05233, over 21529.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2965, pruned_loss=0.07503, over 4269890.39 frames. ], batch size: 195, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:49:08,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.428e+02 2.742e+02 3.233e+02 5.582e+02, threshold=5.484e+02, percent-clipped=1.0 2023-06-22 05:49:16,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=908334.0, ans=0.125 2023-06-22 05:50:30,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=908454.0, ans=0.125 2023-06-22 05:50:34,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=908514.0, ans=0.0 2023-06-22 05:51:06,629 INFO [train.py:996] (3/4) Epoch 5, batch 29450, loss[loss=0.3028, simple_loss=0.362, pruned_loss=0.1218, over 21350.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2963, pruned_loss=0.07508, over 4273327.58 frames. ], batch size: 507, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:52:13,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=908694.0, ans=0.0 2023-06-22 05:52:36,449 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:52:39,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=908754.0, ans=0.125 2023-06-22 05:53:20,894 INFO [train.py:996] (3/4) Epoch 5, batch 29500, loss[loss=0.2332, simple_loss=0.3011, pruned_loss=0.08265, over 21483.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.301, pruned_loss=0.07802, over 4271704.56 frames. ], batch size: 131, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:53:25,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=908874.0, ans=0.125 2023-06-22 05:53:30,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. 
limit=15.0 2023-06-22 05:53:33,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.755e+02 3.048e+02 3.720e+02 7.452e+02, threshold=6.096e+02, percent-clipped=3.0 2023-06-22 05:54:04,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=908934.0, ans=0.025 2023-06-22 05:54:11,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=908994.0, ans=0.0 2023-06-22 05:54:33,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=908994.0, ans=0.2 2023-06-22 05:55:12,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=909114.0, ans=0.0 2023-06-22 05:55:39,418 INFO [train.py:996] (3/4) Epoch 5, batch 29550, loss[loss=0.2058, simple_loss=0.274, pruned_loss=0.06876, over 21619.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2995, pruned_loss=0.07904, over 4280506.55 frames. ], batch size: 212, lr: 5.93e-03, grad_scale: 16.0 2023-06-22 05:55:44,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=909174.0, ans=0.0 2023-06-22 05:56:41,648 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-22 05:56:49,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=909354.0, ans=6.0 2023-06-22 05:57:03,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=909354.0, ans=0.125 2023-06-22 05:58:01,461 INFO [train.py:996] (3/4) Epoch 5, batch 29600, loss[loss=0.2722, simple_loss=0.3763, pruned_loss=0.08404, over 20805.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3054, pruned_loss=0.08111, over 4276878.20 frames. ], batch size: 608, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:58:02,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=909474.0, ans=0.125 2023-06-22 05:58:29,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.694e+02 3.024e+02 3.458e+02 5.010e+02, threshold=6.047e+02, percent-clipped=0.0 2023-06-22 05:59:58,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=909714.0, ans=0.125 2023-06-22 05:59:58,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=909714.0, ans=0.1 2023-06-22 06:00:19,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-22 06:00:19,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-22 06:00:26,808 INFO [train.py:996] (3/4) Epoch 5, batch 29650, loss[loss=0.1996, simple_loss=0.2641, pruned_loss=0.06761, over 21567.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3022, pruned_loss=0.07728, over 4283074.03 frames. 
], batch size: 195, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 06:00:45,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=909834.0, ans=0.125 2023-06-22 06:00:55,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=909834.0, ans=0.1 2023-06-22 06:01:29,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=909894.0, ans=0.125 2023-06-22 06:01:42,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=909894.0, ans=0.05 2023-06-22 06:01:45,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.08 vs. limit=10.0 2023-06-22 06:01:48,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=909954.0, ans=0.5 2023-06-22 06:02:10,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-22 06:02:27,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=22.5 2023-06-22 06:02:32,779 INFO [train.py:996] (3/4) Epoch 5, batch 29700, loss[loss=0.246, simple_loss=0.3645, pruned_loss=0.06373, over 20913.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3057, pruned_loss=0.07861, over 4290476.50 frames. ], batch size: 607, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 06:03:02,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.342e+02 2.589e+02 3.104e+02 5.027e+02, threshold=5.177e+02, percent-clipped=0.0 2023-06-22 06:03:07,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=910134.0, ans=0.1 2023-06-22 06:03:45,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=910194.0, ans=0.2 2023-06-22 06:04:20,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=910254.0, ans=0.125 2023-06-22 06:04:37,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=910314.0, ans=0.0 2023-06-22 06:04:40,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=910314.0, ans=0.0 2023-06-22 06:04:48,306 INFO [train.py:996] (3/4) Epoch 5, batch 29750, loss[loss=0.2163, simple_loss=0.3055, pruned_loss=0.06351, over 21635.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3108, pruned_loss=0.0781, over 4283243.48 frames. ], batch size: 230, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:04:49,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. 
limit=22.5 2023-06-22 06:04:51,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=910374.0, ans=0.125 2023-06-22 06:05:51,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=910494.0, ans=0.125 2023-06-22 06:05:51,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=910494.0, ans=0.125 2023-06-22 06:05:58,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=910494.0, ans=0.2 2023-06-22 06:06:42,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=910554.0, ans=0.05 2023-06-22 06:07:10,297 INFO [train.py:996] (3/4) Epoch 5, batch 29800, loss[loss=0.2339, simple_loss=0.304, pruned_loss=0.08192, over 21855.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3135, pruned_loss=0.07983, over 4292310.33 frames. ], batch size: 414, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:07:14,157 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-22 06:07:40,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.449e+02 2.699e+02 3.078e+02 3.997e+02, threshold=5.399e+02, percent-clipped=0.0 2023-06-22 06:09:21,739 INFO [train.py:996] (3/4) Epoch 5, batch 29850, loss[loss=0.2651, simple_loss=0.3248, pruned_loss=0.1027, over 21934.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3097, pruned_loss=0.07762, over 4286310.12 frames. ], batch size: 107, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:10:02,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=22.5 2023-06-22 06:10:19,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=911034.0, ans=0.125 2023-06-22 06:10:49,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-22 06:10:57,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=911154.0, ans=0.0 2023-06-22 06:11:46,336 INFO [train.py:996] (3/4) Epoch 5, batch 29900, loss[loss=0.225, simple_loss=0.2938, pruned_loss=0.07807, over 21498.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3082, pruned_loss=0.07884, over 4289574.06 frames. ], batch size: 211, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:12:00,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=911274.0, ans=0.125 2023-06-22 06:12:18,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.539e+02 3.015e+02 3.692e+02 5.996e+02, threshold=6.029e+02, percent-clipped=2.0 2023-06-22 06:12:39,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=911394.0, ans=0.1 2023-06-22 06:14:06,410 INFO [train.py:996] (3/4) Epoch 5, batch 29950, loss[loss=0.2843, simple_loss=0.35, pruned_loss=0.1093, over 21572.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3129, pruned_loss=0.08295, over 4288116.63 frames. 
], batch size: 194, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:14:47,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-22 06:16:11,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=911814.0, ans=0.125 2023-06-22 06:16:32,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=911874.0, ans=0.125 2023-06-22 06:16:33,822 INFO [train.py:996] (3/4) Epoch 5, batch 30000, loss[loss=0.2423, simple_loss=0.3329, pruned_loss=0.07589, over 21727.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3153, pruned_loss=0.08307, over 4288132.50 frames. ], batch size: 441, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:16:33,822 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 06:17:15,856 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.9332, 2.9872, 2.7800, 1.8347], device='cuda:3') 2023-06-22 06:17:18,836 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2496, simple_loss=0.3465, pruned_loss=0.07629, over 1796401.00 frames. 2023-06-22 06:17:18,837 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-22 06:17:54,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.727e+02 3.282e+02 4.075e+02 7.165e+02, threshold=6.565e+02, percent-clipped=2.0 2023-06-22 06:18:14,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=911994.0, ans=0.1 2023-06-22 06:18:34,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=912054.0, ans=0.0 2023-06-22 06:19:40,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=912114.0, ans=0.125 2023-06-22 06:20:02,386 INFO [train.py:996] (3/4) Epoch 5, batch 30050, loss[loss=0.2538, simple_loss=0.3775, pruned_loss=0.0651, over 20743.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3162, pruned_loss=0.07916, over 4278980.83 frames. ], batch size: 607, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:20:51,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=912294.0, ans=0.1 2023-06-22 06:21:21,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=912354.0, ans=0.125 2023-06-22 06:21:26,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=912354.0, ans=0.125 2023-06-22 06:22:04,889 INFO [train.py:996] (3/4) Epoch 5, batch 30100, loss[loss=0.225, simple_loss=0.301, pruned_loss=0.07449, over 21486.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3142, pruned_loss=0.07977, over 4271826.21 frames. 
], batch size: 389, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:22:21,790 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.782e+02 3.150e+02 3.648e+02 6.507e+02, threshold=6.300e+02, percent-clipped=0.0 2023-06-22 06:22:32,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=912534.0, ans=0.1 2023-06-22 06:23:05,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=912654.0, ans=0.2 2023-06-22 06:23:55,574 INFO [train.py:996] (3/4) Epoch 5, batch 30150, loss[loss=0.3087, simple_loss=0.3526, pruned_loss=0.1324, over 21418.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3107, pruned_loss=0.08097, over 4274278.63 frames. ], batch size: 510, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:23:56,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=912774.0, ans=0.1 2023-06-22 06:25:31,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=912894.0, ans=10.0 2023-06-22 06:26:30,433 INFO [train.py:996] (3/4) Epoch 5, batch 30200, loss[loss=0.2115, simple_loss=0.3031, pruned_loss=0.06, over 21781.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3117, pruned_loss=0.07963, over 4279020.42 frames. ], batch size: 282, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:26:39,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-06-22 06:26:59,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.559e+02 2.949e+02 3.535e+02 6.486e+02, threshold=5.897e+02, percent-clipped=1.0 2023-06-22 06:28:02,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-22 06:28:18,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=913314.0, ans=0.025 2023-06-22 06:28:18,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=913314.0, ans=0.125 2023-06-22 06:28:33,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=913314.0, ans=0.0 2023-06-22 06:28:49,078 INFO [train.py:996] (3/4) Epoch 5, batch 30250, loss[loss=0.2388, simple_loss=0.3371, pruned_loss=0.07023, over 21390.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3183, pruned_loss=0.08091, over 4278336.45 frames. ], batch size: 194, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:30:19,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=913494.0, ans=0.0 2023-06-22 06:31:21,063 INFO [train.py:996] (3/4) Epoch 5, batch 30300, loss[loss=0.2063, simple_loss=0.2717, pruned_loss=0.0705, over 21750.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3169, pruned_loss=0.0815, over 4278035.45 frames. 
], batch size: 112, lr: 5.91e-03, grad_scale: 32.0 2023-06-22 06:31:24,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=913674.0, ans=0.125 2023-06-22 06:31:59,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.693e+02 3.085e+02 3.904e+02 6.808e+02, threshold=6.171e+02, percent-clipped=2.0 2023-06-22 06:33:44,895 INFO [train.py:996] (3/4) Epoch 5, batch 30350, loss[loss=0.2361, simple_loss=0.3113, pruned_loss=0.08045, over 21687.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3188, pruned_loss=0.08305, over 4283375.17 frames. ], batch size: 298, lr: 5.91e-03, grad_scale: 32.0 2023-06-22 06:34:18,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=914034.0, ans=0.125 2023-06-22 06:36:24,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=914214.0, ans=0.125 2023-06-22 06:36:45,046 INFO [train.py:996] (3/4) Epoch 5, batch 30400, loss[loss=0.2295, simple_loss=0.2864, pruned_loss=0.08632, over 20215.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3142, pruned_loss=0.08172, over 4274826.53 frames. ], batch size: 703, lr: 5.91e-03, grad_scale: 32.0 2023-06-22 06:37:06,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=914274.0, ans=0.1 2023-06-22 06:37:53,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.876e+02 3.400e+02 4.558e+02 7.300e+02, threshold=6.801e+02, percent-clipped=4.0 2023-06-22 06:38:08,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=914334.0, ans=0.125 2023-06-22 06:39:30,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=914454.0, ans=0.0 2023-06-22 06:41:03,384 INFO [train.py:996] (3/4) Epoch 5, batch 30450, loss[loss=0.3114, simple_loss=0.4121, pruned_loss=0.1053, over 19697.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3154, pruned_loss=0.08197, over 4211790.49 frames. ], batch size: 702, lr: 5.91e-03, grad_scale: 16.0 2023-06-22 06:43:17,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=914694.0, ans=0.1 2023-06-22 06:43:48,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=914694.0, ans=0.125 2023-06-22 06:43:59,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=914754.0, ans=0.0 2023-06-22 06:47:12,994 INFO [train.py:996] (3/4) Epoch 6, batch 0, loss[loss=0.221, simple_loss=0.2836, pruned_loss=0.07921, over 21732.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2836, pruned_loss=0.07921, over 21732.00 frames. ], batch size: 317, lr: 5.35e-03, grad_scale: 32.0 2023-06-22 06:47:12,995 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 06:48:06,637 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2383, simple_loss=0.345, pruned_loss=0.06584, over 1796401.00 frames. 
2023-06-22 06:48:06,639 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-22 06:48:29,956 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:48:50,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=914898.0, ans=0.1 2023-06-22 06:48:52,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 5.006e+02 6.285e+02 8.348e+02 2.118e+03, threshold=1.257e+03, percent-clipped=42.0 2023-06-22 06:48:57,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=914958.0, ans=0.1 2023-06-22 06:50:15,984 INFO [train.py:996] (3/4) Epoch 6, batch 50, loss[loss=0.3027, simple_loss=0.371, pruned_loss=0.1172, over 21482.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3256, pruned_loss=0.08651, over 970624.88 frames. ], batch size: 471, lr: 5.35e-03, grad_scale: 16.0 2023-06-22 06:50:19,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=915138.0, ans=0.125 2023-06-22 06:50:21,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-06-22 06:50:22,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=915138.0, ans=0.125 2023-06-22 06:52:24,631 INFO [train.py:996] (3/4) Epoch 6, batch 100, loss[loss=0.2547, simple_loss=0.3337, pruned_loss=0.0879, over 21510.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3329, pruned_loss=0.08461, over 1713400.33 frames. ], batch size: 194, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 06:52:37,483 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-22 06:52:38,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=915498.0, ans=0.125 2023-06-22 06:53:09,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=915498.0, ans=0.125 2023-06-22 06:53:10,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 2.323e+02 2.634e+02 3.053e+02 4.648e+02, threshold=5.268e+02, percent-clipped=0.0 2023-06-22 06:53:25,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-22 06:54:19,689 INFO [train.py:996] (3/4) Epoch 6, batch 150, loss[loss=0.2601, simple_loss=0.3352, pruned_loss=0.09249, over 21791.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3331, pruned_loss=0.08365, over 2276076.35 frames. ], batch size: 124, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 06:55:45,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=915918.0, ans=0.125 2023-06-22 06:55:54,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=915918.0, ans=0.125 2023-06-22 06:56:43,939 INFO [train.py:996] (3/4) Epoch 6, batch 200, loss[loss=0.2648, simple_loss=0.3646, pruned_loss=0.0825, over 21209.00 frames. 
], tot_loss[loss=0.2453, simple_loss=0.328, pruned_loss=0.08132, over 2716151.43 frames. ], batch size: 548, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 06:57:42,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.601e+02 2.986e+02 3.624e+02 6.597e+02, threshold=5.972e+02, percent-clipped=3.0 2023-06-22 06:58:41,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=916278.0, ans=0.0 2023-06-22 06:58:53,834 INFO [train.py:996] (3/4) Epoch 6, batch 250, loss[loss=0.2218, simple_loss=0.3175, pruned_loss=0.06303, over 21666.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3236, pruned_loss=0.08003, over 3059453.49 frames. ], batch size: 414, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 06:59:20,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-22 06:59:40,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=916398.0, ans=0.125 2023-06-22 07:00:03,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=916458.0, ans=0.0 2023-06-22 07:00:11,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=916518.0, ans=0.2 2023-06-22 07:00:49,905 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-06-22 07:01:15,605 INFO [train.py:996] (3/4) Epoch 6, batch 300, loss[loss=0.2621, simple_loss=0.3331, pruned_loss=0.09551, over 21475.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3184, pruned_loss=0.08011, over 3320464.06 frames. ], batch size: 194, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 07:01:42,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=916638.0, ans=10.0 2023-06-22 07:02:11,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.629e+02 3.080e+02 3.512e+02 4.991e+02, threshold=6.161e+02, percent-clipped=0.0 2023-06-22 07:02:20,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=916758.0, ans=0.125 2023-06-22 07:02:48,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=916878.0, ans=0.1 2023-06-22 07:02:58,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=916878.0, ans=0.125 2023-06-22 07:03:36,291 INFO [train.py:996] (3/4) Epoch 6, batch 350, loss[loss=0.2387, simple_loss=0.3018, pruned_loss=0.08782, over 20004.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3108, pruned_loss=0.07939, over 3532525.76 frames. ], batch size: 703, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 07:04:05,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=916998.0, ans=0.04949747468305833 2023-06-22 07:04:14,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.98 vs. 
limit=22.5 2023-06-22 07:04:30,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=917058.0, ans=0.125 2023-06-22 07:04:36,736 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-22 07:04:57,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-22 07:05:03,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=917118.0, ans=0.015 2023-06-22 07:05:03,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=917118.0, ans=0.035 2023-06-22 07:05:42,666 INFO [train.py:996] (3/4) Epoch 6, batch 400, loss[loss=0.2334, simple_loss=0.3334, pruned_loss=0.06671, over 21821.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3037, pruned_loss=0.07811, over 3683666.09 frames. ], batch size: 316, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:05:59,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=917238.0, ans=0.2 2023-06-22 07:06:09,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=917238.0, ans=0.125 2023-06-22 07:06:41,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.619e+02 3.011e+02 3.416e+02 5.139e+02, threshold=6.021e+02, percent-clipped=0.0 2023-06-22 07:06:57,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-22 07:07:48,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=917478.0, ans=0.0 2023-06-22 07:07:52,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=917478.0, ans=0.125 2023-06-22 07:07:52,961 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:07:55,473 INFO [train.py:996] (3/4) Epoch 6, batch 450, loss[loss=0.2059, simple_loss=0.3023, pruned_loss=0.05479, over 21782.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2999, pruned_loss=0.07674, over 3818790.43 frames. ], batch size: 371, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:08:34,895 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.36 vs. limit=15.0 2023-06-22 07:09:00,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=917658.0, ans=0.0 2023-06-22 07:09:36,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=917718.0, ans=0.0 2023-06-22 07:10:16,311 INFO [train.py:996] (3/4) Epoch 6, batch 500, loss[loss=0.2199, simple_loss=0.2924, pruned_loss=0.07369, over 21780.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2999, pruned_loss=0.07526, over 3920754.20 frames. 
], batch size: 247, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:10:51,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=917898.0, ans=0.125 2023-06-22 07:10:57,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=917898.0, ans=0.2 2023-06-22 07:11:01,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.514e+02 2.945e+02 3.485e+02 5.759e+02, threshold=5.890e+02, percent-clipped=0.0 2023-06-22 07:11:07,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=917958.0, ans=0.125 2023-06-22 07:11:26,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=918018.0, ans=0.1 2023-06-22 07:12:21,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=918138.0, ans=0.1 2023-06-22 07:12:28,431 INFO [train.py:996] (3/4) Epoch 6, batch 550, loss[loss=0.1993, simple_loss=0.2774, pruned_loss=0.06054, over 21468.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3052, pruned_loss=0.07553, over 3987439.19 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:12:50,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-22 07:12:57,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-22 07:13:15,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=918198.0, ans=0.125 2023-06-22 07:13:22,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=918258.0, ans=0.2 2023-06-22 07:14:38,209 INFO [train.py:996] (3/4) Epoch 6, batch 600, loss[loss=0.1852, simple_loss=0.2958, pruned_loss=0.0373, over 20809.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3086, pruned_loss=0.07604, over 4046674.24 frames. ], batch size: 608, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:15:22,304 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.705e+02 3.324e+02 3.965e+02 6.330e+02, threshold=6.647e+02, percent-clipped=3.0 2023-06-22 07:15:53,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=918618.0, ans=0.05 2023-06-22 07:16:18,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=918618.0, ans=0.125 2023-06-22 07:16:49,482 INFO [train.py:996] (3/4) Epoch 6, batch 650, loss[loss=0.2199, simple_loss=0.2917, pruned_loss=0.07408, over 21857.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3101, pruned_loss=0.07639, over 4102907.80 frames. 
], batch size: 124, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:17:10,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=918738.0, ans=0.0 2023-06-22 07:17:40,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=918858.0, ans=0.125 2023-06-22 07:17:50,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=918858.0, ans=0.2 2023-06-22 07:18:29,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=918978.0, ans=0.125 2023-06-22 07:18:53,910 INFO [train.py:996] (3/4) Epoch 6, batch 700, loss[loss=0.2228, simple_loss=0.3065, pruned_loss=0.06951, over 21771.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3103, pruned_loss=0.0777, over 4145983.20 frames. ], batch size: 112, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:19:29,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=919098.0, ans=0.125 2023-06-22 07:19:46,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.473e+02 2.771e+02 3.367e+02 4.695e+02, threshold=5.542e+02, percent-clipped=0.0 2023-06-22 07:20:02,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=919158.0, ans=0.1 2023-06-22 07:20:07,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=919158.0, ans=0.0 2023-06-22 07:20:18,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=919158.0, ans=0.07 2023-06-22 07:21:03,761 INFO [train.py:996] (3/4) Epoch 6, batch 750, loss[loss=0.1921, simple_loss=0.2599, pruned_loss=0.06215, over 21360.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3117, pruned_loss=0.07908, over 4181584.95 frames. ], batch size: 194, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:22:13,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=919458.0, ans=0.125 2023-06-22 07:22:18,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.29 vs. limit=15.0 2023-06-22 07:22:37,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=919578.0, ans=0.0 2023-06-22 07:23:08,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=919578.0, ans=0.2 2023-06-22 07:23:10,545 INFO [train.py:996] (3/4) Epoch 6, batch 800, loss[loss=0.2241, simple_loss=0.2936, pruned_loss=0.07731, over 21947.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3095, pruned_loss=0.07882, over 4201527.98 frames. ], batch size: 333, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:23:32,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.85 vs. 
limit=15.0 2023-06-22 07:24:08,543 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.537e+02 3.024e+02 3.645e+02 6.511e+02, threshold=6.048e+02, percent-clipped=3.0 2023-06-22 07:25:16,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=919878.0, ans=0.125 2023-06-22 07:25:23,384 INFO [train.py:996] (3/4) Epoch 6, batch 850, loss[loss=0.2242, simple_loss=0.2976, pruned_loss=0.07539, over 21240.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3061, pruned_loss=0.07845, over 4226226.13 frames. ], batch size: 144, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:27:43,055 INFO [train.py:996] (3/4) Epoch 6, batch 900, loss[loss=0.1945, simple_loss=0.2668, pruned_loss=0.06108, over 21760.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3032, pruned_loss=0.07699, over 4230153.65 frames. ], batch size: 124, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:27:52,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=920238.0, ans=0.0 2023-06-22 07:28:24,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.591e+02 2.994e+02 3.530e+02 5.655e+02, threshold=5.988e+02, percent-clipped=0.0 2023-06-22 07:28:42,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=920358.0, ans=0.95 2023-06-22 07:29:48,823 INFO [train.py:996] (3/4) Epoch 6, batch 950, loss[loss=0.2236, simple_loss=0.3108, pruned_loss=0.06822, over 21762.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3012, pruned_loss=0.07702, over 4240159.47 frames. ], batch size: 247, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:29:59,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=920538.0, ans=0.125 2023-06-22 07:30:25,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=920598.0, ans=15.0 2023-06-22 07:31:02,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=920658.0, ans=0.125 2023-06-22 07:31:41,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=920778.0, ans=0.1 2023-06-22 07:31:43,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=920778.0, ans=0.0 2023-06-22 07:31:57,638 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:32:08,535 INFO [train.py:996] (3/4) Epoch 6, batch 1000, loss[loss=0.2313, simple_loss=0.3237, pruned_loss=0.06942, over 21717.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3007, pruned_loss=0.07706, over 4257911.67 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:33:01,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.550e+02 2.798e+02 3.235e+02 6.072e+02, threshold=5.596e+02, percent-clipped=1.0 2023-06-22 07:33:39,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=921018.0, ans=0.2 2023-06-22 07:34:20,078 INFO [train.py:996] (3/4) Epoch 6, batch 1050, loss[loss=0.3169, simple_loss=0.3651, pruned_loss=0.1344, over 21456.00 frames. 
], tot_loss[loss=0.2289, simple_loss=0.3017, pruned_loss=0.07809, over 4260804.40 frames. ], batch size: 507, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:34:29,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=921138.0, ans=0.1 2023-06-22 07:34:59,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=921258.0, ans=0.07 2023-06-22 07:35:58,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=921378.0, ans=0.125 2023-06-22 07:36:26,860 INFO [train.py:996] (3/4) Epoch 6, batch 1100, loss[loss=0.2815, simple_loss=0.3474, pruned_loss=0.1078, over 21577.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3022, pruned_loss=0.07736, over 4267571.42 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:37:18,785 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.723e+02 3.096e+02 3.940e+02 7.393e+02, threshold=6.192e+02, percent-clipped=9.0 2023-06-22 07:37:45,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-22 07:38:44,052 INFO [train.py:996] (3/4) Epoch 6, batch 1150, loss[loss=0.2825, simple_loss=0.3396, pruned_loss=0.1128, over 21532.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3028, pruned_loss=0.0767, over 4274775.76 frames. ], batch size: 471, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:39:19,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=921798.0, ans=0.0 2023-06-22 07:39:53,326 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:40:13,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=921858.0, ans=0.0 2023-06-22 07:40:20,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=921918.0, ans=0.125 2023-06-22 07:40:37,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=921918.0, ans=0.2 2023-06-22 07:40:51,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=921978.0, ans=0.0 2023-06-22 07:41:01,465 INFO [train.py:996] (3/4) Epoch 6, batch 1200, loss[loss=0.1971, simple_loss=0.2384, pruned_loss=0.07789, over 19970.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3019, pruned_loss=0.07582, over 4277739.37 frames. ], batch size: 703, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:42:06,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=922158.0, ans=0.1 2023-06-22 07:42:14,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.71 vs. limit=12.0 2023-06-22 07:42:14,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 2.527e+02 2.979e+02 3.746e+02 6.173e+02, threshold=5.958e+02, percent-clipped=0.0 2023-06-22 07:43:23,983 INFO [train.py:996] (3/4) Epoch 6, batch 1250, loss[loss=0.254, simple_loss=0.3195, pruned_loss=0.09422, over 21273.00 frames. 
], tot_loss[loss=0.2292, simple_loss=0.3035, pruned_loss=0.07744, over 4276826.97 frames. ], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:43:45,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=922338.0, ans=0.2 2023-06-22 07:43:46,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-22 07:45:10,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.86 vs. limit=10.0 2023-06-22 07:45:11,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=922518.0, ans=0.125 2023-06-22 07:45:20,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=922578.0, ans=0.0 2023-06-22 07:45:34,452 INFO [train.py:996] (3/4) Epoch 6, batch 1300, loss[loss=0.2564, simple_loss=0.3426, pruned_loss=0.08506, over 21771.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3054, pruned_loss=0.0781, over 4277694.80 frames. ], batch size: 351, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:46:09,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=922698.0, ans=0.025 2023-06-22 07:46:11,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=922698.0, ans=0.1 2023-06-22 07:46:20,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=922698.0, ans=0.125 2023-06-22 07:46:54,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.638e+02 3.096e+02 3.825e+02 7.395e+02, threshold=6.191e+02, percent-clipped=3.0 2023-06-22 07:46:59,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=922758.0, ans=0.5 2023-06-22 07:47:13,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=15.0 2023-06-22 07:47:19,465 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-22 07:47:34,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-22 07:47:42,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=922878.0, ans=0.125 2023-06-22 07:47:59,660 INFO [train.py:996] (3/4) Epoch 6, batch 1350, loss[loss=0.2796, simple_loss=0.3578, pruned_loss=0.1007, over 21473.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3057, pruned_loss=0.07824, over 4286700.39 frames. 
], batch size: 471, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:48:56,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=922998.0, ans=0.125 2023-06-22 07:49:29,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=923058.0, ans=0.0 2023-06-22 07:49:29,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-22 07:49:45,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923118.0, ans=0.1 2023-06-22 07:49:55,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=923178.0, ans=0.125 2023-06-22 07:50:04,093 INFO [train.py:996] (3/4) Epoch 6, batch 1400, loss[loss=0.255, simple_loss=0.3403, pruned_loss=0.08487, over 21653.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.304, pruned_loss=0.07715, over 4284493.76 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:50:05,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=923238.0, ans=0.125 2023-06-22 07:50:07,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=923238.0, ans=0.95 2023-06-22 07:51:10,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.497e+02 2.735e+02 3.069e+02 5.769e+02, threshold=5.470e+02, percent-clipped=0.0 2023-06-22 07:51:14,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-22 07:51:32,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=923418.0, ans=0.125 2023-06-22 07:52:19,681 INFO [train.py:996] (3/4) Epoch 6, batch 1450, loss[loss=0.2251, simple_loss=0.3064, pruned_loss=0.07192, over 21702.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3049, pruned_loss=0.07817, over 4278880.67 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:53:37,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=923718.0, ans=22.5 2023-06-22 07:54:36,511 INFO [train.py:996] (3/4) Epoch 6, batch 1500, loss[loss=0.2115, simple_loss=0.2809, pruned_loss=0.07102, over 21811.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3066, pruned_loss=0.07948, over 4285818.94 frames. ], batch size: 282, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:55:31,682 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.593e+02 2.969e+02 3.439e+02 4.928e+02, threshold=5.939e+02, percent-clipped=0.0 2023-06-22 07:55:58,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=924018.0, ans=0.125 2023-06-22 07:56:39,030 INFO [train.py:996] (3/4) Epoch 6, batch 1550, loss[loss=0.1388, simple_loss=0.2003, pruned_loss=0.03862, over 17243.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.305, pruned_loss=0.07813, over 4288087.40 frames. 
], batch size: 62, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:56:53,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=924138.0, ans=0.0 2023-06-22 07:58:21,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=924318.0, ans=0.05 2023-06-22 07:58:26,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-22 07:58:27,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=924378.0, ans=0.0 2023-06-22 07:58:54,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-22 07:59:11,424 INFO [train.py:996] (3/4) Epoch 6, batch 1600, loss[loss=0.3048, simple_loss=0.3697, pruned_loss=0.1199, over 21433.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3041, pruned_loss=0.07721, over 4283101.87 frames. ], batch size: 507, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:59:49,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=924498.0, ans=0.125 2023-06-22 08:00:09,257 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.447e+02 2.889e+02 3.506e+02 5.752e+02, threshold=5.778e+02, percent-clipped=0.0 2023-06-22 08:00:11,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=924558.0, ans=0.125 2023-06-22 08:00:14,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=924558.0, ans=0.2 2023-06-22 08:00:29,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=924618.0, ans=0.125 2023-06-22 08:01:23,864 INFO [train.py:996] (3/4) Epoch 6, batch 1650, loss[loss=0.182, simple_loss=0.248, pruned_loss=0.05803, over 21448.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3028, pruned_loss=0.07715, over 4279341.49 frames. ], batch size: 230, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:01:59,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=924798.0, ans=0.0 2023-06-22 08:03:01,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=924918.0, ans=0.125 2023-06-22 08:03:15,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=924978.0, ans=0.0 2023-06-22 08:03:29,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=924978.0, ans=0.0 2023-06-22 08:03:42,434 INFO [train.py:996] (3/4) Epoch 6, batch 1700, loss[loss=0.2244, simple_loss=0.295, pruned_loss=0.07696, over 21682.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3067, pruned_loss=0.07853, over 4283075.78 frames. ], batch size: 230, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:04:18,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. 
limit=15.0 2023-06-22 08:04:25,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=925098.0, ans=0.125 2023-06-22 08:04:42,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.575e+02 2.876e+02 3.379e+02 6.371e+02, threshold=5.752e+02, percent-clipped=1.0 2023-06-22 08:05:56,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=925278.0, ans=0.125 2023-06-22 08:06:10,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=925278.0, ans=10.0 2023-06-22 08:06:14,609 INFO [train.py:996] (3/4) Epoch 6, batch 1750, loss[loss=0.2042, simple_loss=0.2811, pruned_loss=0.0636, over 21474.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3036, pruned_loss=0.07556, over 4265967.45 frames. ], batch size: 211, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:06:15,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=925338.0, ans=0.0 2023-06-22 08:06:22,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=925338.0, ans=0.2 2023-06-22 08:06:44,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=925398.0, ans=0.0 2023-06-22 08:07:05,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-22 08:07:06,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=925398.0, ans=0.2 2023-06-22 08:07:22,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=925458.0, ans=0.025 2023-06-22 08:08:31,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=925578.0, ans=0.125 2023-06-22 08:08:37,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=925578.0, ans=0.0 2023-06-22 08:08:43,945 INFO [train.py:996] (3/4) Epoch 6, batch 1800, loss[loss=0.2027, simple_loss=0.3008, pruned_loss=0.05227, over 21746.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3021, pruned_loss=0.07366, over 4274164.05 frames. 
], batch size: 352, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:08:58,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=925638.0, ans=0.0 2023-06-22 08:09:22,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=925698.0, ans=0.125 2023-06-22 08:09:31,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=925758.0, ans=0.0 2023-06-22 08:09:38,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.513e+02 3.137e+02 3.734e+02 6.683e+02, threshold=6.274e+02, percent-clipped=3.0 2023-06-22 08:09:54,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=925758.0, ans=0.0 2023-06-22 08:10:29,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=925818.0, ans=0.0 2023-06-22 08:10:58,596 INFO [train.py:996] (3/4) Epoch 6, batch 1850, loss[loss=0.2122, simple_loss=0.3044, pruned_loss=0.06, over 21784.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3055, pruned_loss=0.07317, over 4277672.44 frames. ], batch size: 282, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:11:00,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=925938.0, ans=0.125 2023-06-22 08:11:03,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=925938.0, ans=0.0 2023-06-22 08:11:16,329 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.97 vs. limit=22.5 2023-06-22 08:12:09,697 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:13:07,278 INFO [train.py:996] (3/4) Epoch 6, batch 1900, loss[loss=0.175, simple_loss=0.2607, pruned_loss=0.04465, over 21629.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3051, pruned_loss=0.0734, over 4276222.91 frames. 
], batch size: 230, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:13:35,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=926238.0, ans=0.95 2023-06-22 08:13:57,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=926358.0, ans=0.02 2023-06-22 08:13:57,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=926358.0, ans=0.1 2023-06-22 08:13:59,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.371e+02 2.637e+02 3.288e+02 5.530e+02, threshold=5.274e+02, percent-clipped=0.0 2023-06-22 08:14:03,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=926358.0, ans=0.125 2023-06-22 08:14:46,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=926418.0, ans=0.09899494936611666 2023-06-22 08:15:06,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=926478.0, ans=0.0 2023-06-22 08:15:08,553 INFO [train.py:996] (3/4) Epoch 6, batch 1950, loss[loss=0.255, simple_loss=0.3284, pruned_loss=0.09074, over 21365.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3033, pruned_loss=0.07355, over 4283557.15 frames. ], batch size: 549, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:15:25,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-22 08:15:38,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-06-22 08:15:51,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=926598.0, ans=0.05 2023-06-22 08:15:59,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=926658.0, ans=0.1 2023-06-22 08:17:00,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=926718.0, ans=0.125 2023-06-22 08:17:07,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=926778.0, ans=0.125 2023-06-22 08:17:25,659 INFO [train.py:996] (3/4) Epoch 6, batch 2000, loss[loss=0.1651, simple_loss=0.2411, pruned_loss=0.04458, over 21306.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2996, pruned_loss=0.07224, over 4280654.33 frames. 
], batch size: 131, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:18:41,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.517e+02 3.003e+02 3.680e+02 6.988e+02, threshold=6.006e+02, percent-clipped=2.0 2023-06-22 08:18:43,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=926958.0, ans=0.125 2023-06-22 08:19:06,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=927018.0, ans=0.0 2023-06-22 08:19:22,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=927018.0, ans=0.2 2023-06-22 08:19:41,271 INFO [train.py:996] (3/4) Epoch 6, batch 2050, loss[loss=0.236, simple_loss=0.3169, pruned_loss=0.07755, over 21769.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3003, pruned_loss=0.07206, over 4271723.12 frames. ], batch size: 247, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:20:32,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.55 vs. limit=15.0 2023-06-22 08:21:30,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=927318.0, ans=0.125 2023-06-22 08:21:37,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=927378.0, ans=0.0 2023-06-22 08:21:46,625 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:21:53,269 INFO [train.py:996] (3/4) Epoch 6, batch 2100, loss[loss=0.2375, simple_loss=0.3062, pruned_loss=0.08437, over 20724.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3014, pruned_loss=0.07385, over 4278100.06 frames. ], batch size: 607, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:22:58,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=927558.0, ans=0.125 2023-06-22 08:22:58,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=927558.0, ans=0.0 2023-06-22 08:23:02,544 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.488e+02 2.797e+02 3.181e+02 4.805e+02, threshold=5.593e+02, percent-clipped=0.0 2023-06-22 08:23:19,494 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:23:32,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=927618.0, ans=0.125 2023-06-22 08:23:55,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.87 vs. limit=12.0 2023-06-22 08:23:55,927 INFO [train.py:996] (3/4) Epoch 6, batch 2150, loss[loss=0.22, simple_loss=0.2808, pruned_loss=0.07965, over 21493.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3015, pruned_loss=0.07537, over 4278717.53 frames. ], batch size: 441, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:25:09,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=927858.0, ans=0.125 2023-06-22 08:26:29,706 INFO [train.py:996] (3/4) Epoch 6, batch 2200, loss[loss=0.1998, simple_loss=0.2718, pruned_loss=0.06387, over 21178.00 frames. 
], tot_loss[loss=0.2272, simple_loss=0.3029, pruned_loss=0.07574, over 4269999.98 frames. ], batch size: 608, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:27:14,157 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.28 vs. limit=10.0 2023-06-22 08:27:19,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.524e+02 2.938e+02 3.360e+02 6.065e+02, threshold=5.877e+02, percent-clipped=1.0 2023-06-22 08:27:50,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=928218.0, ans=0.2 2023-06-22 08:28:07,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=928278.0, ans=0.125 2023-06-22 08:28:32,331 INFO [train.py:996] (3/4) Epoch 6, batch 2250, loss[loss=0.1873, simple_loss=0.2557, pruned_loss=0.0595, over 21813.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2999, pruned_loss=0.0737, over 4272499.55 frames. ], batch size: 98, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:29:16,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=928398.0, ans=0.125 2023-06-22 08:29:59,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-22 08:30:02,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.94 vs. limit=8.0 2023-06-22 08:30:14,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=928578.0, ans=0.0 2023-06-22 08:30:23,221 INFO [train.py:996] (3/4) Epoch 6, batch 2300, loss[loss=0.2092, simple_loss=0.2772, pruned_loss=0.07062, over 21838.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2967, pruned_loss=0.07321, over 4268420.86 frames. ], batch size: 107, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:30:52,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=928638.0, ans=0.125 2023-06-22 08:31:09,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=22.5 2023-06-22 08:31:21,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.05 vs. limit=15.0 2023-06-22 08:31:28,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.418e+02 2.777e+02 3.405e+02 6.239e+02, threshold=5.554e+02, percent-clipped=2.0 2023-06-22 08:32:27,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.13 vs. limit=15.0 2023-06-22 08:32:31,792 INFO [train.py:996] (3/4) Epoch 6, batch 2350, loss[loss=0.1864, simple_loss=0.2527, pruned_loss=0.06, over 21347.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.294, pruned_loss=0.07362, over 4265230.29 frames. 
], batch size: 211, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:33:00,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=928938.0, ans=0.025 2023-06-22 08:34:16,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-22 08:34:19,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=929178.0, ans=0.125 2023-06-22 08:34:33,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=929178.0, ans=0.2 2023-06-22 08:34:42,152 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:34:49,993 INFO [train.py:996] (3/4) Epoch 6, batch 2400, loss[loss=0.2792, simple_loss=0.3455, pruned_loss=0.1065, over 21592.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.296, pruned_loss=0.07589, over 4270947.40 frames. ], batch size: 415, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:34:50,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=929238.0, ans=0.0 2023-06-22 08:36:02,025 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.593e+02 2.927e+02 3.472e+02 6.319e+02, threshold=5.855e+02, percent-clipped=5.0 2023-06-22 08:37:11,337 INFO [train.py:996] (3/4) Epoch 6, batch 2450, loss[loss=0.2306, simple_loss=0.3547, pruned_loss=0.05328, over 20719.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2997, pruned_loss=0.07806, over 4273523.29 frames. ], batch size: 608, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:37:43,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=929598.0, ans=0.125 2023-06-22 08:38:25,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-22 08:39:13,826 INFO [train.py:996] (3/4) Epoch 6, batch 2500, loss[loss=0.2277, simple_loss=0.3258, pruned_loss=0.06477, over 21164.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3008, pruned_loss=0.07876, over 4282340.25 frames. ], batch size: 143, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:39:36,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=929838.0, ans=0.125 2023-06-22 08:39:46,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2023-06-22 08:39:58,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=929898.0, ans=0.1 2023-06-22 08:40:01,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=929898.0, ans=0.125 2023-06-22 08:40:15,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.19 vs. 
limit=15.0 2023-06-22 08:40:23,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.580e+02 2.958e+02 3.428e+02 5.178e+02, threshold=5.916e+02, percent-clipped=0.0 2023-06-22 08:41:28,672 INFO [train.py:996] (3/4) Epoch 6, batch 2550, loss[loss=0.2227, simple_loss=0.3186, pruned_loss=0.06341, over 21720.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2997, pruned_loss=0.0776, over 4278383.65 frames. ], batch size: 247, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:41:31,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=930138.0, ans=0.0 2023-06-22 08:43:51,034 INFO [train.py:996] (3/4) Epoch 6, batch 2600, loss[loss=0.2512, simple_loss=0.3229, pruned_loss=0.08975, over 21930.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3019, pruned_loss=0.07788, over 4274614.04 frames. ], batch size: 372, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:44:29,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=930498.0, ans=0.0 2023-06-22 08:45:08,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.524e+02 2.947e+02 3.277e+02 5.096e+02, threshold=5.894e+02, percent-clipped=0.0 2023-06-22 08:45:17,825 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=5.428e-03 2023-06-22 08:45:21,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.38 vs. limit=15.0 2023-06-22 08:46:11,835 INFO [train.py:996] (3/4) Epoch 6, batch 2650, loss[loss=0.2274, simple_loss=0.2981, pruned_loss=0.07833, over 21914.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3017, pruned_loss=0.07879, over 4283541.97 frames. ], batch size: 351, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:46:44,237 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:46:51,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-22 08:47:26,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=930858.0, ans=0.125 2023-06-22 08:47:39,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=930918.0, ans=0.1 2023-06-22 08:47:57,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=930978.0, ans=0.125 2023-06-22 08:48:18,339 INFO [train.py:996] (3/4) Epoch 6, batch 2700, loss[loss=0.229, simple_loss=0.2999, pruned_loss=0.07907, over 21493.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2999, pruned_loss=0.0786, over 4282788.21 frames. 
], batch size: 131, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:48:18,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=931038.0, ans=0.0 2023-06-22 08:49:01,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931098.0, ans=0.1 2023-06-22 08:49:23,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931158.0, ans=0.1 2023-06-22 08:49:35,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.647e+02 2.951e+02 3.422e+02 5.387e+02, threshold=5.902e+02, percent-clipped=0.0 2023-06-22 08:49:54,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-22 08:50:25,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=931278.0, ans=0.0 2023-06-22 08:50:35,162 INFO [train.py:996] (3/4) Epoch 6, batch 2750, loss[loss=0.2403, simple_loss=0.3185, pruned_loss=0.08101, over 21797.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2985, pruned_loss=0.07772, over 4285353.44 frames. ], batch size: 124, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:52:01,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=931518.0, ans=15.0 2023-06-22 08:52:44,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=931578.0, ans=0.125 2023-06-22 08:53:01,320 INFO [train.py:996] (3/4) Epoch 6, batch 2800, loss[loss=0.2505, simple_loss=0.3158, pruned_loss=0.09254, over 21218.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3025, pruned_loss=0.07787, over 4289572.24 frames. ], batch size: 176, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:53:33,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=931638.0, ans=0.07 2023-06-22 08:53:46,065 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.87 vs. limit=15.0 2023-06-22 08:54:11,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.698e+02 3.040e+02 3.430e+02 5.325e+02, threshold=6.080e+02, percent-clipped=0.0 2023-06-22 08:54:25,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=931818.0, ans=0.125 2023-06-22 08:54:26,261 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-22 08:54:47,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=931878.0, ans=0.04949747468305833 2023-06-22 08:55:25,865 INFO [train.py:996] (3/4) Epoch 6, batch 2850, loss[loss=0.2563, simple_loss=0.3406, pruned_loss=0.08596, over 20916.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3073, pruned_loss=0.08031, over 4285304.37 frames. 
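Each ScheduledFloat record reports a regularization hyperparameter (a dropout p, a skip rate, a balancer bound) whose current value is a function of the global batch_count, which is how quantities such as the skip rates above anneal as training progresses. A minimal piecewise-linear version of the idea; the class shape and the example schedule points are assumptions for illustration:

    class ScheduledFloat:
        """A float that interpolates piecewise-linearly in batch_count
        and clamps to the endpoints outside the schedule (sketch)."""

        def __init__(self, *points):
            self.points = sorted(points)   # (batch_count, value) pairs

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    # A hypothetical skip rate annealed from 0.3 to 0.0 over 20k batches
    # would long since sit at 0.0 by the ~930k batch counts logged here.
    skip_rate = ScheduledFloat((0.0, 0.3), (20000.0, 0.0))
    print(skip_rate.value(931638.0))   # -> 0.0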
], batch size: 607, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:56:40,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=932058.0, ans=0.0 2023-06-22 08:57:09,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932118.0, ans=0.1 2023-06-22 08:57:17,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-22 08:57:34,747 INFO [train.py:996] (3/4) Epoch 6, batch 2900, loss[loss=0.2565, simple_loss=0.3125, pruned_loss=0.1003, over 21757.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3036, pruned_loss=0.07953, over 4284895.25 frames. ], batch size: 473, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:57:54,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=932238.0, ans=0.125 2023-06-22 08:57:56,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=932238.0, ans=0.125 2023-06-22 08:58:25,131 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:58:35,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=932358.0, ans=0.0 2023-06-22 08:58:38,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.660e+02 3.154e+02 3.986e+02 8.685e+02, threshold=6.308e+02, percent-clipped=2.0 2023-06-22 08:59:16,533 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:59:16,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=932418.0, ans=0.2 2023-06-22 08:59:18,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=932418.0, ans=0.0 2023-06-22 08:59:51,145 INFO [train.py:996] (3/4) Epoch 6, batch 2950, loss[loss=0.221, simple_loss=0.3128, pruned_loss=0.06454, over 21663.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3052, pruned_loss=0.07992, over 4291006.20 frames. ], batch size: 230, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 09:01:02,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=932658.0, ans=0.125 2023-06-22 09:01:10,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=932658.0, ans=0.2 2023-06-22 09:01:13,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932718.0, ans=0.1 2023-06-22 09:01:55,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932778.0, ans=0.1 2023-06-22 09:02:10,245 INFO [train.py:996] (3/4) Epoch 6, batch 3000, loss[loss=0.2281, simple_loss=0.3088, pruned_loss=0.07366, over 21816.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3095, pruned_loss=0.08039, over 4292617.83 frames. 
], batch size: 282, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:02:10,249 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 09:03:08,554 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2509, simple_loss=0.3421, pruned_loss=0.07991, over 1796401.00 frames. 2023-06-22 09:03:08,555 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-22 09:03:24,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=932898.0, ans=0.0 2023-06-22 09:03:37,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-22 09:03:57,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.599e+02 2.976e+02 3.379e+02 5.904e+02, threshold=5.951e+02, percent-clipped=0.0 2023-06-22 09:04:04,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932958.0, ans=0.1 2023-06-22 09:04:04,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=932958.0, ans=0.0 2023-06-22 09:05:27,313 INFO [train.py:996] (3/4) Epoch 6, batch 3050, loss[loss=0.1892, simple_loss=0.2796, pruned_loss=0.04944, over 21765.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3095, pruned_loss=0.07799, over 4291671.20 frames. ], batch size: 332, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:05:48,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=933198.0, ans=0.0 2023-06-22 09:07:13,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=933378.0, ans=0.125 2023-06-22 09:07:16,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=933378.0, ans=0.1 2023-06-22 09:07:39,459 INFO [train.py:996] (3/4) Epoch 6, batch 3100, loss[loss=0.2315, simple_loss=0.3246, pruned_loss=0.06915, over 21633.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3075, pruned_loss=0.07679, over 4282004.20 frames. ], batch size: 389, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:07:44,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. limit=10.0 2023-06-22 09:08:04,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-22 09:08:35,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=933558.0, ans=0.125 2023-06-22 09:08:43,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=933558.0, ans=0.125 2023-06-22 09:08:44,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.677e+02 3.194e+02 3.911e+02 7.241e+02, threshold=6.388e+02, percent-clipped=3.0 2023-06-22 09:09:58,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=933678.0, ans=0.125 2023-06-22 09:10:02,630 INFO [train.py:996] (3/4) Epoch 6, batch 3150, loss[loss=0.2561, simple_loss=0.3311, pruned_loss=0.09056, over 21397.00 frames. 
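Per the "Computing validation loss" block above, training periodically pauses to evaluate the dev set and to report the peak GPU allocation (here 23918MB). A bare-bones version of such a validation pass; the model interface returning (loss_sum, num_frames) is an assumption:

    import torch

    def compute_validation_loss(model, dev_loader, device) -> float:
        """Average per-frame dev loss plus a peak-memory report
        (sketch; the real trainer aggregates several loss components)."""
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss_sum, num_frames = model(batch)   # assumed interface
                tot_loss += float(loss_sum)
                tot_frames += num_frames
        model.train()
        if device.type == "cuda":
            mb = torch.cuda.max_memory_allocated(device) // (1024 ** 2)
            print(f"Maximum memory allocated so far is {mb}MB")
        return tot_loss / tot_frames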
], tot_loss[loss=0.2328, simple_loss=0.3101, pruned_loss=0.07781, over 4280938.31 frames. ], batch size: 159, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:10:03,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-22 09:10:21,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-06-22 09:11:45,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=933858.0, ans=0.2 2023-06-22 09:12:23,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=933978.0, ans=0.125 2023-06-22 09:12:30,963 INFO [train.py:996] (3/4) Epoch 6, batch 3200, loss[loss=0.2035, simple_loss=0.3012, pruned_loss=0.05289, over 21739.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3132, pruned_loss=0.0786, over 4284669.54 frames. ], batch size: 351, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:12:31,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=934038.0, ans=0.125 2023-06-22 09:13:30,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=15.0 2023-06-22 09:13:30,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=934098.0, ans=0.125 2023-06-22 09:13:34,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.32 vs. limit=15.0 2023-06-22 09:13:53,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.451e+02 2.830e+02 3.191e+02 4.381e+02, threshold=5.660e+02, percent-clipped=0.0 2023-06-22 09:14:31,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=934278.0, ans=0.125 2023-06-22 09:14:44,661 INFO [train.py:996] (3/4) Epoch 6, batch 3250, loss[loss=0.2262, simple_loss=0.291, pruned_loss=0.08076, over 21844.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.313, pruned_loss=0.07956, over 4281608.02 frames. ], batch size: 98, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:15:24,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=934398.0, ans=0.05 2023-06-22 09:16:35,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5 2023-06-22 09:17:13,740 INFO [train.py:996] (3/4) Epoch 6, batch 3300, loss[loss=0.2527, simple_loss=0.3444, pruned_loss=0.08052, over 21589.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3057, pruned_loss=0.07908, over 4274441.86 frames. 
], batch size: 441, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:17:18,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=934638.0, ans=0.125 2023-06-22 09:17:28,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=934638.0, ans=0.2 2023-06-22 09:17:36,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=934698.0, ans=0.0 2023-06-22 09:18:13,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=934758.0, ans=0.2 2023-06-22 09:18:24,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.651e+02 2.919e+02 3.305e+02 7.329e+02, threshold=5.839e+02, percent-clipped=1.0 2023-06-22 09:18:24,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=934758.0, ans=0.025 2023-06-22 09:18:37,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=934818.0, ans=0.125 2023-06-22 09:19:06,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.17 vs. limit=15.0 2023-06-22 09:19:38,761 INFO [train.py:996] (3/4) Epoch 6, batch 3350, loss[loss=0.245, simple_loss=0.3176, pruned_loss=0.08622, over 21381.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3086, pruned_loss=0.07914, over 4281433.43 frames. ], batch size: 131, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:20:50,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=935058.0, ans=0.125 2023-06-22 09:21:08,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=935118.0, ans=0.5 2023-06-22 09:21:11,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=935118.0, ans=0.125 2023-06-22 09:21:52,353 INFO [train.py:996] (3/4) Epoch 6, batch 3400, loss[loss=0.2044, simple_loss=0.2803, pruned_loss=0.06422, over 21364.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3095, pruned_loss=0.08088, over 4288688.20 frames. ], batch size: 144, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:22:58,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=935358.0, ans=0.025 2023-06-22 09:23:02,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.660e+02 3.094e+02 4.133e+02 6.206e+02, threshold=6.188e+02, percent-clipped=2.0 2023-06-22 09:24:19,147 INFO [train.py:996] (3/4) Epoch 6, batch 3450, loss[loss=0.2206, simple_loss=0.2641, pruned_loss=0.08856, over 20136.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3035, pruned_loss=0.07949, over 4277273.49 frames. 
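The Whitening records compare a per-module statistic of intermediate activations against a limit (e.g. metric=11.17 vs. limit=15.0 above); when the metric exceeds its limit, a penalty nudges the feature covariance back toward isotropy. One plausible metric of that kind is the ratio between the mean squared eigenvalue and the squared mean eigenvalue of the covariance, which equals 1.0 for perfectly white features; this definition is an assumption, not necessarily the one used in scaling.py:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        """Assumed whiteness measure: E[eig^2] / (E[eig])^2 of the
        per-group feature covariance (1.0 == perfectly decorrelated)."""
        num_frames, num_channels = x.shape
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        x = x - x.mean(dim=0, keepdim=True)
        metrics = []
        for g in range(num_groups):
            xg = x[:, g, :]
            cov = (xg.t() @ xg) / num_frames      # covariance estimate
            eigs = torch.linalg.eigvalsh(cov)
            metrics.append((eigs ** 2).mean() / eigs.mean() ** 2)
        return torch.stack(metrics).mean()

    feats = torch.randn(2000, 256)    # nearly-white random activations
    print(whitening_metric(feats))    # close to 1.0, well under a limit of 15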
], batch size: 707, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:24:59,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=935598.0, ans=0.125 2023-06-22 09:25:04,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=935598.0, ans=0.125 2023-06-22 09:25:39,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=935718.0, ans=0.125 2023-06-22 09:26:06,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-06-22 09:26:10,288 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-22 09:26:21,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=935778.0, ans=0.125 2023-06-22 09:26:30,610 INFO [train.py:996] (3/4) Epoch 6, batch 3500, loss[loss=0.2062, simple_loss=0.2677, pruned_loss=0.07239, over 21275.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3116, pruned_loss=0.08328, over 4281996.77 frames. ], batch size: 608, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:27:11,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=935898.0, ans=0.125 2023-06-22 09:27:33,649 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.716e+02 3.159e+02 3.539e+02 5.891e+02, threshold=6.318e+02, percent-clipped=0.0 2023-06-22 09:27:52,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=936018.0, ans=0.125 2023-06-22 09:28:44,940 INFO [train.py:996] (3/4) Epoch 6, batch 3550, loss[loss=0.2244, simple_loss=0.2912, pruned_loss=0.07878, over 21392.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3151, pruned_loss=0.08462, over 4287444.93 frames. ], batch size: 194, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:29:06,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-22 09:29:28,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=936258.0, ans=0.2 2023-06-22 09:30:51,132 INFO [train.py:996] (3/4) Epoch 6, batch 3600, loss[loss=0.2292, simple_loss=0.2973, pruned_loss=0.08061, over 21668.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3096, pruned_loss=0.08349, over 4288636.45 frames. ], batch size: 298, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:31:06,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=936438.0, ans=0.05 2023-06-22 09:31:44,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=936558.0, ans=0.0 2023-06-22 09:32:15,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.586e+02 3.004e+02 3.454e+02 6.703e+02, threshold=6.007e+02, percent-clipped=1.0 2023-06-22 09:32:22,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.75 vs. 
limit=22.5 2023-06-22 09:32:36,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=936678.0, ans=0.125 2023-06-22 09:33:23,798 INFO [train.py:996] (3/4) Epoch 6, batch 3650, loss[loss=0.1983, simple_loss=0.2828, pruned_loss=0.05692, over 21608.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3104, pruned_loss=0.08375, over 4283984.64 frames. ], batch size: 230, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:33:52,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=936798.0, ans=0.0 2023-06-22 09:35:30,885 INFO [train.py:996] (3/4) Epoch 6, batch 3700, loss[loss=0.2177, simple_loss=0.2956, pruned_loss=0.06988, over 21839.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3098, pruned_loss=0.08255, over 4288343.51 frames. ], batch size: 332, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:35:50,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=937038.0, ans=0.125 2023-06-22 09:35:53,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=937038.0, ans=0.0 2023-06-22 09:36:35,270 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.535e+02 2.948e+02 3.482e+02 5.680e+02, threshold=5.896e+02, percent-clipped=0.0 2023-06-22 09:37:50,975 INFO [train.py:996] (3/4) Epoch 6, batch 3750, loss[loss=0.1939, simple_loss=0.254, pruned_loss=0.0669, over 21277.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3081, pruned_loss=0.08215, over 4296162.13 frames. ], batch size: 549, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:38:49,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=937458.0, ans=0.025 2023-06-22 09:39:24,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=937518.0, ans=0.95 2023-06-22 09:39:40,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=937578.0, ans=0.125 2023-06-22 09:39:40,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937578.0, ans=0.1 2023-06-22 09:40:15,567 INFO [train.py:996] (3/4) Epoch 6, batch 3800, loss[loss=0.2239, simple_loss=0.2992, pruned_loss=0.07427, over 21815.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3054, pruned_loss=0.08034, over 4294506.42 frames. ], batch size: 247, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:40:59,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=937698.0, ans=10.0 2023-06-22 09:41:00,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=937698.0, ans=0.09899494936611666 2023-06-22 09:41:02,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=937698.0, ans=0.125 2023-06-22 09:41:05,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=937758.0, ans=0.2 2023-06-22 09:41:05,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. 
limit=22.5 2023-06-22 09:41:08,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=937758.0, ans=0.0 2023-06-22 09:41:16,576 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 2.491e+02 2.732e+02 3.396e+02 7.408e+02, threshold=5.464e+02, percent-clipped=3.0 2023-06-22 09:41:46,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=937818.0, ans=0.0 2023-06-22 09:42:20,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937878.0, ans=0.1 2023-06-22 09:42:24,413 INFO [train.py:996] (3/4) Epoch 6, batch 3850, loss[loss=0.2272, simple_loss=0.2901, pruned_loss=0.08213, over 21875.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3051, pruned_loss=0.08087, over 4279536.75 frames. ], batch size: 107, lr: 5.28e-03, grad_scale: 16.0 2023-06-22 09:43:09,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-22 09:43:23,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=938058.0, ans=0.125 2023-06-22 09:43:28,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=938058.0, ans=0.2 2023-06-22 09:43:58,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=938118.0, ans=0.0 2023-06-22 09:44:22,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.27 vs. limit=12.0 2023-06-22 09:44:27,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=938178.0, ans=0.125 2023-06-22 09:44:44,214 INFO [train.py:996] (3/4) Epoch 6, batch 3900, loss[loss=0.2246, simple_loss=0.2869, pruned_loss=0.08118, over 21594.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3007, pruned_loss=0.08062, over 4274441.79 frames. ], batch size: 212, lr: 5.28e-03, grad_scale: 16.0 2023-06-22 09:45:53,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.660e+02 3.085e+02 3.718e+02 7.121e+02, threshold=6.170e+02, percent-clipped=3.0 2023-06-22 09:46:01,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=938418.0, ans=0.125 2023-06-22 09:47:02,407 INFO [train.py:996] (3/4) Epoch 6, batch 3950, loss[loss=0.1653, simple_loss=0.2476, pruned_loss=0.04143, over 21479.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3029, pruned_loss=0.07989, over 4274665.62 frames. ], batch size: 212, lr: 5.28e-03, grad_scale: 16.0 2023-06-22 09:47:37,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-22 09:47:51,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=938658.0, ans=0.125 2023-06-22 09:48:25,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=938718.0, ans=0.1 2023-06-22 09:49:05,508 INFO [train.py:996] (3/4) Epoch 6, batch 4000, loss[loss=0.2079, simple_loss=0.2712, pruned_loss=0.0723, over 21286.00 frames. 
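The grad_scale field that alternates between 16.0 and 32.0 across these records is the fp16 loss-scaling factor: it is halved when a step overflows and grown back after a run of stable steps. The standard PyTorch mechanism looks roughly like this; the training-step shape is generic, not this project's loop:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

    def training_step(model, optimizer, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():      # fp16 forward pass
            loss = model(batch)              # assumed scalar loss
        scaler.scale(loss).backward()        # backward on the scaled loss
        scaler.step(optimizer)               # skipped on inf/nan gradients
        scaler.update()                      # halve on overflow, else grow
        return loss.detach(), scaler.get_scale()   # the logged grad_scale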
], tot_loss[loss=0.2235, simple_loss=0.2953, pruned_loss=0.07585, over 4275212.64 frames. ], batch size: 144, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:50:08,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=938958.0, ans=0.5 2023-06-22 09:50:15,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.398e+02 2.701e+02 3.226e+02 5.808e+02, threshold=5.402e+02, percent-clipped=0.0 2023-06-22 09:51:00,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-22 09:51:27,237 INFO [train.py:996] (3/4) Epoch 6, batch 4050, loss[loss=0.2725, simple_loss=0.3324, pruned_loss=0.1063, over 21617.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2937, pruned_loss=0.07395, over 4278863.30 frames. ], batch size: 507, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:51:27,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=939138.0, ans=0.1 2023-06-22 09:52:02,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=939198.0, ans=0.125 2023-06-22 09:52:23,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.05 vs. limit=12.0 2023-06-22 09:52:30,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=939258.0, ans=0.125 2023-06-22 09:53:15,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=939318.0, ans=0.09899494936611666 2023-06-22 09:53:16,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=939318.0, ans=0.0 2023-06-22 09:53:40,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=939378.0, ans=0.125 2023-06-22 09:53:58,557 INFO [train.py:996] (3/4) Epoch 6, batch 4100, loss[loss=0.1971, simple_loss=0.2758, pruned_loss=0.05923, over 21543.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2956, pruned_loss=0.07428, over 4279413.21 frames. ], batch size: 212, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:55:09,874 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.299e+02 2.709e+02 3.070e+02 5.765e+02, threshold=5.418e+02, percent-clipped=2.0 2023-06-22 09:55:23,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=12.0 2023-06-22 09:55:26,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=939618.0, ans=0.125 2023-06-22 09:55:45,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=939618.0, ans=0.0 2023-06-22 09:55:46,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. 
limit=6.0 2023-06-22 09:56:09,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=939738.0, ans=0.025 2023-06-22 09:56:10,776 INFO [train.py:996] (3/4) Epoch 6, batch 4150, loss[loss=0.2029, simple_loss=0.2929, pruned_loss=0.05649, over 21661.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.296, pruned_loss=0.07249, over 4279546.72 frames. ], batch size: 263, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:56:17,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=939738.0, ans=0.1 2023-06-22 09:57:29,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=939918.0, ans=0.125 2023-06-22 09:57:39,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=939918.0, ans=0.0 2023-06-22 09:58:10,165 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-22 09:58:11,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=939978.0, ans=0.125 2023-06-22 09:58:24,500 INFO [train.py:996] (3/4) Epoch 6, batch 4200, loss[loss=0.1991, simple_loss=0.2764, pruned_loss=0.06088, over 15633.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2958, pruned_loss=0.07128, over 4267790.77 frames. ], batch size: 61, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 09:59:29,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-22 09:59:29,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-22 09:59:32,816 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.356e+02 2.693e+02 3.335e+02 6.713e+02, threshold=5.385e+02, percent-clipped=2.0 2023-06-22 10:00:45,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=940278.0, ans=0.0 2023-06-22 10:00:52,422 INFO [train.py:996] (3/4) Epoch 6, batch 4250, loss[loss=0.1968, simple_loss=0.2645, pruned_loss=0.06457, over 21224.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3029, pruned_loss=0.07354, over 4270508.49 frames. ], batch size: 176, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:00:54,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=940338.0, ans=0.07 2023-06-22 10:01:53,627 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=12.0 2023-06-22 10:02:47,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=940578.0, ans=0.1 2023-06-22 10:03:09,329 INFO [train.py:996] (3/4) Epoch 6, batch 4300, loss[loss=0.2888, simple_loss=0.3795, pruned_loss=0.09907, over 21473.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3107, pruned_loss=0.07582, over 4269715.28 frames. ], batch size: 471, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:04:06,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.41 vs. 
limit=22.5 2023-06-22 10:04:43,297 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.946e+02 3.391e+02 4.074e+02 6.738e+02, threshold=6.781e+02, percent-clipped=7.0 2023-06-22 10:05:15,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=940878.0, ans=0.125 2023-06-22 10:05:35,229 INFO [train.py:996] (3/4) Epoch 6, batch 4350, loss[loss=0.1863, simple_loss=0.2523, pruned_loss=0.06016, over 21543.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3079, pruned_loss=0.07535, over 4259051.14 frames. ], batch size: 247, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:05:58,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=940998.0, ans=0.2 2023-06-22 10:06:19,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=940998.0, ans=0.125 2023-06-22 10:07:48,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-22 10:07:52,256 INFO [train.py:996] (3/4) Epoch 6, batch 4400, loss[loss=0.2116, simple_loss=0.2981, pruned_loss=0.06257, over 21378.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.304, pruned_loss=0.07478, over 4259662.71 frames. ], batch size: 176, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:08:05,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=941238.0, ans=0.2 2023-06-22 10:08:29,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=941298.0, ans=0.0 2023-06-22 10:09:18,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.562e+02 2.802e+02 3.458e+02 5.737e+02, threshold=5.605e+02, percent-clipped=0.0 2023-06-22 10:09:52,555 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:10:00,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=941478.0, ans=0.125 2023-06-22 10:10:17,113 INFO [train.py:996] (3/4) Epoch 6, batch 4450, loss[loss=0.2256, simple_loss=0.3136, pruned_loss=0.06874, over 21570.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.311, pruned_loss=0.07629, over 4261494.49 frames. ], batch size: 230, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:10:42,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=941538.0, ans=0.0 2023-06-22 10:11:04,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=941658.0, ans=0.125 2023-06-22 10:12:30,719 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:12:37,936 INFO [train.py:996] (3/4) Epoch 6, batch 4500, loss[loss=0.2346, simple_loss=0.3302, pruned_loss=0.06954, over 20116.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3121, pruned_loss=0.07772, over 4275785.35 frames. 
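The WithLoss records attach an auxiliary penalty directly to a tensor inside the network, here the attention weights, and report its sum for the interval (loss-sum=0.000e+00 means the penalty never fired). A sketch of that pattern, with a module passing activations through unchanged while accumulating the penalty; the class, the example penalty, and the trainer contract are all assumptions:

    import torch

    class WithLoss(torch.nn.Module):
        """Identity on `x` that accumulates a penalty on it; assumes the
        trainer adds `aux_loss` to the objective and then resets it."""

        def __init__(self, name: str, penalty_fn):
            super().__init__()
            self.name = name
            self.penalty_fn = penalty_fn
            self.aux_loss = 0.0

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            if self.training:
                penalty = self.penalty_fn(x)
                self.aux_loss = self.aux_loss + penalty
                print(f"WithLoss: name={self.name}, "
                      f"loss-sum={float(penalty):.3e}")
            return x

    # e.g. penalize attention distributions that collapse onto one frame
    attn_reg = WithLoss(
        "self_attn_weights",
        lambda w: (w.max(dim=-1).values - 0.99).clamp(min=0).sum(),
    )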
], batch size: 702, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:13:11,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=941898.0, ans=0.125 2023-06-22 10:13:19,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=941898.0, ans=0.0 2023-06-22 10:13:22,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=941958.0, ans=0.125 2023-06-22 10:13:55,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.436e+02 2.759e+02 3.510e+02 5.897e+02, threshold=5.518e+02, percent-clipped=2.0 2023-06-22 10:14:24,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=942078.0, ans=10.0 2023-06-22 10:15:15,320 INFO [train.py:996] (3/4) Epoch 6, batch 4550, loss[loss=0.2705, simple_loss=0.3418, pruned_loss=0.09961, over 21329.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3154, pruned_loss=0.079, over 4279502.61 frames. ], batch size: 548, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:15:47,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=942198.0, ans=12.0 2023-06-22 10:15:57,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=942258.0, ans=0.2 2023-06-22 10:16:12,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=942258.0, ans=0.125 2023-06-22 10:16:56,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=942378.0, ans=0.125 2023-06-22 10:16:56,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=942378.0, ans=0.1 2023-06-22 10:17:15,168 INFO [train.py:996] (3/4) Epoch 6, batch 4600, loss[loss=0.2303, simple_loss=0.3167, pruned_loss=0.07197, over 21650.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3183, pruned_loss=0.08068, over 4283643.88 frames. ], batch size: 389, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:18:08,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-22 10:18:38,667 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.628e+02 3.061e+02 3.498e+02 7.398e+02, threshold=6.122e+02, percent-clipped=1.0 2023-06-22 10:19:41,646 INFO [train.py:996] (3/4) Epoch 6, batch 4650, loss[loss=0.1544, simple_loss=0.2328, pruned_loss=0.03796, over 21316.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3113, pruned_loss=0.07884, over 4292096.13 frames. 
], batch size: 176, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:19:46,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=942738.0, ans=0.025 2023-06-22 10:21:02,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942918.0, ans=0.1 2023-06-22 10:21:43,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=942978.0, ans=0.025 2023-06-22 10:21:43,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942978.0, ans=0.1 2023-06-22 10:21:47,935 INFO [train.py:996] (3/4) Epoch 6, batch 4700, loss[loss=0.2246, simple_loss=0.2707, pruned_loss=0.08928, over 20085.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3008, pruned_loss=0.07614, over 4280941.80 frames. ], batch size: 707, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:22:18,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-22 10:22:47,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=943158.0, ans=0.0 2023-06-22 10:22:57,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.395e+02 2.699e+02 3.102e+02 5.296e+02, threshold=5.398e+02, percent-clipped=0.0 2023-06-22 10:23:59,399 INFO [train.py:996] (3/4) Epoch 6, batch 4750, loss[loss=0.2297, simple_loss=0.2973, pruned_loss=0.08103, over 22049.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2951, pruned_loss=0.0761, over 4282824.72 frames. ], batch size: 119, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:25:03,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=943458.0, ans=0.0 2023-06-22 10:25:42,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=943578.0, ans=0.0 2023-06-22 10:26:15,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=943638.0, ans=0.2 2023-06-22 10:26:16,868 INFO [train.py:996] (3/4) Epoch 6, batch 4800, loss[loss=0.2126, simple_loss=0.2928, pruned_loss=0.06619, over 21744.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2964, pruned_loss=0.07666, over 4293027.71 frames. ], batch size: 247, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:26:18,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=943638.0, ans=0.0 2023-06-22 10:27:07,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=943698.0, ans=0.1 2023-06-22 10:27:21,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. 
limit=22.5 2023-06-22 10:27:31,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.729e+02 2.951e+02 3.440e+02 4.423e+02, threshold=5.901e+02, percent-clipped=0.0 2023-06-22 10:28:05,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=943878.0, ans=0.125 2023-06-22 10:28:26,340 INFO [train.py:996] (3/4) Epoch 6, batch 4850, loss[loss=0.2253, simple_loss=0.2923, pruned_loss=0.07914, over 21674.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2974, pruned_loss=0.0764, over 4295689.08 frames. ], batch size: 230, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:29:21,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=22.5 2023-06-22 10:29:22,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=944058.0, ans=15.0 2023-06-22 10:29:24,627 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-22 10:29:59,982 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:30:39,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=944178.0, ans=0.125 2023-06-22 10:30:44,930 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-22 10:30:48,617 INFO [train.py:996] (3/4) Epoch 6, batch 4900, loss[loss=0.2494, simple_loss=0.3269, pruned_loss=0.08599, over 21292.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3002, pruned_loss=0.07773, over 4299962.65 frames. ], batch size: 159, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:31:10,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-22 10:31:10,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=22.5 2023-06-22 10:32:13,618 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.505e+02 2.707e+02 3.125e+02 4.814e+02, threshold=5.414e+02, percent-clipped=0.0 2023-06-22 10:32:15,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=944418.0, ans=0.125 2023-06-22 10:33:16,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=944478.0, ans=0.125 2023-06-22 10:33:20,828 INFO [train.py:996] (3/4) Epoch 6, batch 4950, loss[loss=0.2048, simple_loss=0.3053, pruned_loss=0.05214, over 21636.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3054, pruned_loss=0.07613, over 4295822.47 frames. ], batch size: 414, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:33:27,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=944538.0, ans=0.125 2023-06-22 10:34:36,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.48 vs. 
limit=12.0
2023-06-22 10:34:39,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=944718.0, ans=0.125
2023-06-22 10:35:01,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0
2023-06-22 10:35:02,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=944718.0, ans=0.025
2023-06-22 10:35:31,666 INFO [train.py:996] (3/4) Epoch 6, batch 5000, loss[loss=0.2382, simple_loss=0.3144, pruned_loss=0.08099, over 21753.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3038, pruned_loss=0.07287, over 4290675.73 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 16.0
2023-06-22 10:36:52,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.481e+02 2.862e+02 3.375e+02 4.928e+02, threshold=5.725e+02, percent-clipped=0.0
2023-06-22 10:37:01,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=945018.0, ans=0.04949747468305833
2023-06-22 10:37:47,853 INFO [train.py:996] (3/4) Epoch 6, batch 5050, loss[loss=0.2184, simple_loss=0.2916, pruned_loss=0.07259, over 21433.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.304, pruned_loss=0.07456, over 4289094.60 frames. ], batch size: 194, lr: 5.26e-03, grad_scale: 16.0
2023-06-22 10:38:49,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=945258.0, ans=0.125
2023-06-22 10:39:05,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=945318.0, ans=0.1
2023-06-22 10:39:10,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=945318.0, ans=0.0
2023-06-22 10:39:39,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=945378.0, ans=0.2
2023-06-22 10:39:42,073 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 10:40:04,975 INFO [train.py:996] (3/4) Epoch 6, batch 5100, loss[loss=0.2524, simple_loss=0.3207, pruned_loss=0.09205, over 21776.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3021, pruned_loss=0.07486, over 4294248.79 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 16.0
2023-06-22 10:40:08,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0
2023-06-22 10:41:18,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.681e+02 3.101e+02 3.777e+02 6.060e+02, threshold=6.201e+02, percent-clipped=2.0
2023-06-22 10:41:22,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=945618.0, ans=0.125
2023-06-22 10:41:31,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=945618.0, ans=0.125
2023-06-22 10:41:33,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=945618.0, ans=0.0
2023-06-22 10:42:14,432 INFO [train.py:996] (3/4) Epoch 6, batch 5150, loss[loss=0.2078, simple_loss=0.2817, pruned_loss=0.0669, over 21729.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2992, pruned_loss=0.07494, over 4292974.74 frames. ], batch size: 247, lr: 5.26e-03, grad_scale: 16.0
2023-06-22 10:42:51,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=945798.0, ans=0.1
2023-06-22 10:43:02,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.28 vs. limit=10.0
2023-06-22 10:43:06,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=22.5
2023-06-22 10:43:10,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=945858.0, ans=0.0
2023-06-22 10:43:27,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=945858.0, ans=0.125
2023-06-22 10:43:46,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=945918.0, ans=0.125
2023-06-22 10:43:59,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=945918.0, ans=0.0
2023-06-22 10:44:33,440 INFO [train.py:996] (3/4) Epoch 6, batch 5200, loss[loss=0.2652, simple_loss=0.3645, pruned_loss=0.08294, over 21208.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3023, pruned_loss=0.07611, over 4289657.82 frames. ], batch size: 548, lr: 5.26e-03, grad_scale: 32.0
2023-06-22 10:45:40,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=946158.0, ans=0.125
2023-06-22 10:45:56,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.577e+02 3.076e+02 3.772e+02 6.113e+02, threshold=6.153e+02, percent-clipped=0.0
2023-06-22 10:46:50,502 INFO [train.py:996] (3/4) Epoch 6, batch 5250, loss[loss=0.2065, simple_loss=0.2952, pruned_loss=0.05895, over 21408.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3071, pruned_loss=0.07531, over 4292012.92 frames. ], batch size: 211, lr: 5.26e-03, grad_scale: 32.0
2023-06-22 10:47:21,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=946338.0, ans=0.125
2023-06-22 10:47:32,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=946398.0, ans=0.0
2023-06-22 10:47:40,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=946398.0, ans=0.2
2023-06-22 10:48:04,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=946458.0, ans=0.05
2023-06-22 10:48:27,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=946518.0, ans=0.04949747468305833
2023-06-22 10:49:19,195 INFO [train.py:996] (3/4) Epoch 6, batch 5300, loss[loss=0.218, simple_loss=0.2879, pruned_loss=0.07406, over 21894.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3058, pruned_loss=0.07501, over 4296419.34 frames. ], batch size: 107, lr: 5.26e-03, grad_scale: 32.0
2023-06-22 10:49:19,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=946638.0, ans=0.0
2023-06-22 10:49:19,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=946638.0, ans=0.1
2023-06-22 10:50:30,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.544e+02 2.916e+02 3.415e+02 4.967e+02, threshold=5.832e+02, percent-clipped=0.0
2023-06-22 10:50:58,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=946818.0, ans=0.2
2023-06-22 10:51:16,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0
2023-06-22 10:51:25,705 INFO [train.py:996] (3/4) Epoch 6, batch 5350, loss[loss=0.2223, simple_loss=0.2991, pruned_loss=0.07275, over 21723.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3048, pruned_loss=0.07668, over 4300461.09 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 32.0
2023-06-22 10:52:25,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=946998.0, ans=0.2
2023-06-22 10:52:32,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=947058.0, ans=0.0
2023-06-22 10:52:37,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=947058.0, ans=0.0
2023-06-22 10:53:46,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=947178.0, ans=15.0
2023-06-22 10:54:00,184 INFO [train.py:996] (3/4) Epoch 6, batch 5400, loss[loss=0.2068, simple_loss=0.2909, pruned_loss=0.06139, over 21684.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3034, pruned_loss=0.07734, over 4289295.27 frames. ], batch size: 389, lr: 5.25e-03, grad_scale: 16.0
2023-06-22 10:54:15,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=947298.0, ans=0.1
2023-06-22 10:54:28,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=947298.0, ans=0.2
2023-06-22 10:55:21,794 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.657e+02 2.997e+02 3.766e+02 7.720e+02, threshold=5.994e+02, percent-clipped=1.0
2023-06-22 10:55:25,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=947418.0, ans=0.0
2023-06-22 10:56:03,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=947478.0, ans=0.125
2023-06-22 10:56:10,773 INFO [train.py:996] (3/4) Epoch 6, batch 5450, loss[loss=0.203, simple_loss=0.2916, pruned_loss=0.05716, over 21662.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3031, pruned_loss=0.07576, over 4295297.72 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 16.0
2023-06-22 10:56:31,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=947538.0, ans=0.0
2023-06-22 10:57:16,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0
2023-06-22 10:58:26,440 INFO [train.py:996] (3/4) Epoch 6, batch 5500, loss[loss=0.1968, simple_loss=0.2915, pruned_loss=0.05103, over 21435.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3079, pruned_loss=0.07346, over 4293937.35 frames. ], batch size: 211, lr: 5.25e-03, grad_scale: 16.0
2023-06-22 10:59:00,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=947838.0, ans=0.125
2023-06-22 10:59:25,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=947898.0, ans=0.125
2023-06-22 10:59:28,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=947898.0, ans=0.1
2023-06-22 10:59:45,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=947958.0, ans=0.125
2023-06-22 10:59:54,274 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.313e+02 2.665e+02 3.124e+02 5.281e+02, threshold=5.330e+02, percent-clipped=0.0
2023-06-22 11:00:47,694 INFO [train.py:996] (3/4) Epoch 6, batch 5550, loss[loss=0.2724, simple_loss=0.3621, pruned_loss=0.0913, over 21457.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3085, pruned_loss=0.07111, over 4291463.39 frames. ], batch size: 507, lr: 5.25e-03, grad_scale: 16.0
2023-06-22 11:01:16,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=948138.0, ans=0.0
2023-06-22 11:01:17,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=948138.0, ans=0.0
2023-06-22 11:02:51,119 INFO [train.py:996] (3/4) Epoch 6, batch 5600, loss[loss=0.1958, simple_loss=0.2994, pruned_loss=0.0461, over 21200.00 frames. ], tot_loss[loss=0.225, simple_loss=0.31, pruned_loss=0.06993, over 4287655.48 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0
2023-06-22 11:03:40,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=948498.0, ans=0.0
2023-06-22 11:04:16,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=22.5
2023-06-22 11:04:21,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.304e+02 2.857e+02 3.382e+02 5.869e+02, threshold=5.713e+02, percent-clipped=3.0
2023-06-22 11:04:52,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=948678.0, ans=0.0
2023-06-22 11:04:52,930 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=22.5
2023-06-22 11:05:06,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=948738.0, ans=0.0
2023-06-22 11:05:07,819 INFO [train.py:996] (3/4) Epoch 6, batch 5650, loss[loss=0.284, simple_loss=0.3767, pruned_loss=0.09559, over 21285.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3136, pruned_loss=0.07146, over 4286489.22 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0
2023-06-22 11:05:36,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0
2023-06-22 11:05:59,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=948798.0, ans=0.125
2023-06-22 11:06:01,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=948798.0, ans=0.125
2023-06-22 11:06:04,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=12.0
2023-06-22 11:06:57,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=948918.0, ans=0.125
2023-06-22 11:07:00,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=948978.0, ans=0.0
2023-06-22 11:07:31,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=948978.0, ans=0.0
2023-06-22 11:07:46,662 INFO [train.py:996] (3/4) Epoch 6, batch 5700, loss[loss=0.1939, simple_loss=0.2814, pruned_loss=0.05316, over 21633.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3124, pruned_loss=0.07288, over 4281030.42 frames. ], batch size: 230, lr: 5.25e-03, grad_scale: 32.0
2023-06-22 11:09:03,669 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.632e+02 3.011e+02 3.527e+02 5.738e+02, threshold=6.022e+02, percent-clipped=1.0
2023-06-22 11:09:20,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=949218.0, ans=0.125
2023-06-22 11:10:03,782 INFO [train.py:996] (3/4) Epoch 6, batch 5750, loss[loss=0.1803, simple_loss=0.2793, pruned_loss=0.04061, over 21741.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3054, pruned_loss=0.07014, over 4281899.18 frames. ], batch size: 332, lr: 5.25e-03, grad_scale: 32.0
2023-06-22 11:10:16,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0
2023-06-22 11:10:40,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=949398.0, ans=0.125
2023-06-22 11:11:58,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0
2023-06-22 11:12:21,459 INFO [train.py:996] (3/4) Epoch 6, batch 5800, loss[loss=0.213, simple_loss=0.3095, pruned_loss=0.05825, over 21662.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3038, pruned_loss=0.06864, over 4276965.60 frames. ], batch size: 230, lr: 5.25e-03, grad_scale: 32.0
2023-06-22 11:14:01,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.381e+02 2.868e+02 4.054e+02 6.693e+02, threshold=5.736e+02, percent-clipped=1.0
2023-06-22 11:14:58,419 INFO [train.py:996] (3/4) Epoch 6, batch 5850, loss[loss=0.1984, simple_loss=0.3209, pruned_loss=0.03801, over 21168.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.3024, pruned_loss=0.06498, over 4276678.63 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0
2023-06-22 11:15:17,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=949938.0, ans=0.2
2023-06-22 11:15:43,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0
2023-06-22 11:17:07,183 INFO [train.py:996] (3/4) Epoch 6, batch 5900, loss[loss=0.2189, simple_loss=0.294, pruned_loss=0.07193, over 21594.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2959, pruned_loss=0.05967, over 4277051.85 frames. ], batch size: 471, lr: 5.25e-03, grad_scale: 32.0
2023-06-22 11:18:14,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=950358.0, ans=0.0
2023-06-22 11:18:31,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=950358.0, ans=0.2
2023-06-22 11:18:39,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.954e+02 2.379e+02 3.002e+02 5.426e+02, threshold=4.759e+02, percent-clipped=0.0
2023-06-22 11:19:04,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.87 vs. limit=22.5
2023-06-22 11:19:05,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=950478.0, ans=0.125
2023-06-22 11:19:18,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0
2023-06-22 11:19:22,075 INFO [train.py:996] (3/4) Epoch 6, batch 5950, loss[loss=0.1935, simple_loss=0.2631, pruned_loss=0.062, over 22005.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.295, pruned_loss=0.06332, over 4285320.77 frames. ], batch size: 103, lr: 5.25e-03, grad_scale: 32.0
2023-06-22 11:20:39,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=950658.0, ans=0.1
2023-06-22 11:21:37,680 INFO [train.py:996] (3/4) Epoch 6, batch 6000, loss[loss=0.1913, simple_loss=0.2573, pruned_loss=0.06267, over 21655.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2915, pruned_loss=0.06742, over 4275735.17 frames. ], batch size: 264, lr: 5.24e-03, grad_scale: 32.0
2023-06-22 11:21:37,681 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-22 11:22:28,068 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.9439, 2.5529, 3.8421, 3.0113], device='cuda:3')
2023-06-22 11:22:41,435 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2615, simple_loss=0.3543, pruned_loss=0.08434, over 1796401.00 frames.
2023-06-22 11:22:41,437 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-22 11:23:08,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=950898.0, ans=0.2
2023-06-22 11:23:22,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5
2023-06-22 11:23:28,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=950958.0, ans=0.125
2023-06-22 11:23:42,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.25 vs. limit=22.5
2023-06-22 11:23:46,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=951018.0, ans=0.125
2023-06-22 11:23:47,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.609e+02 2.903e+02 3.362e+02 5.705e+02, threshold=5.807e+02, percent-clipped=2.0
2023-06-22 11:24:30,289 INFO [train.py:996] (3/4) Epoch 6, batch 6050, loss[loss=0.1691, simple_loss=0.2386, pruned_loss=0.04979, over 21421.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2868, pruned_loss=0.06821, over 4264274.22 frames. ], batch size: 195, lr: 5.24e-03, grad_scale: 32.0
2023-06-22 11:26:41,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=951438.0, ans=0.2
2023-06-22 11:26:42,251 INFO [train.py:996] (3/4) Epoch 6, batch 6100, loss[loss=0.2462, simple_loss=0.3211, pruned_loss=0.08559, over 21912.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.285, pruned_loss=0.0672, over 4267644.74 frames. ], batch size: 124, lr: 5.24e-03, grad_scale: 32.0
2023-06-22 11:26:46,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=951438.0, ans=0.125
2023-06-22 11:28:19,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 2.238e+02 2.454e+02 2.758e+02 3.934e+02, threshold=4.908e+02, percent-clipped=0.0
2023-06-22 11:28:38,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=951678.0, ans=0.125
2023-06-22 11:28:59,920 INFO [train.py:996] (3/4) Epoch 6, batch 6150, loss[loss=0.2099, simple_loss=0.2869, pruned_loss=0.06645, over 21535.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2881, pruned_loss=0.06965, over 4273071.33 frames. ], batch size: 389, lr: 5.24e-03, grad_scale: 32.0
2023-06-22 11:30:16,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=951858.0, ans=0.0
2023-06-22 11:31:17,600 INFO [train.py:996] (3/4) Epoch 6, batch 6200, loss[loss=0.2276, simple_loss=0.3168, pruned_loss=0.06922, over 21710.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.293, pruned_loss=0.07, over 4278833.22 frames. ], batch size: 414, lr: 5.24e-03, grad_scale: 32.0
2023-06-22 11:32:50,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.387e+02 2.806e+02 3.164e+02 6.088e+02, threshold=5.612e+02, percent-clipped=2.0
2023-06-22 11:33:45,687 INFO [train.py:996] (3/4) Epoch 6, batch 6250, loss[loss=0.2192, simple_loss=0.2746, pruned_loss=0.08193, over 20263.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2979, pruned_loss=0.0698, over 4276318.71 frames. ], batch size: 702, lr: 5.24e-03, grad_scale: 16.0
2023-06-22 11:33:57,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=952338.0, ans=0.0
2023-06-22 11:33:58,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0
2023-06-22 11:34:17,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=952398.0, ans=0.125
2023-06-22 11:34:50,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5
2023-06-22 11:35:02,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=952518.0, ans=0.0
2023-06-22 11:35:11,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=952518.0, ans=0.2
2023-06-22 11:35:13,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=952518.0, ans=10.0
2023-06-22 11:35:20,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0
2023-06-22 11:35:35,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=952578.0, ans=0.04949747468305833
2023-06-22 11:36:01,121 INFO [train.py:996] (3/4) Epoch 6, batch 6300, loss[loss=0.2197, simple_loss=0.2888, pruned_loss=0.07528, over 21772.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3011, pruned_loss=0.06921, over 4279463.98 frames. ], batch size: 247, lr: 5.24e-03, grad_scale: 16.0
2023-06-22 11:36:04,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=952638.0, ans=0.2
2023-06-22 11:36:07,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=952638.0, ans=0.2
2023-06-22 11:36:23,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=952698.0, ans=0.1
2023-06-22 11:36:25,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0
2023-06-22 11:37:18,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.436e+02 3.137e+02 3.711e+02 6.138e+02, threshold=6.275e+02, percent-clipped=3.0
2023-06-22 11:37:32,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=952818.0, ans=0.2
2023-06-22 11:38:10,558 INFO [train.py:996] (3/4) Epoch 6, batch 6350, loss[loss=0.2555, simple_loss=0.3327, pruned_loss=0.08918, over 21452.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3034, pruned_loss=0.07271, over 4283803.02 frames. ], batch size: 131, lr: 5.24e-03, grad_scale: 16.0
2023-06-22 11:38:46,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=952998.0, ans=0.125
2023-06-22 11:38:58,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952998.0, ans=0.1
2023-06-22 11:39:05,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0
2023-06-22 11:40:21,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=953178.0, ans=0.125
2023-06-22 11:40:25,779 INFO [train.py:996] (3/4) Epoch 6, batch 6400, loss[loss=0.2548, simple_loss=0.33, pruned_loss=0.08981, over 21315.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3096, pruned_loss=0.07763, over 4284992.66 frames. ], batch size: 143, lr: 5.24e-03, grad_scale: 32.0
2023-06-22 11:41:34,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=953358.0, ans=0.125
2023-06-22 11:42:05,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.28 vs. limit=15.0
2023-06-22 11:42:05,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.715e+02 2.941e+02 3.415e+02 4.411e+02, threshold=5.882e+02, percent-clipped=0.0
2023-06-22 11:42:49,199 INFO [train.py:996] (3/4) Epoch 6, batch 6450, loss[loss=0.2021, simple_loss=0.2817, pruned_loss=0.06123, over 21869.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3122, pruned_loss=0.07683, over 4287010.16 frames. ], batch size: 372, lr: 5.24e-03, grad_scale: 32.0
2023-06-22 11:44:06,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=953718.0, ans=0.05
2023-06-22 11:44:27,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0
2023-06-22 11:44:38,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=953778.0, ans=0.0
2023-06-22 11:45:02,722 INFO [train.py:996] (3/4) Epoch 6, batch 6500, loss[loss=0.1934, simple_loss=0.2559, pruned_loss=0.06549, over 21393.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3052, pruned_loss=0.07451, over 4284963.59 frames. ], batch size: 131, lr: 5.24e-03, grad_scale: 32.0
2023-06-22 11:46:23,197 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.489e+02 2.776e+02 3.304e+02 5.891e+02, threshold=5.553e+02, percent-clipped=1.0
2023-06-22 11:46:48,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=954018.0, ans=0.1
2023-06-22 11:46:52,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=954078.0, ans=0.125
2023-06-22 11:47:14,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=954078.0, ans=0.125
2023-06-22 11:47:16,597 INFO [train.py:996] (3/4) Epoch 6, batch 6550, loss[loss=0.2374, simple_loss=0.3087, pruned_loss=0.08302, over 21627.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3049, pruned_loss=0.07399, over 4284554.52 frames. ], batch size: 230, lr: 5.24e-03, grad_scale: 32.0
2023-06-22 11:47:27,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0
2023-06-22 11:47:27,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.23 vs. limit=15.0
2023-06-22 11:48:09,534 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 11:49:05,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=954378.0, ans=0.0
2023-06-22 11:49:17,332 INFO [train.py:996] (3/4) Epoch 6, batch 6600, loss[loss=0.1881, simple_loss=0.243, pruned_loss=0.06657, over 21237.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2992, pruned_loss=0.07291, over 4267750.98 frames. ], batch size: 548, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 11:49:50,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=954498.0, ans=0.0
2023-06-22 11:50:38,844 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.307e+02 2.568e+02 2.890e+02 5.547e+02, threshold=5.135e+02, percent-clipped=0.0
2023-06-22 11:51:31,384 INFO [train.py:996] (3/4) Epoch 6, batch 6650, loss[loss=0.1897, simple_loss=0.2533, pruned_loss=0.06309, over 21782.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2915, pruned_loss=0.07105, over 4267140.56 frames. ], batch size: 118, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 11:52:24,668 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0
2023-06-22 11:53:52,924 INFO [train.py:996] (3/4) Epoch 6, batch 6700, loss[loss=0.1912, simple_loss=0.2634, pruned_loss=0.05951, over 21509.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2854, pruned_loss=0.07071, over 4273168.44 frames. ], batch size: 212, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 11:54:32,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0
2023-06-22 11:54:50,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=955158.0, ans=0.0
2023-06-22 11:54:52,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=955158.0, ans=0.2
2023-06-22 11:55:20,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.337e+02 2.565e+02 3.029e+02 4.063e+02, threshold=5.130e+02, percent-clipped=0.0
2023-06-22 11:55:46,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=955278.0, ans=0.0
2023-06-22 11:56:01,265 INFO [train.py:996] (3/4) Epoch 6, batch 6750, loss[loss=0.243, simple_loss=0.3379, pruned_loss=0.07405, over 19817.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2829, pruned_loss=0.07107, over 4263083.08 frames. ], batch size: 703, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 11:57:25,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=955518.0, ans=0.125
2023-06-22 11:57:37,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=955518.0, ans=0.0
2023-06-22 11:58:11,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=955638.0, ans=0.125
2023-06-22 11:58:11,989 INFO [train.py:996] (3/4) Epoch 6, batch 6800, loss[loss=0.204, simple_loss=0.2733, pruned_loss=0.06735, over 21854.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2856, pruned_loss=0.07247, over 4271377.04 frames. ], batch size: 107, lr: 5.23e-03, grad_scale: 32.0
2023-06-22 11:59:05,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=955698.0, ans=0.0
2023-06-22 11:59:49,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.516e+02 2.922e+02 3.493e+02 5.598e+02, threshold=5.845e+02, percent-clipped=3.0
2023-06-22 12:00:24,112 INFO [train.py:996] (3/4) Epoch 6, batch 6850, loss[loss=0.2439, simple_loss=0.3016, pruned_loss=0.09312, over 21802.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2852, pruned_loss=0.07315, over 4267454.36 frames. ], batch size: 414, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 12:00:39,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=955938.0, ans=0.125
2023-06-22 12:01:48,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=956058.0, ans=0.0
2023-06-22 12:02:06,742 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 12:02:52,554 INFO [train.py:996] (3/4) Epoch 6, batch 6900, loss[loss=0.2235, simple_loss=0.2945, pruned_loss=0.0763, over 21526.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2877, pruned_loss=0.07363, over 4278306.89 frames. ], batch size: 131, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 12:02:57,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=956238.0, ans=0.125
2023-06-22 12:03:57,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=956298.0, ans=0.0
2023-06-22 12:04:40,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0
2023-06-22 12:04:50,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.414e+02 2.854e+02 3.721e+02 5.667e+02, threshold=5.709e+02, percent-clipped=0.0
2023-06-22 12:05:09,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0
2023-06-22 12:05:19,761 INFO [train.py:996] (3/4) Epoch 6, batch 6950, loss[loss=0.2875, simple_loss=0.3493, pruned_loss=0.1129, over 21444.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2903, pruned_loss=0.07068, over 4276397.05 frames. ], batch size: 471, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 12:05:58,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.82 vs. limit=15.0
2023-06-22 12:07:18,204 INFO [train.py:996] (3/4) Epoch 6, batch 7000, loss[loss=0.2102, simple_loss=0.2745, pruned_loss=0.07299, over 21627.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2924, pruned_loss=0.07342, over 4277929.51 frames. ], batch size: 298, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 12:07:19,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5
2023-06-22 12:08:14,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=956898.0, ans=0.0
2023-06-22 12:08:30,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=956958.0, ans=0.0
2023-06-22 12:08:48,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957018.0, ans=0.1
2023-06-22 12:08:59,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=15.0
2023-06-22 12:08:59,605 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 2.490e+02 2.766e+02 3.246e+02 6.090e+02, threshold=5.532e+02, percent-clipped=1.0
2023-06-22 12:09:37,434 INFO [train.py:996] (3/4) Epoch 6, batch 7050, loss[loss=0.1791, simple_loss=0.2474, pruned_loss=0.05537, over 16375.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2912, pruned_loss=0.07296, over 4266322.61 frames. ], batch size: 61, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 12:10:15,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957198.0, ans=0.1
2023-06-22 12:10:16,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957198.0, ans=0.1
2023-06-22 12:10:30,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=957198.0, ans=0.1
2023-06-22 12:10:42,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0
2023-06-22 12:11:17,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=957318.0, ans=0.0
2023-06-22 12:11:37,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=957378.0, ans=0.125
2023-06-22 12:11:55,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=957378.0, ans=0.0
2023-06-22 12:11:59,671 INFO [train.py:996] (3/4) Epoch 6, batch 7100, loss[loss=0.234, simple_loss=0.3105, pruned_loss=0.07875, over 20694.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2965, pruned_loss=0.07508, over 4270574.93 frames. ], batch size: 607, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 12:12:05,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=22.5
2023-06-22 12:13:26,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.296e+02 2.660e+02 3.134e+02 4.737e+02, threshold=5.321e+02, percent-clipped=0.0
2023-06-22 12:13:51,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=957678.0, ans=0.125
2023-06-22 12:14:15,126 INFO [train.py:996] (3/4) Epoch 6, batch 7150, loss[loss=0.2443, simple_loss=0.3149, pruned_loss=0.08684, over 21394.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2942, pruned_loss=0.07278, over 4274972.05 frames. ], batch size: 549, lr: 5.23e-03, grad_scale: 16.0
2023-06-22 12:14:19,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957738.0, ans=0.1
2023-06-22 12:14:24,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=957738.0, ans=0.125
2023-06-22 12:15:21,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=957858.0, ans=0.125
2023-06-22 12:16:01,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=957978.0, ans=0.0
2023-06-22 12:16:20,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=958038.0, ans=0.125
2023-06-22 12:16:21,674 INFO [train.py:996] (3/4) Epoch 6, batch 7200, loss[loss=0.2346, simple_loss=0.3135, pruned_loss=0.07782, over 21764.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2974, pruned_loss=0.07563, over 4277482.11 frames. ], batch size: 102, lr: 5.23e-03, grad_scale: 32.0
2023-06-22 12:16:26,058 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0
2023-06-22 12:17:38,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=12.0
2023-06-22 12:17:51,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.534e+02 2.859e+02 3.479e+02 6.830e+02, threshold=5.718e+02, percent-clipped=3.0
2023-06-22 12:18:19,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5
2023-06-22 12:18:29,956 INFO [train.py:996] (3/4) Epoch 6, batch 7250, loss[loss=0.2385, simple_loss=0.2772, pruned_loss=0.09984, over 21373.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2923, pruned_loss=0.07565, over 4277153.60 frames. ], batch size: 509, lr: 5.22e-03, grad_scale: 32.0
2023-06-22 12:18:40,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=958338.0, ans=0.0
2023-06-22 12:18:55,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0
2023-06-22 12:19:02,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0
2023-06-22 12:19:18,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=958458.0, ans=0.125
2023-06-22 12:19:49,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=958518.0, ans=0.0
2023-06-22 12:19:59,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=958518.0, ans=0.2
2023-06-22 12:20:38,077 INFO [train.py:996] (3/4) Epoch 6, batch 7300, loss[loss=0.1866, simple_loss=0.2498, pruned_loss=0.06166, over 21807.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2855, pruned_loss=0.07455, over 4278339.58 frames. ], batch size: 352, lr: 5.22e-03, grad_scale: 32.0
2023-06-22 12:21:13,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=958698.0, ans=0.0
2023-06-22 12:21:47,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0
2023-06-22 12:22:08,049 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.524e+02 2.909e+02 3.495e+02 5.392e+02, threshold=5.818e+02, percent-clipped=0.0
2023-06-22 12:22:12,660 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.29 vs. limit=10.0
2023-06-22 12:22:45,834 INFO [train.py:996] (3/4) Epoch 6, batch 7350, loss[loss=0.2505, simple_loss=0.3168, pruned_loss=0.09208, over 21408.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2845, pruned_loss=0.07478, over 4272218.39 frames. ], batch size: 549, lr: 5.22e-03, grad_scale: 16.0
2023-06-22 12:22:46,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=958938.0, ans=10.0
2023-06-22 12:22:48,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.83 vs. limit=15.0
2023-06-22 12:23:12,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=958998.0, ans=0.0
2023-06-22 12:23:12,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=958998.0, ans=0.125
2023-06-22 12:23:39,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=958998.0, ans=0.1
2023-06-22 12:23:41,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=958998.0, ans=0.2
2023-06-22 12:23:47,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=959058.0, ans=0.0
2023-06-22 12:23:51,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=959058.0, ans=0.125
2023-06-22 12:24:16,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=959118.0, ans=0.125
2023-06-22 12:24:43,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=959178.0, ans=0.1
2023-06-22 12:24:43,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=959178.0, ans=10.0
2023-06-22 12:25:17,235 INFO [train.py:996] (3/4) Epoch 6, batch 7400, loss[loss=0.2621, simple_loss=0.3476, pruned_loss=0.08827, over 21619.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2917, pruned_loss=0.07579, over 4273018.45 frames. ], batch size: 441, lr: 5.22e-03, grad_scale: 16.0
2023-06-22 12:26:46,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=959418.0, ans=10.0
2023-06-22 12:26:47,313 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.566e+02 3.026e+02 3.551e+02 6.030e+02, threshold=6.052e+02, percent-clipped=1.0
2023-06-22 12:26:58,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0
2023-06-22 12:27:08,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0
2023-06-22 12:27:30,396 INFO [train.py:996] (3/4) Epoch 6, batch 7450, loss[loss=0.1983, simple_loss=0.2627, pruned_loss=0.06695, over 21780.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2903, pruned_loss=0.07465, over 4264755.76 frames. ], batch size: 124, lr: 5.22e-03, grad_scale: 16.0
2023-06-22 12:27:32,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=959538.0, ans=0.0
2023-06-22 12:28:09,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=22.5
2023-06-22 12:29:38,270 INFO [train.py:996] (3/4) Epoch 6, batch 7500, loss[loss=0.2411, simple_loss=0.3489, pruned_loss=0.06664, over 21555.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2945, pruned_loss=0.07642, over 4269613.82 frames. ], batch size: 263, lr: 5.22e-03, grad_scale: 16.0
2023-06-22 12:30:41,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=959958.0, ans=0.0
2023-06-22 12:30:48,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=959958.0, ans=0.125
2023-06-22 12:30:52,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=959958.0, ans=0.125
2023-06-22 12:31:06,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=959958.0, ans=0.1
2023-06-22 12:31:28,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.801e+02 3.373e+02 4.203e+02 7.469e+02, threshold=6.746e+02, percent-clipped=3.0
2023-06-22 12:31:47,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=960078.0, ans=0.125
2023-06-22 12:32:01,774 INFO [train.py:996] (3/4) Epoch 6, batch 7550, loss[loss=0.2056, simple_loss=0.3217, pruned_loss=0.04471, over 20784.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3025, pruned_loss=0.07538, over 4268809.64 frames. ], batch size: 608, lr: 5.22e-03, grad_scale: 16.0
2023-06-22 12:32:52,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=960258.0, ans=0.125
2023-06-22 12:32:54,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=960258.0, ans=0.0
2023-06-22 12:33:02,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=960258.0, ans=0.0
2023-06-22 12:33:28,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=960318.0, ans=0.125
2023-06-22 12:33:28,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=960318.0, ans=0.5
2023-06-22 12:33:30,734 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=22.5
2023-06-22 12:34:08,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=960378.0, ans=0.2
2023-06-22 12:34:12,402 INFO [train.py:996] (3/4) Epoch 6, batch 7600, loss[loss=0.2028, simple_loss=0.2769, pruned_loss=0.06442, over 21895.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3013, pruned_loss=0.07478, over 4276816.30 frames. ], batch size: 351, lr: 5.22e-03, grad_scale: 32.0
2023-06-22 12:34:12,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=960438.0, ans=0.0
2023-06-22 12:34:38,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960498.0, ans=0.1
2023-06-22 12:34:49,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=960498.0, ans=0.125
2023-06-22 12:35:02,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960498.0, ans=0.1
2023-06-22 12:35:33,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960558.0, ans=0.1
2023-06-22 12:35:34,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=960558.0, ans=0.125
2023-06-22 12:35:59,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.444e+02 2.746e+02 3.359e+02 4.824e+02, threshold=5.491e+02, percent-clipped=0.0
2023-06-22 12:36:39,353 INFO [train.py:996] (3/4) Epoch 6, batch 7650, loss[loss=0.2204, simple_loss=0.2851, pruned_loss=0.07785, over 21597.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3012, pruned_loss=0.07679, over 4277749.32 frames. ], batch size: 212, lr: 5.22e-03, grad_scale: 32.0
2023-06-22 12:36:46,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=960738.0, ans=0.125
2023-06-22 12:38:40,601 INFO [train.py:996] (3/4) Epoch 6, batch 7700, loss[loss=0.193, simple_loss=0.2928, pruned_loss=0.04665, over 20703.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3044, pruned_loss=0.07904, over 4280172.89 frames. ], batch size: 608, lr: 5.22e-03, grad_scale: 32.0
2023-06-22 12:38:41,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=961038.0, ans=0.04949747468305833
2023-06-22 12:39:11,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=961038.0, ans=0.125
2023-06-22 12:39:41,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=961098.0, ans=0.1
2023-06-22 12:40:15,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.92 vs. limit=22.5
2023-06-22 12:40:18,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.597e+02 2.996e+02 3.499e+02 4.592e+02, threshold=5.993e+02, percent-clipped=0.0
2023-06-22 12:40:19,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0
2023-06-22 12:40:59,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=961278.0, ans=0.125
2023-06-22 12:41:02,103 INFO [train.py:996] (3/4) Epoch 6, batch 7750, loss[loss=0.2181, simple_loss=0.3028, pruned_loss=0.06672, over 21367.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3081, pruned_loss=0.0782, over 4272525.08 frames. ], batch size: 131, lr: 5.22e-03, grad_scale: 32.0
2023-06-22 12:41:26,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=961338.0, ans=0.125
2023-06-22 12:42:12,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=961458.0, ans=0.0
2023-06-22 12:42:23,312 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. limit=12.0
2023-06-22 12:42:56,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=961578.0, ans=0.2
2023-06-22 12:43:18,984 INFO [train.py:996] (3/4) Epoch 6, batch 7800, loss[loss=0.1858, simple_loss=0.2369, pruned_loss=0.0673, over 21360.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3096, pruned_loss=0.07885, over 4274239.17 frames. ], batch size: 131, lr: 5.22e-03, grad_scale: 32.0
2023-06-22 12:43:24,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=961638.0, ans=0.125
2023-06-22 12:44:29,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=961818.0, ans=0.125
2023-06-22 12:44:40,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.813e+02 3.314e+02 4.119e+02 8.453e+02, threshold=6.627e+02, percent-clipped=5.0
2023-06-22 12:44:41,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=961818.0, ans=0.125
2023-06-22 12:44:49,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.42 vs. limit=15.0
2023-06-22 12:44:52,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=961878.0, ans=0.125
2023-06-22 12:45:23,682 INFO [train.py:996] (3/4) Epoch 6, batch 7850, loss[loss=0.214, simple_loss=0.2807, pruned_loss=0.07362, over 21795.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3017, pruned_loss=0.0775, over 4280197.29 frames. ], batch size: 372, lr: 5.21e-03, grad_scale: 32.0
2023-06-22 12:46:22,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=962058.0, ans=0.05
2023-06-22 12:46:31,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5
2023-06-22 12:47:21,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=962178.0, ans=0.125
2023-06-22 12:47:34,152 INFO [train.py:996] (3/4) Epoch 6, batch 7900, loss[loss=0.2772, simple_loss=0.3737, pruned_loss=0.09038, over 21639.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2966, pruned_loss=0.07694, over 4276965.41 frames. ], batch size: 441, lr: 5.21e-03, grad_scale: 32.0
2023-06-22 12:48:08,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=962298.0, ans=0.125
2023-06-22 12:49:17,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.834e+02 3.343e+02 3.831e+02 7.219e+02, threshold=6.686e+02, percent-clipped=3.0
2023-06-22 12:50:09,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=962538.0, ans=0.125
2023-06-22 12:50:10,557 INFO [train.py:996] (3/4) Epoch 6, batch 7950, loss[loss=0.2453, simple_loss=0.3241, pruned_loss=0.08324, over 21874.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3043, pruned_loss=0.07709, over 4280736.82 frames. ], batch size: 371, lr: 5.21e-03, grad_scale: 16.0
2023-06-22 12:50:55,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=962598.0, ans=0.0
2023-06-22 12:51:13,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=962658.0, ans=10.0
2023-06-22 12:52:48,127 INFO [train.py:996] (3/4) Epoch 6, batch 8000, loss[loss=0.2639, simple_loss=0.3432, pruned_loss=0.09232, over 21182.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.308, pruned_loss=0.07941, over 4275008.53 frames. ], batch size: 143, lr: 5.21e-03, grad_scale: 32.0
2023-06-22 12:53:26,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=962958.0, ans=0.2
2023-06-22 12:54:13,208 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0
2023-06-22 12:54:14,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=963018.0, ans=0.0
2023-06-22 12:54:34,223 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.700e+02 3.556e+02 4.480e+02 7.069e+02, threshold=7.112e+02, percent-clipped=3.0
2023-06-22 12:55:13,308 INFO [train.py:996] (3/4) Epoch 6, batch 8050, loss[loss=0.2179, simple_loss=0.2795, pruned_loss=0.07816, over 21244.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3105, pruned_loss=0.07911, over 4271993.83 frames. ], batch size: 607, lr: 5.21e-03, grad_scale: 32.0
2023-06-22 12:56:27,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=963318.0, ans=0.1
2023-06-22 12:56:35,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.58 vs. limit=15.0
2023-06-22 12:56:58,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=963318.0, ans=0.125
2023-06-22 12:57:34,232 INFO [train.py:996] (3/4) Epoch 6, batch 8100, loss[loss=0.2006, simple_loss=0.2699, pruned_loss=0.06561, over 21659.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3092, pruned_loss=0.07978, over 4278435.18 frames. ], batch size: 263, lr: 5.21e-03, grad_scale: 32.0
2023-06-22 12:58:33,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=963498.0, ans=0.2
2023-06-22 12:59:45,220 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 2.718e+02 3.272e+02 3.962e+02 1.016e+03, threshold=6.543e+02, percent-clipped=3.0
2023-06-22 12:59:57,264 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 12:59:58,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=963678.0, ans=0.125
2023-06-22 13:00:20,332 INFO [train.py:996] (3/4) Epoch 6, batch 8150, loss[loss=0.3274, simple_loss=0.4118, pruned_loss=0.1215, over 21484.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3166, pruned_loss=0.08091, over 4278988.80 frames. ], batch size: 507, lr: 5.21e-03, grad_scale: 16.0
2023-06-22 13:01:47,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=963918.0, ans=0.07
2023-06-22 13:02:12,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0
2023-06-22 13:02:29,435 INFO [train.py:996] (3/4) Epoch 6, batch 8200, loss[loss=0.1845, simple_loss=0.2507, pruned_loss=0.05913, over 21347.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3096, pruned_loss=0.07858, over 4269131.83 frames. ], batch size: 131, lr: 5.21e-03, grad_scale: 16.0
2023-06-22 13:02:55,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=964038.0, ans=0.0
2023-06-22 13:02:59,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=964098.0, ans=0.05
2023-06-22 13:03:49,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=964158.0, ans=0.0
2023-06-22 13:04:16,274 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.425e+02 2.871e+02 3.494e+02 6.098e+02, threshold=5.742e+02, percent-clipped=0.0
2023-06-22 13:04:59,538 INFO [train.py:996] (3/4) Epoch 6, batch 8250, loss[loss=0.2447, simple_loss=0.3462, pruned_loss=0.07157, over 20770.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3068, pruned_loss=0.0778, over 4270356.77 frames. ], batch size: 607, lr: 5.21e-03, grad_scale: 16.0
2023-06-22 13:05:18,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=15.0
2023-06-22 13:07:14,453 INFO [train.py:996] (3/4) Epoch 6, batch 8300, loss[loss=0.1874, simple_loss=0.2697, pruned_loss=0.05251, over 21379.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3069, pruned_loss=0.07506, over 4270446.36 frames. ], batch size: 194, lr: 5.21e-03, grad_scale: 16.0
2023-06-22 13:07:32,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=964698.0, ans=0.125
2023-06-22 13:08:08,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=964758.0, ans=0.125
2023-06-22 13:08:18,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=964758.0, ans=0.125
2023-06-22 13:08:40,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.450e+02 2.816e+02 3.478e+02 6.310e+02, threshold=5.632e+02, percent-clipped=2.0
2023-06-22 13:09:32,563 INFO [train.py:996] (3/4) Epoch 6, batch 8350, loss[loss=0.2017, simple_loss=0.2922, pruned_loss=0.05561, over 21561.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3055, pruned_loss=0.07277, over 4274214.43 frames. ], batch size: 195, lr: 5.21e-03, grad_scale: 16.0
2023-06-22 13:09:38,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=964938.0, ans=0.125
2023-06-22 13:09:44,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=964938.0, ans=0.125
2023-06-22 13:10:08,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=964998.0, ans=0.0
2023-06-22 13:11:45,369 INFO [train.py:996] (3/4) Epoch 6, batch 8400, loss[loss=0.1633, simple_loss=0.2508, pruned_loss=0.03789, over 21162.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3045, pruned_loss=0.07168, over 4269954.65 frames. ], batch size: 176, lr: 5.21e-03, grad_scale: 32.0
2023-06-22 13:11:52,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=965238.0, ans=0.04949747468305833
2023-06-22 13:12:16,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=965298.0, ans=0.125
2023-06-22 13:12:35,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0
2023-06-22 13:12:51,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=965358.0, ans=0.1
2023-06-22 13:13:12,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=965418.0, ans=0.125
2023-06-22 13:13:13,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=965418.0, ans=0.025
2023-06-22 13:13:14,774 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.764e+02 2.326e+02 2.586e+02 3.002e+02 4.637e+02, threshold=5.171e+02, percent-clipped=0.0
2023-06-22 13:13:50,941 INFO [train.py:996] (3/4) Epoch 6, batch 8450, loss[loss=0.2397, simple_loss=0.2989, pruned_loss=0.09019, over 21650.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3016, pruned_loss=0.07047, over 4277481.93 frames. ], batch size: 389, lr: 5.20e-03, grad_scale: 32.0
2023-06-22 13:14:05,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=965538.0, ans=0.0
2023-06-22 13:14:40,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0
2023-06-22 13:15:25,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=965778.0, ans=0.04949747468305833
2023-06-22 13:15:28,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=965778.0, ans=0.035
2023-06-22 13:15:48,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=965778.0, ans=0.125
2023-06-22 13:16:01,017 INFO [train.py:996] (3/4) Epoch 6, batch 8500, loss[loss=0.1925, simple_loss=0.2583, pruned_loss=0.06331, over 21429.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2989, pruned_loss=0.07197, over 4283476.81 frames. ], batch size: 194, lr: 5.20e-03, grad_scale: 32.0
2023-06-22 13:16:53,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5
2023-06-22 13:17:12,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0
2023-06-22 13:17:26,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=966018.0, ans=0.2
2023-06-22 13:17:48,959 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.784e+02 3.161e+02 3.760e+02 5.772e+02, threshold=6.322e+02, percent-clipped=2.0
2023-06-22 13:18:27,236 INFO [train.py:996] (3/4) Epoch 6, batch 8550, loss[loss=0.2356, simple_loss=0.2944, pruned_loss=0.08837, over 20053.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3017, pruned_loss=0.07475, over 4284022.81 frames. ], batch size: 702, lr: 5.20e-03, grad_scale: 32.0
2023-06-22 13:18:55,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=966198.0, ans=0.125
2023-06-22 13:19:03,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=966198.0, ans=0.125
2023-06-22 13:19:56,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0
2023-06-22 13:20:33,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=966378.0, ans=0.0
2023-06-22 13:20:36,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=966378.0, ans=0.2
2023-06-22 13:21:01,120 INFO [train.py:996] (3/4) Epoch 6, batch 8600, loss[loss=0.2873, simple_loss=0.3545, pruned_loss=0.11, over 21431.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.307, pruned_loss=0.07705, over 4280375.99 frames. ], batch size: 471, lr: 5.20e-03, grad_scale: 32.0
2023-06-22 13:22:12,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.59 vs.
limit=12.0 2023-06-22 13:22:56,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.812e+02 3.242e+02 4.124e+02 6.124e+02, threshold=6.484e+02, percent-clipped=0.0 2023-06-22 13:22:56,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=966618.0, ans=0.125 2023-06-22 13:23:27,232 INFO [train.py:996] (3/4) Epoch 6, batch 8650, loss[loss=0.2633, simple_loss=0.3517, pruned_loss=0.08746, over 21477.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3132, pruned_loss=0.07846, over 4278664.83 frames. ], batch size: 507, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:23:57,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=966798.0, ans=0.0 2023-06-22 13:24:25,379 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:24:37,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=966918.0, ans=0.125 2023-06-22 13:24:45,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.79 vs. limit=10.0 2023-06-22 13:25:01,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=966978.0, ans=0.125 2023-06-22 13:25:09,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=966978.0, ans=0.125 2023-06-22 13:25:11,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=966978.0, ans=0.1 2023-06-22 13:25:25,290 INFO [train.py:996] (3/4) Epoch 6, batch 8700, loss[loss=0.2231, simple_loss=0.2763, pruned_loss=0.08497, over 21231.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3043, pruned_loss=0.07517, over 4280525.52 frames. ], batch size: 471, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:26:52,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=967218.0, ans=0.0 2023-06-22 13:26:56,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 2.253e+02 2.607e+02 2.953e+02 4.706e+02, threshold=5.214e+02, percent-clipped=0.0 2023-06-22 13:27:27,310 INFO [train.py:996] (3/4) Epoch 6, batch 8750, loss[loss=0.2234, simple_loss=0.2843, pruned_loss=0.08125, over 21310.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2998, pruned_loss=0.07536, over 4278154.91 frames. ], batch size: 176, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:27:28,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-22 13:28:14,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=967398.0, ans=0.07 2023-06-22 13:29:59,528 INFO [train.py:996] (3/4) Epoch 6, batch 8800, loss[loss=0.2873, simple_loss=0.3668, pruned_loss=0.104, over 21730.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3076, pruned_loss=0.0781, over 4278689.12 frames. 
], batch size: 441, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:30:04,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=967638.0, ans=0.5 2023-06-22 13:30:07,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=967638.0, ans=0.1 2023-06-22 13:31:44,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=967818.0, ans=0.0 2023-06-22 13:31:46,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 2.702e+02 3.094e+02 3.585e+02 5.689e+02, threshold=6.187e+02, percent-clipped=2.0 2023-06-22 13:32:12,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=967878.0, ans=0.125 2023-06-22 13:32:17,688 INFO [train.py:996] (3/4) Epoch 6, batch 8850, loss[loss=0.2167, simple_loss=0.2965, pruned_loss=0.06846, over 21555.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3147, pruned_loss=0.07963, over 4271753.76 frames. ], batch size: 230, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:32:53,921 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:33:47,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=968058.0, ans=0.125 2023-06-22 13:33:57,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=968118.0, ans=0.125 2023-06-22 13:34:09,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-22 13:34:32,307 INFO [train.py:996] (3/4) Epoch 6, batch 8900, loss[loss=0.2196, simple_loss=0.2785, pruned_loss=0.08033, over 21521.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3102, pruned_loss=0.07808, over 4268260.73 frames. ], batch size: 441, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:36:08,587 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.42 vs. limit=22.5 2023-06-22 13:36:10,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-06-22 13:36:25,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 2.633e+02 3.165e+02 3.746e+02 7.673e+02, threshold=6.331e+02, percent-clipped=6.0 2023-06-22 13:36:31,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-22 13:36:50,914 INFO [train.py:996] (3/4) Epoch 6, batch 8950, loss[loss=0.199, simple_loss=0.2887, pruned_loss=0.05462, over 21605.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3075, pruned_loss=0.07728, over 4269270.14 frames. 
], batch size: 263, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:37:10,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=968598.0, ans=0.05 2023-06-22 13:38:44,770 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:38:49,282 INFO [train.py:996] (3/4) Epoch 6, batch 9000, loss[loss=0.2013, simple_loss=0.2621, pruned_loss=0.07031, over 21168.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3021, pruned_loss=0.077, over 4276769.67 frames. ], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:38:49,285 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 13:39:41,073 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2635, simple_loss=0.3541, pruned_loss=0.08643, over 1796401.00 frames. 2023-06-22 13:39:41,074 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB 2023-06-22 13:40:45,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=968958.0, ans=0.125 2023-06-22 13:40:53,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=969018.0, ans=0.125 2023-06-22 13:40:55,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=969018.0, ans=0.125 2023-06-22 13:41:04,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.727e+02 3.183e+02 3.602e+02 6.441e+02, threshold=6.367e+02, percent-clipped=1.0 2023-06-22 13:41:08,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=969078.0, ans=0.0 2023-06-22 13:41:36,441 INFO [train.py:996] (3/4) Epoch 6, batch 9050, loss[loss=0.2163, simple_loss=0.2955, pruned_loss=0.0686, over 21685.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2987, pruned_loss=0.07352, over 4275960.02 frames. ], batch size: 298, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:41:36,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=969138.0, ans=0.125 2023-06-22 13:43:02,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=969258.0, ans=0.125 2023-06-22 13:43:25,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=969318.0, ans=0.125 2023-06-22 13:43:32,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=969318.0, ans=0.125 2023-06-22 13:43:56,168 INFO [train.py:996] (3/4) Epoch 6, batch 9100, loss[loss=0.2262, simple_loss=0.325, pruned_loss=0.0637, over 21732.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3057, pruned_loss=0.07628, over 4279861.16 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 13:43:56,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=969438.0, ans=0.125 2023-06-22 13:44:28,997 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. 
limit=12.0 2023-06-22 13:44:31,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=969498.0, ans=0.2 2023-06-22 13:45:23,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=969618.0, ans=6.0 2023-06-22 13:45:51,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 2.427e+02 2.866e+02 3.272e+02 6.065e+02, threshold=5.732e+02, percent-clipped=0.0 2023-06-22 13:46:21,659 INFO [train.py:996] (3/4) Epoch 6, batch 9150, loss[loss=0.2292, simple_loss=0.3065, pruned_loss=0.07589, over 21433.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.31, pruned_loss=0.07487, over 4272612.46 frames. ], batch size: 160, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 13:46:24,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=969738.0, ans=0.125 2023-06-22 13:46:50,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-22 13:46:54,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=969798.0, ans=0.125 2023-06-22 13:47:23,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=969858.0, ans=0.0 2023-06-22 13:48:09,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-22 13:48:21,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-22 13:48:31,188 INFO [train.py:996] (3/4) Epoch 6, batch 9200, loss[loss=0.3191, simple_loss=0.3783, pruned_loss=0.1299, over 21375.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3104, pruned_loss=0.0737, over 4276201.48 frames. ], batch size: 507, lr: 5.19e-03, grad_scale: 32.0 2023-06-22 13:48:54,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=970038.0, ans=0.0 2023-06-22 13:49:45,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=970158.0, ans=0.95 2023-06-22 13:49:48,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-22 13:50:20,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.602e+02 3.064e+02 3.617e+02 6.755e+02, threshold=6.128e+02, percent-clipped=8.0 2023-06-22 13:50:35,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=970278.0, ans=0.2 2023-06-22 13:50:39,052 INFO [train.py:996] (3/4) Epoch 6, batch 9250, loss[loss=0.2252, simple_loss=0.2893, pruned_loss=0.08055, over 21433.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3156, pruned_loss=0.07617, over 4267878.65 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 32.0 2023-06-22 13:51:54,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.85 vs. 
limit=15.0 2023-06-22 13:52:20,121 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-22 13:52:49,820 INFO [train.py:996] (3/4) Epoch 6, batch 9300, loss[loss=0.2273, simple_loss=0.302, pruned_loss=0.07624, over 21275.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3101, pruned_loss=0.07579, over 4265789.67 frames. ], batch size: 159, lr: 5.19e-03, grad_scale: 32.0 2023-06-22 13:53:57,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-22 13:54:34,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=970818.0, ans=0.125 2023-06-22 13:54:44,843 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.639e+02 3.001e+02 3.479e+02 6.527e+02, threshold=6.003e+02, percent-clipped=1.0 2023-06-22 13:54:48,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=970878.0, ans=0.0 2023-06-22 13:55:04,075 INFO [train.py:996] (3/4) Epoch 6, batch 9350, loss[loss=0.2532, simple_loss=0.3346, pruned_loss=0.0859, over 21879.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3137, pruned_loss=0.07703, over 4267225.49 frames. ], batch size: 371, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 13:55:19,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=22.5 2023-06-22 13:55:55,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=970998.0, ans=0.125 2023-06-22 13:57:37,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=971178.0, ans=0.0 2023-06-22 13:57:41,560 INFO [train.py:996] (3/4) Epoch 6, batch 9400, loss[loss=0.2168, simple_loss=0.2786, pruned_loss=0.07745, over 21247.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3161, pruned_loss=0.07802, over 4262079.20 frames. ], batch size: 159, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 13:57:46,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=971238.0, ans=0.125 2023-06-22 13:58:14,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=971298.0, ans=0.125 2023-06-22 13:58:14,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=971298.0, ans=0.125 2023-06-22 13:58:43,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=971358.0, ans=0.0 2023-06-22 13:59:15,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.533e+02 2.854e+02 3.549e+02 7.944e+02, threshold=5.708e+02, percent-clipped=6.0 2023-06-22 13:59:45,781 INFO [train.py:996] (3/4) Epoch 6, batch 9450, loss[loss=0.2123, simple_loss=0.2706, pruned_loss=0.07697, over 21213.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3077, pruned_loss=0.07688, over 4258321.08 frames. 
], batch size: 159, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 14:00:39,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=971598.0, ans=0.1 2023-06-22 14:01:08,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-22 14:01:37,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.32 vs. limit=15.0 2023-06-22 14:02:06,235 INFO [train.py:996] (3/4) Epoch 6, batch 9500, loss[loss=0.1773, simple_loss=0.2548, pruned_loss=0.04986, over 21329.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3001, pruned_loss=0.07524, over 4248049.88 frames. ], batch size: 176, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 14:02:52,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=971898.0, ans=0.125 2023-06-22 14:03:52,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.465e+02 2.826e+02 3.384e+02 5.228e+02, threshold=5.653e+02, percent-clipped=0.0 2023-06-22 14:04:02,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=972078.0, ans=0.125 2023-06-22 14:04:12,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=972078.0, ans=0.2 2023-06-22 14:04:20,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=972078.0, ans=0.0 2023-06-22 14:04:25,789 INFO [train.py:996] (3/4) Epoch 6, batch 9550, loss[loss=0.277, simple_loss=0.3484, pruned_loss=0.1028, over 21606.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3049, pruned_loss=0.07763, over 4254442.43 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 14:04:27,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=972138.0, ans=0.0 2023-06-22 14:04:51,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-22 14:05:06,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=972198.0, ans=0.125 2023-06-22 14:05:19,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=972198.0, ans=0.125 2023-06-22 14:06:22,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=972378.0, ans=0.2 2023-06-22 14:06:44,211 INFO [train.py:996] (3/4) Epoch 6, batch 9600, loss[loss=0.2135, simple_loss=0.2816, pruned_loss=0.07271, over 21882.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3081, pruned_loss=0.07989, over 4267046.72 frames. 
], batch size: 298, lr: 5.19e-03, grad_scale: 32.0 2023-06-22 14:06:46,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972438.0, ans=0.1 2023-06-22 14:07:36,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=972498.0, ans=0.05 2023-06-22 14:07:50,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-22 14:08:31,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.618e+02 2.907e+02 3.359e+02 5.518e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-22 14:09:04,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=972678.0, ans=0.09899494936611666 2023-06-22 14:09:09,933 INFO [train.py:996] (3/4) Epoch 6, batch 9650, loss[loss=0.252, simple_loss=0.3254, pruned_loss=0.08929, over 21715.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3086, pruned_loss=0.07967, over 4272183.00 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 32.0 2023-06-22 14:10:08,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972858.0, ans=0.1 2023-06-22 14:10:28,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=972918.0, ans=0.2 2023-06-22 14:10:59,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=972978.0, ans=0.0 2023-06-22 14:11:17,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=972978.0, ans=0.0 2023-06-22 14:11:29,922 INFO [train.py:996] (3/4) Epoch 6, batch 9700, loss[loss=0.2006, simple_loss=0.283, pruned_loss=0.05905, over 21782.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3109, pruned_loss=0.07937, over 4279492.08 frames. ], batch size: 298, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:12:08,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973158.0, ans=0.1 2023-06-22 14:12:31,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=973158.0, ans=0.125 2023-06-22 14:13:01,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.601e+02 2.900e+02 3.367e+02 7.337e+02, threshold=5.800e+02, percent-clipped=1.0 2023-06-22 14:13:41,733 INFO [train.py:996] (3/4) Epoch 6, batch 9750, loss[loss=0.266, simple_loss=0.3476, pruned_loss=0.09216, over 21869.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3047, pruned_loss=0.07818, over 4273491.48 frames. 
], batch size: 107, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:13:43,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=973338.0, ans=0.0 2023-06-22 14:14:19,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=973458.0, ans=0.1 2023-06-22 14:14:57,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=973518.0, ans=0.0 2023-06-22 14:15:24,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=973578.0, ans=0.125 2023-06-22 14:15:38,398 INFO [train.py:996] (3/4) Epoch 6, batch 9800, loss[loss=0.2161, simple_loss=0.2853, pruned_loss=0.07348, over 21655.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3041, pruned_loss=0.07776, over 4271842.44 frames. ], batch size: 230, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:16:18,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=973698.0, ans=0.125 2023-06-22 14:16:31,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=973758.0, ans=0.125 2023-06-22 14:17:06,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=973818.0, ans=0.0 2023-06-22 14:17:11,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973818.0, ans=0.1 2023-06-22 14:17:28,257 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.469e+02 2.952e+02 3.754e+02 9.468e+02, threshold=5.905e+02, percent-clipped=4.0 2023-06-22 14:17:40,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=973878.0, ans=0.125 2023-06-22 14:17:42,960 INFO [train.py:996] (3/4) Epoch 6, batch 9850, loss[loss=0.1931, simple_loss=0.2587, pruned_loss=0.06375, over 21791.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3007, pruned_loss=0.07747, over 4276571.06 frames. ], batch size: 102, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:17:47,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=973938.0, ans=0.125 2023-06-22 14:17:56,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=973938.0, ans=0.125 2023-06-22 14:18:30,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=974058.0, ans=0.2 2023-06-22 14:18:52,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.23 vs. limit=10.0 2023-06-22 14:19:44,033 INFO [train.py:996] (3/4) Epoch 6, batch 9900, loss[loss=0.2146, simple_loss=0.29, pruned_loss=0.06959, over 21330.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2972, pruned_loss=0.07692, over 4261877.68 frames. 
], batch size: 131, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:19:48,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=974238.0, ans=0.125 2023-06-22 14:20:34,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=974358.0, ans=0.125 2023-06-22 14:20:52,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=974358.0, ans=0.125 2023-06-22 14:20:56,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=974358.0, ans=0.125 2023-06-22 14:20:59,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=974358.0, ans=0.0 2023-06-22 14:21:09,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-22 14:21:36,174 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.526e+02 2.876e+02 3.339e+02 4.860e+02, threshold=5.753e+02, percent-clipped=0.0 2023-06-22 14:21:39,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=974478.0, ans=0.125 2023-06-22 14:21:39,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=974478.0, ans=0.0 2023-06-22 14:21:48,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=974478.0, ans=10.0 2023-06-22 14:21:55,174 INFO [train.py:996] (3/4) Epoch 6, batch 9950, loss[loss=0.2067, simple_loss=0.2671, pruned_loss=0.07312, over 21578.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.299, pruned_loss=0.0792, over 4267759.56 frames. ], batch size: 263, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:22:15,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=974538.0, ans=0.125 2023-06-22 14:23:59,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=974778.0, ans=0.125 2023-06-22 14:24:22,814 INFO [train.py:996] (3/4) Epoch 6, batch 10000, loss[loss=0.2135, simple_loss=0.2798, pruned_loss=0.07365, over 21164.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2941, pruned_loss=0.07767, over 4266925.47 frames. ], batch size: 143, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:24:26,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=974838.0, ans=0.1 2023-06-22 14:24:34,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.69 vs. 
limit=10.0 2023-06-22 14:24:40,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=974898.0, ans=0.125 2023-06-22 14:25:08,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=974958.0, ans=0.04949747468305833 2023-06-22 14:25:37,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=974958.0, ans=0.1 2023-06-22 14:25:42,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=975018.0, ans=0.125 2023-06-22 14:25:55,998 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 2.535e+02 2.869e+02 3.521e+02 5.167e+02, threshold=5.738e+02, percent-clipped=0.0 2023-06-22 14:26:30,275 INFO [train.py:996] (3/4) Epoch 6, batch 10050, loss[loss=0.1942, simple_loss=0.2666, pruned_loss=0.06088, over 21399.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2958, pruned_loss=0.07801, over 4266766.05 frames. ], batch size: 194, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:27:22,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=975258.0, ans=0.1 2023-06-22 14:27:23,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=975258.0, ans=0.0 2023-06-22 14:28:50,956 INFO [train.py:996] (3/4) Epoch 6, batch 10100, loss[loss=0.23, simple_loss=0.3102, pruned_loss=0.07493, over 21759.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2945, pruned_loss=0.07625, over 4265418.22 frames. ], batch size: 351, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:28:55,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.51 vs. limit=22.5 2023-06-22 14:29:08,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=15.0 2023-06-22 14:29:41,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=975558.0, ans=0.125 2023-06-22 14:29:48,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=975558.0, ans=0.1 2023-06-22 14:30:00,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.40 vs. limit=22.5 2023-06-22 14:30:04,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=975558.0, ans=0.0 2023-06-22 14:30:14,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=975618.0, ans=0.2 2023-06-22 14:30:41,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.495e+02 2.944e+02 3.852e+02 6.344e+02, threshold=5.889e+02, percent-clipped=1.0 2023-06-22 14:30:56,716 INFO [train.py:996] (3/4) Epoch 6, batch 10150, loss[loss=0.249, simple_loss=0.3188, pruned_loss=0.08965, over 21513.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3005, pruned_loss=0.07829, over 4265737.38 frames. 
], batch size: 389, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:30:59,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-06-22 14:31:30,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=975798.0, ans=0.125 2023-06-22 14:31:45,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=975798.0, ans=0.0 2023-06-22 14:31:55,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=975858.0, ans=0.1 2023-06-22 14:32:11,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=975858.0, ans=0.125 2023-06-22 14:33:06,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=976038.0, ans=0.125 2023-06-22 14:33:07,626 INFO [train.py:996] (3/4) Epoch 6, batch 10200, loss[loss=0.2133, simple_loss=0.3009, pruned_loss=0.06289, over 21839.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2994, pruned_loss=0.07628, over 4256967.47 frames. ], batch size: 317, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:34:17,439 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-22 14:35:04,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 2.209e+02 2.586e+02 3.021e+02 4.237e+02, threshold=5.173e+02, percent-clipped=0.0 2023-06-22 14:35:06,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=976278.0, ans=0.125 2023-06-22 14:35:19,336 INFO [train.py:996] (3/4) Epoch 6, batch 10250, loss[loss=0.1876, simple_loss=0.2576, pruned_loss=0.05881, over 21808.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2944, pruned_loss=0.0705, over 4258751.65 frames. ], batch size: 102, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:35:41,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=976338.0, ans=0.0 2023-06-22 14:36:18,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=976398.0, ans=0.125 2023-06-22 14:36:18,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=976398.0, ans=0.0 2023-06-22 14:36:24,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=976458.0, ans=0.1 2023-06-22 14:36:35,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=976458.0, ans=0.125 2023-06-22 14:37:29,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=976578.0, ans=0.125 2023-06-22 14:37:37,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-22 14:37:40,001 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.68 vs. 
limit=15.0 2023-06-22 14:37:49,227 INFO [train.py:996] (3/4) Epoch 6, batch 10300, loss[loss=0.2474, simple_loss=0.3393, pruned_loss=0.07769, over 21892.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2989, pruned_loss=0.07219, over 4265328.30 frames. ], batch size: 372, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:38:55,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=976758.0, ans=0.0 2023-06-22 14:39:20,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-22 14:39:24,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 2.460e+02 2.831e+02 3.476e+02 5.397e+02, threshold=5.661e+02, percent-clipped=1.0 2023-06-22 14:40:01,748 INFO [train.py:996] (3/4) Epoch 6, batch 10350, loss[loss=0.1923, simple_loss=0.2711, pruned_loss=0.05673, over 21658.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2995, pruned_loss=0.07236, over 4263065.89 frames. ], batch size: 263, lr: 5.17e-03, grad_scale: 16.0 2023-06-22 14:40:31,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.29 vs. limit=6.0 2023-06-22 14:40:56,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=976998.0, ans=0.125 2023-06-22 14:41:09,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=977058.0, ans=0.0 2023-06-22 14:42:16,431 INFO [train.py:996] (3/4) Epoch 6, batch 10400, loss[loss=0.1813, simple_loss=0.2521, pruned_loss=0.05524, over 21630.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.295, pruned_loss=0.07122, over 4254235.55 frames. ], batch size: 263, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 14:43:13,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=977298.0, ans=0.015 2023-06-22 14:43:23,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=977358.0, ans=0.125 2023-06-22 14:43:48,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=977418.0, ans=0.0 2023-06-22 14:44:05,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.727e+02 3.248e+02 3.919e+02 5.926e+02, threshold=6.497e+02, percent-clipped=3.0 2023-06-22 14:44:06,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=977478.0, ans=0.125 2023-06-22 14:44:37,182 INFO [train.py:996] (3/4) Epoch 6, batch 10450, loss[loss=0.2017, simple_loss=0.2711, pruned_loss=0.06618, over 16985.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2987, pruned_loss=0.07383, over 4257367.82 frames. ], batch size: 61, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 14:46:21,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=15.0 2023-06-22 14:46:22,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=977718.0, ans=0.125 2023-06-22 14:46:55,131 INFO [train.py:996] (3/4) Epoch 6, batch 10500, loss[loss=0.2069, simple_loss=0.279, pruned_loss=0.06737, over 21173.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2981, pruned_loss=0.07268, over 4253405.12 frames. ], batch size: 548, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 14:47:07,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=977838.0, ans=0.2 2023-06-22 14:47:39,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=977958.0, ans=0.2 2023-06-22 14:48:31,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.295e+02 2.560e+02 3.007e+02 4.435e+02, threshold=5.120e+02, percent-clipped=0.0 2023-06-22 14:49:10,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=978138.0, ans=0.0 2023-06-22 14:49:11,524 INFO [train.py:996] (3/4) Epoch 6, batch 10550, loss[loss=0.2174, simple_loss=0.2729, pruned_loss=0.081, over 21335.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2929, pruned_loss=0.07274, over 4241833.36 frames. ], batch size: 473, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 14:49:32,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=978198.0, ans=0.2 2023-06-22 14:50:30,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=978318.0, ans=0.5 2023-06-22 14:50:58,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=978378.0, ans=0.125 2023-06-22 14:51:02,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=978378.0, ans=0.125 2023-06-22 14:51:24,204 INFO [train.py:996] (3/4) Epoch 6, batch 10600, loss[loss=0.2258, simple_loss=0.3213, pruned_loss=0.06512, over 19903.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.288, pruned_loss=0.0714, over 4246136.16 frames. ], batch size: 703, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 14:51:29,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-06-22 14:51:44,911 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.61 vs. 
limit=10.0 2023-06-22 14:51:54,744 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:52:49,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=978618.0, ans=0.5 2023-06-22 14:53:10,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=978678.0, ans=0.07 2023-06-22 14:53:20,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=978678.0, ans=0.0 2023-06-22 14:53:20,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-22 14:53:20,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-22 14:53:21,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.439e+02 2.926e+02 3.580e+02 7.545e+02, threshold=5.851e+02, percent-clipped=4.0 2023-06-22 14:53:38,921 INFO [train.py:996] (3/4) Epoch 6, batch 10650, loss[loss=0.1499, simple_loss=0.2216, pruned_loss=0.03914, over 21214.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2911, pruned_loss=0.06984, over 4245564.32 frames. ], batch size: 159, lr: 5.17e-03, grad_scale: 16.0 2023-06-22 14:53:47,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-22 14:54:17,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=978798.0, ans=0.1 2023-06-22 14:54:40,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.11 vs. limit=15.0 2023-06-22 14:54:49,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=978858.0, ans=0.125 2023-06-22 14:55:19,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=978978.0, ans=0.0 2023-06-22 14:55:20,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=978978.0, ans=0.2 2023-06-22 14:55:49,528 INFO [train.py:996] (3/4) Epoch 6, batch 10700, loss[loss=0.2496, simple_loss=0.3249, pruned_loss=0.08717, over 21911.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2901, pruned_loss=0.06981, over 4244761.19 frames. 
], batch size: 372, lr: 5.17e-03, grad_scale: 16.0 2023-06-22 14:56:01,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=979038.0, ans=0.0 2023-06-22 14:57:20,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=979218.0, ans=0.125 2023-06-22 14:57:50,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979218.0, ans=0.0 2023-06-22 14:57:52,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=979278.0, ans=0.125 2023-06-22 14:58:02,678 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.582e+02 2.877e+02 3.268e+02 5.588e+02, threshold=5.755e+02, percent-clipped=0.0 2023-06-22 14:58:17,284 INFO [train.py:996] (3/4) Epoch 6, batch 10750, loss[loss=0.2414, simple_loss=0.3318, pruned_loss=0.07553, over 21795.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3017, pruned_loss=0.07466, over 4255673.58 frames. ], batch size: 282, lr: 5.17e-03, grad_scale: 16.0 2023-06-22 14:59:53,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-06-22 15:00:15,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=979578.0, ans=0.125 2023-06-22 15:00:52,824 INFO [train.py:996] (3/4) Epoch 6, batch 10800, loss[loss=0.2391, simple_loss=0.3157, pruned_loss=0.08118, over 21822.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3064, pruned_loss=0.07529, over 4260278.19 frames. ], batch size: 282, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 15:01:24,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=979698.0, ans=0.125 2023-06-22 15:01:36,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979698.0, ans=0.0 2023-06-22 15:02:02,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=979758.0, ans=0.2 2023-06-22 15:02:05,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-22 15:02:08,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-22 15:02:18,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-22 15:02:44,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.781e+02 3.238e+02 4.047e+02 6.056e+02, threshold=6.476e+02, percent-clipped=1.0 2023-06-22 15:03:16,962 INFO [train.py:996] (3/4) Epoch 6, batch 10850, loss[loss=0.2592, simple_loss=0.3103, pruned_loss=0.1041, over 21486.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3086, pruned_loss=0.07629, over 4260961.01 frames. 
2023-06-22 15:03:18,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979938.0, ans=0.1
2023-06-22 15:03:21,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=979938.0, ans=0.0
2023-06-22 15:03:27,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979938.0, ans=0.0
2023-06-22 15:05:23,920 INFO [train.py:996] (3/4) Epoch 6, batch 10900, loss[loss=0.2127, simple_loss=0.2872, pruned_loss=0.06913, over 21730.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3006, pruned_loss=0.07411, over 4246008.30 frames. ], batch size: 316, lr: 5.17e-03, grad_scale: 32.0
2023-06-22 15:05:27,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=980238.0, ans=0.0
2023-06-22 15:05:31,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=980238.0, ans=0.07
2023-06-22 15:06:20,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=980298.0, ans=15.0
2023-06-22 15:06:21,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=980358.0, ans=0.2
2023-06-22 15:06:25,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=980358.0, ans=0.0
2023-06-22 15:06:32,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=980358.0, ans=0.125
2023-06-22 15:06:39,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=980418.0, ans=0.07
2023-06-22 15:06:54,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=980418.0, ans=0.125
2023-06-22 15:07:09,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.391e+02 2.681e+02 3.118e+02 5.164e+02, threshold=5.361e+02, percent-clipped=0.0
2023-06-22 15:07:34,796 INFO [train.py:996] (3/4) Epoch 6, batch 10950, loss[loss=0.2054, simple_loss=0.2715, pruned_loss=0.0697, over 21473.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2955, pruned_loss=0.0723, over 4247742.86 frames. ], batch size: 389, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:08:51,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=980658.0, ans=0.125
2023-06-22 15:08:59,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=980718.0, ans=0.125
2023-06-22 15:09:28,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.30 vs. limit=5.0
2023-06-22 15:10:01,908 INFO [train.py:996] (3/4) Epoch 6, batch 11000, loss[loss=0.2669, simple_loss=0.3398, pruned_loss=0.09697, over 21732.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2946, pruned_loss=0.07352, over 4254932.68 frames. ], batch size: 112, lr: 5.16e-03, grad_scale: 16.0
2023-06-22 15:10:50,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=980958.0, ans=0.0
2023-06-22 15:11:02,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=981018.0, ans=0.125
2023-06-22 15:11:05,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=981018.0, ans=0.0
2023-06-22 15:11:39,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.522e+02 2.854e+02 3.360e+02 5.643e+02, threshold=5.707e+02, percent-clipped=1.0
2023-06-22 15:11:55,625 INFO [train.py:996] (3/4) Epoch 6, batch 11050, loss[loss=0.2249, simple_loss=0.2863, pruned_loss=0.08177, over 21858.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2939, pruned_loss=0.07441, over 4263101.23 frames. ], batch size: 98, lr: 5.16e-03, grad_scale: 16.0
2023-06-22 15:12:53,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=981198.0, ans=0.0
2023-06-22 15:13:20,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=981318.0, ans=0.0
2023-06-22 15:13:41,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=981378.0, ans=0.0
2023-06-22 15:14:05,807 INFO [train.py:996] (3/4) Epoch 6, batch 11100, loss[loss=0.2474, simple_loss=0.3097, pruned_loss=0.09255, over 21291.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2919, pruned_loss=0.07434, over 4254130.58 frames. ], batch size: 471, lr: 5.16e-03, grad_scale: 16.0
2023-06-22 15:15:34,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=981618.0, ans=0.0
2023-06-22 15:15:51,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.464e+02 2.836e+02 3.391e+02 6.300e+02, threshold=5.672e+02, percent-clipped=1.0
2023-06-22 15:16:19,349 INFO [train.py:996] (3/4) Epoch 6, batch 11150, loss[loss=0.2076, simple_loss=0.2722, pruned_loss=0.07153, over 21882.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2891, pruned_loss=0.07395, over 4251111.27 frames. ], batch size: 107, lr: 5.16e-03, grad_scale: 16.0
2023-06-22 15:17:52,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=981918.0, ans=0.0
2023-06-22 15:18:21,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0
2023-06-22 15:18:21,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.36 vs. limit=15.0
2023-06-22 15:18:32,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=981978.0, ans=0.0
2023-06-22 15:18:36,072 INFO [train.py:996] (3/4) Epoch 6, batch 11200, loss[loss=0.2194, simple_loss=0.2779, pruned_loss=0.0804, over 21759.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.288, pruned_loss=0.07358, over 4250829.83 frames. ], batch size: 317, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:19:06,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5
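Each [optim.py:471] line prints five order statistics of recent gradient norms (min, the three quartiles, max), a clipping threshold, and the fraction of batches clipped. A toy tracker that produces the same kind of report, assuming the threshold is a fixed multiple of the running median; the window size and the exact rule are assumptions, not icefall's actual optimizer logic.

    # Sketch: track recent grad norms, clip against a multiple of the
    # running median, and report quartiles like the log lines above.
    from collections import deque
    import statistics

    class GradNormTracker:
        def __init__(self, clipping_scale=2.0, window=128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)
            self.num_seen = 0
            self.num_clipped = 0

        def update(self, grad_norm: float) -> float:
            """Record a norm; return the factor to scale grads by."""
            self.norms.append(grad_norm)
            if len(self.norms) < 4:
                return 1.0  # not enough history to estimate quartiles
            median = statistics.quantiles(self.norms, n=4)[1]
            threshold = self.clipping_scale * median
            self.num_seen += 1
            if grad_norm > threshold:
                self.num_clipped += 1
                return threshold / grad_norm
            return 1.0

        def report(self):
            # Assumes enough norms have been collected.
            q1, q2, q3 = statistics.quantiles(self.norms, n=4)
            pct = 100.0 * self.num_clipped / max(1, self.num_seen)
            print(f"grad-norm quartiles {min(self.norms):.3e} {q1:.3e} "
                  f"{q2:.3e} {q3:.3e} {max(self.norms):.3e}, "
                  f"threshold={self.clipping_scale * q2:.3e}, "
                  f"percent-clipped={pct}")

With clipping_scale=2.0 and medians near 2.8e+02, thresholds in the 5e+02 to 6e+02 range and percent-clipped values of 0.0 or 1.0 are exactly what the records above show.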
2023-06-22 15:19:29,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0
2023-06-22 15:19:55,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=982218.0, ans=0.125
2023-06-22 15:19:55,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0
2023-06-22 15:20:21,519 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.424e+02 2.651e+02 3.050e+02 4.953e+02, threshold=5.302e+02, percent-clipped=0.0
2023-06-22 15:20:35,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=982338.0, ans=0.0
2023-06-22 15:20:44,986 INFO [train.py:996] (3/4) Epoch 6, batch 11250, loss[loss=0.2471, simple_loss=0.3235, pruned_loss=0.08529, over 21782.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2877, pruned_loss=0.07316, over 4252833.81 frames. ], batch size: 118, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:20:49,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=982338.0, ans=0.05
2023-06-22 15:22:20,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=17.81 vs. limit=15.0
2023-06-22 15:22:31,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0
2023-06-22 15:22:48,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=982578.0, ans=0.125
2023-06-22 15:22:54,006 INFO [train.py:996] (3/4) Epoch 6, batch 11300, loss[loss=0.2016, simple_loss=0.2707, pruned_loss=0.06623, over 20808.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2892, pruned_loss=0.07362, over 4257652.90 frames. ], batch size: 609, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:23:22,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=982638.0, ans=0.0
2023-06-22 15:23:29,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=982698.0, ans=0.0
2023-06-22 15:23:59,428 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:23:59,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0
2023-06-22 15:24:09,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=982758.0, ans=0.2
2023-06-22 15:24:34,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=982878.0, ans=0.0
2023-06-22 15:24:48,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.408e+02 2.703e+02 3.080e+02 4.144e+02, threshold=5.407e+02, percent-clipped=0.0
2023-06-22 15:25:19,012 INFO [train.py:996] (3/4) Epoch 6, batch 11350, loss[loss=0.2003, simple_loss=0.3067, pruned_loss=0.04694, over 20759.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2892, pruned_loss=0.0722, over 4263630.79 frames. ], batch size: 607, lr: 5.16e-03, grad_scale: 32.0
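The Whitening lines compare a per-module statistic of the activations against a limit (the limit is itself scheduled; see the whitening_limit entries elsewhere in this log). One plausible whiteness metric is sketched below: the mean squared eigenvalue of the channel covariance divided by its squared mean, which is 1.0 for perfectly white features and grows as the covariance becomes lopsided. This formula is an assumption for illustration, not the exact metric in scaling.py.

    # Sketch: a covariance "whiteness" metric over channel groups, in
    # the spirit of "Whitening: ... num_groups=4 ... metric=2.41 vs.
    # limit=6.0". Illustrative, not the actual implementation.
    import torch

    def whiteness_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels); channels split into groups.
        metrics = []
        for g in x.chunk(num_groups, dim=1):
            g = g - g.mean(dim=0, keepdim=True)
            cov = (g.T @ g) / g.shape[0]
            eigs = torch.linalg.eigvalsh(cov)
            metrics.append((eigs ** 2).mean() / eigs.mean() ** 2)
        return torch.stack(metrics).mean().item()

    x = torch.randn(1000, 256)                  # nearly white features
    print(whiteness_metric(x))                  # close to 1.0
    x_bad = x * torch.linspace(0.1, 10.0, 256)  # uneven channel scales
    print(whiteness_metric(x_bad))              # much larger metric

Under a metric of this kind, a line like "metric=17.81 vs. limit=15.0" marks a module whose output covariance has drifted past the allowed limit.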
2023-06-22 15:26:08,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0
2023-06-22 15:26:20,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=983058.0, ans=0.125
2023-06-22 15:26:47,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=983118.0, ans=0.125
2023-06-22 15:27:42,752 INFO [train.py:996] (3/4) Epoch 6, batch 11400, loss[loss=0.2221, simple_loss=0.3099, pruned_loss=0.0672, over 21718.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2948, pruned_loss=0.07419, over 4260189.70 frames. ], batch size: 298, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:28:15,668 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:29:08,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=983418.0, ans=0.0
2023-06-22 15:29:37,574 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.455e+02 2.814e+02 3.249e+02 4.711e+02, threshold=5.629e+02, percent-clipped=0.0
2023-06-22 15:29:38,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=983478.0, ans=0.125
2023-06-22 15:29:54,453 INFO [train.py:996] (3/4) Epoch 6, batch 11450, loss[loss=0.2411, simple_loss=0.3182, pruned_loss=0.08197, over 21699.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2957, pruned_loss=0.07306, over 4260641.62 frames. ], batch size: 351, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:29:56,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=983538.0, ans=0.125
2023-06-22 15:30:03,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=983538.0, ans=0.0
2023-06-22 15:30:44,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=983658.0, ans=0.125
2023-06-22 15:30:49,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=12.0
2023-06-22 15:32:07,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=983778.0, ans=0.125
2023-06-22 15:32:10,125 INFO [train.py:996] (3/4) Epoch 6, batch 11500, loss[loss=0.2082, simple_loss=0.3001, pruned_loss=0.05813, over 21769.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3004, pruned_loss=0.07511, over 4264789.41 frames. ], batch size: 298, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:33:36,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=983958.0, ans=0.0
2023-06-22 15:33:47,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0
2023-06-22 15:34:04,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.591e+02 3.004e+02 3.598e+02 5.267e+02, threshold=6.007e+02, percent-clipped=0.0
2023-06-22 15:34:34,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5
2023-06-22 15:34:35,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=984078.0, ans=0.05
2023-06-22 15:34:39,604 INFO [train.py:996] (3/4) Epoch 6, batch 11550, loss[loss=0.266, simple_loss=0.3636, pruned_loss=0.08423, over 21850.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3074, pruned_loss=0.07592, over 4268308.67 frames. ], batch size: 316, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:35:25,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=984198.0, ans=0.2
2023-06-22 15:36:46,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5
2023-06-22 15:37:03,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0
2023-06-22 15:37:12,902 INFO [train.py:996] (3/4) Epoch 6, batch 11600, loss[loss=0.266, simple_loss=0.3527, pruned_loss=0.08961, over 21389.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.322, pruned_loss=0.07789, over 4262561.22 frames. ], batch size: 194, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 15:37:23,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=984438.0, ans=0.125
2023-06-22 15:38:02,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=984498.0, ans=0.0
2023-06-22 15:38:05,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=984558.0, ans=0.125
2023-06-22 15:38:07,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=984558.0, ans=0.2
2023-06-22 15:39:22,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.936e+02 3.532e+02 4.287e+02 8.204e+02, threshold=7.063e+02, percent-clipped=1.0
2023-06-22 15:39:22,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=984678.0, ans=0.2
2023-06-22 15:39:29,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0
2023-06-22 15:39:33,493 INFO [train.py:996] (3/4) Epoch 6, batch 11650, loss[loss=0.2081, simple_loss=0.2823, pruned_loss=0.0669, over 21805.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3269, pruned_loss=0.07829, over 4266496.31 frames. ], batch size: 124, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 15:39:34,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0
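The grad_scale field in the train.py:996 records moves between 8.0, 16.0 and 32.0, which is the signature of dynamic loss scaling under fp16 training: the scale halves when an overflow is detected and creeps back up otherwise. A minimal AMP loop with PyTorch's stock GradScaler shows the mechanism; the model, data and hyperparameters here are placeholders, and a CUDA device is required.

    # Sketch: dynamic loss scaling with PyTorch AMP. The printed scale
    # rises and falls the same way the grad_scale column does.
    import torch

    model = torch.nn.Linear(80, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=5.17e-03)
    scaler = torch.cuda.amp.GradScaler(init_scale=16.0)

    for _ in range(10):
        x = torch.randn(4, 80, device="cuda")
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(x).square().mean()
        scaler.scale(loss).backward()   # gradients carry the scale
        scaler.step(optimizer)          # unscales; skips step on inf/nan
        scaler.update()                 # halves scale on overflow,
                                        # else slowly raises it again
        print(scaler.get_scale())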
2023-06-22 15:39:44,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=984738.0, ans=0.1
2023-06-22 15:41:02,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=984918.0, ans=0.125
2023-06-22 15:41:34,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0
2023-06-22 15:41:42,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=984978.0, ans=0.125
2023-06-22 15:41:46,241 INFO [train.py:996] (3/4) Epoch 6, batch 11700, loss[loss=0.2144, simple_loss=0.28, pruned_loss=0.07446, over 21654.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3178, pruned_loss=0.07845, over 4271443.94 frames. ], batch size: 282, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 15:42:17,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=985098.0, ans=0.0
2023-06-22 15:43:20,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=985218.0, ans=0.125
2023-06-22 15:43:38,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=985278.0, ans=0.02
2023-06-22 15:43:42,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=985278.0, ans=0.125
2023-06-22 15:43:45,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.631e+02 2.879e+02 3.567e+02 8.503e+02, threshold=5.757e+02, percent-clipped=1.0
2023-06-22 15:43:54,599 INFO [train.py:996] (3/4) Epoch 6, batch 11750, loss[loss=0.2348, simple_loss=0.304, pruned_loss=0.08277, over 21881.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3098, pruned_loss=0.07766, over 4266609.62 frames. ], batch size: 372, lr: 5.15e-03, grad_scale: 16.0
2023-06-22 15:44:33,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.20 vs. limit=22.5
2023-06-22 15:45:45,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=985518.0, ans=0.07
2023-06-22 15:45:48,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=985518.0, ans=0.0
2023-06-22 15:46:36,562 INFO [train.py:996] (3/4) Epoch 6, batch 11800, loss[loss=0.2416, simple_loss=0.3202, pruned_loss=0.08152, over 19989.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3105, pruned_loss=0.07877, over 4259546.18 frames. ], batch size: 702, lr: 5.15e-03, grad_scale: 16.0
2023-06-22 15:47:35,513 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:47:54,568 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:48:20,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=985818.0, ans=22.5
2023-06-22 15:48:32,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=985878.0, ans=0.125
2023-06-22 15:48:37,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.436e+02 2.794e+02 3.118e+02 5.023e+02, threshold=5.587e+02, percent-clipped=0.0
2023-06-22 15:48:43,395 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.47 vs. limit=5.0
2023-06-22 15:48:58,501 INFO [train.py:996] (3/4) Epoch 6, batch 11850, loss[loss=0.2337, simple_loss=0.3012, pruned_loss=0.08314, over 21329.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3111, pruned_loss=0.07809, over 4261863.59 frames. ], batch size: 176, lr: 5.15e-03, grad_scale: 16.0
2023-06-22 15:49:17,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0
2023-06-22 15:49:38,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0
2023-06-22 15:50:05,918 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:51:22,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=986178.0, ans=0.0
2023-06-22 15:51:22,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0
2023-06-22 15:51:24,772 INFO [train.py:996] (3/4) Epoch 6, batch 11900, loss[loss=0.2146, simple_loss=0.3067, pruned_loss=0.06127, over 21835.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3101, pruned_loss=0.07559, over 4268478.64 frames. ], batch size: 371, lr: 5.15e-03, grad_scale: 16.0
2023-06-22 15:51:27,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=986238.0, ans=0.025
2023-06-22 15:51:28,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=986238.0, ans=0.2
2023-06-22 15:51:40,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=986238.0, ans=0.125
2023-06-22 15:52:23,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=986298.0, ans=0.125
2023-06-22 15:53:00,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=12.0
2023-06-22 15:53:01,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.32 vs. limit=8.0
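The [scaling.py:1052] WithLoss lines report an auxiliary penalty attached to a module's attention weights, summed over the batch; loss-sum=0.000e+00 reads as "the constraint was not violated on this batch". A sketch of the general pattern follows, with an invented wrapper and an invented penalty; it illustrates the idea of a per-module, logged auxiliary loss, not icefall's actual implementation.

    # Sketch: wrap a module, compute a penalty from its output, keep a
    # running "loss-sum" for logging, and let the trainer add the
    # pending penalty into the batch loss.
    import torch

    class WithAuxLoss(torch.nn.Module):
        def __init__(self, inner, name, scale=1e-4):
            super().__init__()
            self.inner, self.name, self.scale = inner, name, scale
            self.loss_sum = 0.0   # accumulated for periodic logging
            self.pending = None   # penalty tensor for current batch

        def forward(self, x):
            y = self.inner(x)
            # Example penalty: activations straying outside [-10, 10].
            self.pending = (y.abs() - 10.0).clamp(min=0).square().sum()
            self.loss_sum += float(self.pending.detach())
            return y

    mod = WithAuxLoss(torch.nn.Linear(16, 16), "demo")
    out = mod(torch.randn(8, 16))
    loss = out.square().mean() + mod.scale * mod.pending
    print(f"WithLoss: name={mod.name}, loss-sum={mod.loss_sum:.3e}")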
2023-06-22 15:53:13,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.309e+02 2.613e+02 2.997e+02 4.619e+02, threshold=5.227e+02, percent-clipped=0.0
2023-06-22 15:53:41,841 INFO [train.py:996] (3/4) Epoch 6, batch 11950, loss[loss=0.1707, simple_loss=0.242, pruned_loss=0.0497, over 21194.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.31, pruned_loss=0.07255, over 4270913.89 frames. ], batch size: 143, lr: 5.15e-03, grad_scale: 16.0
2023-06-22 15:53:45,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=986538.0, ans=0.1
2023-06-22 15:54:21,644 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:54:21,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=986598.0, ans=0.125
2023-06-22 15:55:07,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=986718.0, ans=0.125
2023-06-22 15:55:52,244 INFO [train.py:996] (3/4) Epoch 6, batch 12000, loss[loss=0.1899, simple_loss=0.2602, pruned_loss=0.05975, over 15609.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3053, pruned_loss=0.07132, over 4262543.46 frames. ], batch size: 61, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 15:55:52,245 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-22 15:56:32,586 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7428, 3.7668, 3.5508, 3.7655], device='cuda:3')
2023-06-22 15:56:35,827 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2631, simple_loss=0.3525, pruned_loss=0.08686, over 1796401.00 frames.
2023-06-22 15:56:35,828 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23918MB
2023-06-22 15:56:38,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=986838.0, ans=0.0
2023-06-22 15:57:12,798 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0
2023-06-22 15:57:19,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=986898.0, ans=0.0
2023-06-22 15:58:18,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.546e+02 3.142e+02 3.620e+02 6.312e+02, threshold=6.283e+02, percent-clipped=4.0
2023-06-22 15:58:41,701 INFO [train.py:996] (3/4) Epoch 6, batch 12050, loss[loss=0.2222, simple_loss=0.2874, pruned_loss=0.07848, over 21366.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3032, pruned_loss=0.07401, over 4263241.53 frames. ], batch size: 143, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 15:59:36,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=987258.0, ans=0.1
2023-06-22 16:00:53,311 INFO [train.py:996] (3/4) Epoch 6, batch 12100, loss[loss=0.2424, simple_loss=0.328, pruned_loss=0.07846, over 21864.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3068, pruned_loss=0.07733, over 4269681.75 frames. ], batch size: 371, lr: 5.15e-03, grad_scale: 32.0
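At batch 12000 the trainer pauses to compute a validation loss, logs the entropy of one layer's attention weights (one value per head), and reports peak GPU memory. The latter two diagnostics are easy to reproduce; the attention shapes below are illustrative, and the entropy formula is the standard one rather than a claim about zipformer.py internals.

    # Sketch: per-head attention entropy and peak-memory reporting.
    import torch

    def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
        # attn: (num_heads, query_len, key_len); rows sum to 1.
        p = attn.clamp(min=1e-20)
        ent = -(p * p.log()).sum(dim=-1)  # entropy per (head, query)
        return ent.mean(dim=-1)           # mean entropy per head

    attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)
    print(attn_weights_entropy(attn))     # 4 values, one per head, cf.
                                          # tensor([1.7428, 3.7668, ...])
    if torch.cuda.is_available():
        mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
        print(f"Maximum memory allocated so far is {mb}MB")

Low entropy flags a head that attends to a single position; a spread of values like the logged tensor([1.7428, 3.7668, 3.5508, 3.7655]) shows the four heads behaving differently.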
2023-06-22 16:01:04,446 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 16:01:11,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=987438.0, ans=0.2
2023-06-22 16:02:08,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=987558.0, ans=0.0
2023-06-22 16:03:10,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=987678.0, ans=0.04949747468305833
2023-06-22 16:03:12,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=987678.0, ans=0.2
2023-06-22 16:03:12,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=987678.0, ans=0.0
2023-06-22 16:03:14,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.726e+02 3.141e+02 3.562e+02 5.633e+02, threshold=6.281e+02, percent-clipped=0.0
2023-06-22 16:03:22,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=987678.0, ans=0.125
2023-06-22 16:03:33,321 INFO [train.py:996] (3/4) Epoch 6, batch 12150, loss[loss=0.2678, simple_loss=0.384, pruned_loss=0.07584, over 19720.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3109, pruned_loss=0.07698, over 4262738.41 frames. ], batch size: 702, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 16:04:12,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=987798.0, ans=0.0
2023-06-22 16:04:16,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=987798.0, ans=0.0
2023-06-22 16:04:35,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0
2023-06-22 16:04:55,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5
2023-06-22 16:05:30,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=987978.0, ans=0.125
2023-06-22 16:05:52,879 INFO [train.py:996] (3/4) Epoch 6, batch 12200, loss[loss=0.2302, simple_loss=0.2875, pruned_loss=0.08639, over 21845.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3074, pruned_loss=0.07609, over 4260066.26 frames. ], batch size: 98, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 16:06:03,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=988038.0, ans=0.2
2023-06-22 16:06:11,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=988098.0, ans=0.1
2023-06-22 16:06:26,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=988098.0, ans=0.0
2023-06-22 16:06:56,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0
2023-06-22 16:07:16,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=988218.0, ans=0.0
2023-06-22 16:07:56,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 2.303e+02 2.808e+02 3.526e+02 6.344e+02, threshold=5.616e+02, percent-clipped=1.0
2023-06-22 16:07:56,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=988278.0, ans=0.2
2023-06-22 16:07:59,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=988278.0, ans=0.2
2023-06-22 16:08:03,853 INFO [train.py:996] (3/4) Epoch 6, batch 12250, loss[loss=0.1631, simple_loss=0.2373, pruned_loss=0.04445, over 21527.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2982, pruned_loss=0.0724, over 4265061.77 frames. ], batch size: 195, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:08:45,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=988398.0, ans=0.1
2023-06-22 16:10:13,408 INFO [train.py:996] (3/4) Epoch 6, batch 12300, loss[loss=0.2519, simple_loss=0.3404, pruned_loss=0.08173, over 21844.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2915, pruned_loss=0.06708, over 4251307.07 frames. ], batch size: 371, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:11:00,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=988758.0, ans=0.0
2023-06-22 16:11:41,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=988818.0, ans=0.2
2023-06-22 16:12:28,479 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.976e+02 2.467e+02 3.000e+02 5.737e+02, threshold=4.934e+02, percent-clipped=1.0
2023-06-22 16:12:34,333 INFO [train.py:996] (3/4) Epoch 6, batch 12350, loss[loss=0.2402, simple_loss=0.3139, pruned_loss=0.08325, over 21560.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2976, pruned_loss=0.06803, over 4254539.24 frames. ], batch size: 548, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:13:39,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=989058.0, ans=0.1
2023-06-22 16:13:51,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=989118.0, ans=0.125
2023-06-22 16:14:22,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=989178.0, ans=0.125
2023-06-22 16:14:32,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0
2023-06-22 16:14:36,938 INFO [train.py:996] (3/4) Epoch 6, batch 12400, loss[loss=0.2372, simple_loss=0.3005, pruned_loss=0.08698, over 21344.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2992, pruned_loss=0.07167, over 4263284.52 frames. ], batch size: 176, lr: 5.14e-03, grad_scale: 16.0
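Note how "batch size" swings from 61 to 702 cuts across these records while the per-batch frame counts stay comparatively stable: batches are evidently assembled up to a duration budget, so long utterances yield few cuts per batch and short ones yield many. A toy duration-capped batcher illustrates the effect; the real sampling is done by lhotse, and the cap below is an arbitrary illustrative value.

    # Sketch: group cuts into batches bounded by total duration.
    def duration_batches(cut_durations, max_duration=600.0):
        batch, total = [], 0.0
        for dur in cut_durations:
            if batch and total + dur > max_duration:
                yield batch
                batch, total = [], 0.0
            batch.append(dur)
            total += dur
        if batch:
            yield batch

    short = [1.3] * 2000   # many short cuts -> large batches
    print(max(len(b) for b in duration_batches(short)))   # ~460 cuts
    long_ = [9.0] * 300    # long cuts -> small batches
    print(max(len(b) for b in duration_batches(long_)))   # ~66 cuts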
2023-06-22 16:15:57,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=989358.0, ans=0.125
2023-06-22 16:16:45,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.525e+02 2.978e+02 3.587e+02 4.939e+02, threshold=5.956e+02, percent-clipped=1.0
2023-06-22 16:16:51,790 INFO [train.py:996] (3/4) Epoch 6, batch 12450, loss[loss=0.2596, simple_loss=0.3369, pruned_loss=0.09115, over 21481.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3029, pruned_loss=0.07516, over 4272044.50 frames. ], batch size: 131, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:17:18,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=989538.0, ans=10.0
2023-06-22 16:17:30,105 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 16:18:32,983 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 16:18:44,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=989718.0, ans=0.2
2023-06-22 16:18:46,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=989718.0, ans=0.1
2023-06-22 16:19:14,376 INFO [train.py:996] (3/4) Epoch 6, batch 12500, loss[loss=0.2511, simple_loss=0.3594, pruned_loss=0.07143, over 21922.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3125, pruned_loss=0.0778, over 4269284.44 frames. ], batch size: 317, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:19:40,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0
2023-06-22 16:20:09,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=989898.0, ans=0.0
2023-06-22 16:20:18,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=989898.0, ans=0.125
2023-06-22 16:20:43,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=989958.0, ans=0.125
2023-06-22 16:20:48,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=989958.0, ans=0.0
2023-06-22 16:20:51,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=989958.0, ans=0.1
2023-06-22 16:21:21,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=990078.0, ans=0.0
2023-06-22 16:21:38,561 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.650e+02 2.940e+02 3.345e+02 4.530e+02, threshold=5.879e+02, percent-clipped=0.0
2023-06-22 16:22:04,951 INFO [train.py:996] (3/4) Epoch 6, batch 12550, loss[loss=0.2564, simple_loss=0.3303, pruned_loss=0.0913, over 21818.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3159, pruned_loss=0.0801, over 4274687.23 frames. ], batch size: 118, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:22:11,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=990138.0, ans=0.0
2023-06-22 16:22:31,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=990138.0, ans=0.0
2023-06-22 16:22:46,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=990198.0, ans=0.125
2023-06-22 16:23:03,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=990258.0, ans=0.0
2023-06-22 16:23:40,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=990318.0, ans=0.05
2023-06-22 16:24:14,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=990378.0, ans=0.07
2023-06-22 16:24:22,782 INFO [train.py:996] (3/4) Epoch 6, batch 12600, loss[loss=0.2059, simple_loss=0.2919, pruned_loss=0.05994, over 21595.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3141, pruned_loss=0.07797, over 4267914.44 frames. ], batch size: 230, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:25:33,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=990618.0, ans=0.1
2023-06-22 16:25:34,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=990618.0, ans=0.1
2023-06-22 16:26:21,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.398e+02 2.729e+02 3.427e+02 5.536e+02, threshold=5.458e+02, percent-clipped=0.0
2023-06-22 16:26:33,020 INFO [train.py:996] (3/4) Epoch 6, batch 12650, loss[loss=0.248, simple_loss=0.3646, pruned_loss=0.06569, over 20758.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3062, pruned_loss=0.07323, over 4274759.72 frames. ], batch size: 608, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:27:20,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=990858.0, ans=0.0
2023-06-22 16:27:29,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=990858.0, ans=0.0
2023-06-22 16:28:27,143 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0
2023-06-22 16:28:44,044 INFO [train.py:996] (3/4) Epoch 6, batch 12700, loss[loss=0.239, simple_loss=0.3161, pruned_loss=0.08098, over 21943.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3057, pruned_loss=0.07532, over 4280615.99 frames. ], batch size: 372, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:28:44,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=991038.0, ans=0.0
2023-06-22 16:29:19,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=991098.0, ans=0.125
2023-06-22 16:29:27,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=12.0
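tot_loss[...] is not the loss of a single batch: it tracks the per-batch loss[...] values smoothly, and the fractional normalizer (e.g. "over 4267914.44 frames") suggests a decayed, frame-weighted running average rather than a plain sum. A toy version follows; the decay factor is an assumption for illustration.

    # Sketch: frame-weighted running average of batch losses with
    # exponential decay, which yields a fractional total frame count.
    import random

    def running_loss(batches, decay=0.999):
        tot_loss, tot_frames = 0.0, 0.0
        for batch_loss, frames in batches:
            tot_loss = decay * tot_loss + batch_loss * frames
            tot_frames = decay * tot_frames + frames
            yield tot_loss / tot_frames

    stream = [(random.uniform(0.20, 0.26), random.uniform(15000, 22000))
              for _ in range(500)]
    for i, avg in enumerate(running_loss(stream)):
        if i % 100 == 0:
            print(f"batch {i}: tot_loss={avg:.4f}")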
2023-06-22 16:29:27,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=991158.0, ans=0.1
2023-06-22 16:29:29,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=991158.0, ans=0.04949747468305833
2023-06-22 16:30:29,463 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 16:30:47,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=991278.0, ans=0.0
2023-06-22 16:30:48,843 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.637e+02 2.986e+02 3.360e+02 4.638e+02, threshold=5.971e+02, percent-clipped=0.0
2023-06-22 16:31:04,998 INFO [train.py:996] (3/4) Epoch 6, batch 12750, loss[loss=0.2175, simple_loss=0.2946, pruned_loss=0.07026, over 21772.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3072, pruned_loss=0.0756, over 4282596.15 frames. ], batch size: 298, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:31:11,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5
2023-06-22 16:32:15,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=991518.0, ans=0.0
2023-06-22 16:32:31,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=991518.0, ans=0.125
2023-06-22 16:32:46,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=991578.0, ans=0.0
2023-06-22 16:33:13,748 INFO [train.py:996] (3/4) Epoch 6, batch 12800, loss[loss=0.2104, simple_loss=0.2844, pruned_loss=0.06818, over 21812.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3065, pruned_loss=0.07651, over 4287968.64 frames. ], batch size: 247, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:33:23,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=991638.0, ans=0.1
2023-06-22 16:34:08,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=991758.0, ans=0.1
2023-06-22 16:34:23,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=991758.0, ans=0.2
2023-06-22 16:34:33,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=991818.0, ans=0.0
2023-06-22 16:35:01,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=991878.0, ans=0.5
2023-06-22 16:35:03,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=991878.0, ans=0.0
2023-06-22 16:35:20,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.507e+02 2.757e+02 3.011e+02 5.577e+02, threshold=5.515e+02, percent-clipped=0.0
2023-06-22 16:35:25,538 INFO [train.py:996] (3/4) Epoch 6, batch 12850, loss[loss=0.1953, simple_loss=0.2876, pruned_loss=0.05146, over 21735.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3094, pruned_loss=0.07854, over 4289073.38 frames. ], batch size: 247, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:36:28,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992058.0, ans=0.1
2023-06-22 16:36:35,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=992058.0, ans=0.0
2023-06-22 16:37:33,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=992178.0, ans=0.0
2023-06-22 16:37:40,703 INFO [train.py:996] (3/4) Epoch 6, batch 12900, loss[loss=0.1939, simple_loss=0.2715, pruned_loss=0.05814, over 21412.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3067, pruned_loss=0.0748, over 4280283.14 frames. ], batch size: 194, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:37:41,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=992238.0, ans=22.5
2023-06-22 16:38:25,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=992298.0, ans=0.125
2023-06-22 16:39:06,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=992358.0, ans=0.0
2023-06-22 16:39:50,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.250e+02 2.553e+02 2.829e+02 4.968e+02, threshold=5.106e+02, percent-clipped=0.0
2023-06-22 16:39:54,887 INFO [train.py:996] (3/4) Epoch 6, batch 12950, loss[loss=0.2242, simple_loss=0.3017, pruned_loss=0.07333, over 21933.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3049, pruned_loss=0.07284, over 4275645.83 frames. ], batch size: 317, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:39:55,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=992538.0, ans=0.125
2023-06-22 16:40:10,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=992538.0, ans=0.0
2023-06-22 16:40:30,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=992598.0, ans=0.0
2023-06-22 16:42:07,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=992778.0, ans=0.0
2023-06-22 16:42:11,632 INFO [train.py:996] (3/4) Epoch 6, batch 13000, loss[loss=0.2358, simple_loss=0.3142, pruned_loss=0.07873, over 21621.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3072, pruned_loss=0.07462, over 4257777.83 frames. ], batch size: 441, lr: 5.13e-03, grad_scale: 16.0
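Many of the scheduled names above are balancer parameters: prob (typically 0.125 here), min_positive (0.025 or 0.05), min_abs (0.02 or 0.5). They read as constraints on activation statistics that are checked only on a random subset of batches. The sketch below shows only the measurement side of such a constraint; the real Balancer corrects violations through the backward pass, which is omitted here as an assumption-laden detail.

    # Sketch: stochastically check activation-statistic constraints
    # like min_positive (smallest allowed fraction of positive values)
    # and min_abs (smallest allowed mean magnitude).
    import random
    import torch

    def balancer_check(x: torch.Tensor, min_positive=0.05,
                       min_abs=0.02, prob=0.125):
        if random.random() > prob:
            return  # skip most batches, like an application probability
        frac_positive = (x > 0).float().mean().item()
        mean_abs = x.abs().mean().item()
        if frac_positive < min_positive:
            print(f"violation: only {frac_positive:.3f} positive")
        if mean_abs < min_abs:
            print(f"violation: mean |x| = {mean_abs:.4f} < {min_abs}")

    for _ in range(20):
        balancer_check(torch.randn(512, 256) - 2.0)  # mostly negative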
2023-06-22 16:42:30,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=992838.0, ans=0.2
2023-06-22 16:42:35,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=992838.0, ans=0.125
2023-06-22 16:43:55,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993018.0, ans=0.1
2023-06-22 16:44:03,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=993018.0, ans=0.125
2023-06-22 16:44:19,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.566e+02 2.996e+02 3.463e+02 5.052e+02, threshold=5.993e+02, percent-clipped=0.0
2023-06-22 16:44:24,266 INFO [train.py:996] (3/4) Epoch 6, batch 13050, loss[loss=0.2245, simple_loss=0.2971, pruned_loss=0.07594, over 21872.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3026, pruned_loss=0.07232, over 4262212.31 frames. ], batch size: 371, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:46:01,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=993318.0, ans=0.125
2023-06-22 16:46:42,715 INFO [train.py:996] (3/4) Epoch 6, batch 13100, loss[loss=0.2547, simple_loss=0.3285, pruned_loss=0.09049, over 21304.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3059, pruned_loss=0.07275, over 4265020.27 frames. ], batch size: 159, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:47:57,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=993558.0, ans=0.2
2023-06-22 16:48:22,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=993618.0, ans=0.0
2023-06-22 16:49:00,428 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.692e+02 3.327e+02 4.339e+02 6.233e+02, threshold=6.654e+02, percent-clipped=2.0
2023-06-22 16:49:18,917 INFO [train.py:996] (3/4) Epoch 6, batch 13150, loss[loss=0.1819, simple_loss=0.2634, pruned_loss=0.05017, over 21584.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.308, pruned_loss=0.07518, over 4265331.58 frames. ], batch size: 263, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:49:25,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=993738.0, ans=0.125
2023-06-22 16:49:35,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=993738.0, ans=0.0
2023-06-22 16:50:05,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=993798.0, ans=0.0
2023-06-22 16:50:25,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0
2023-06-22 16:50:26,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=993858.0, ans=0.125
2023-06-22 16:51:01,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=993978.0, ans=0.2
2023-06-22 16:51:34,355 INFO [train.py:996] (3/4) Epoch 6, batch 13200, loss[loss=0.2387, simple_loss=0.3074, pruned_loss=0.08504, over 21271.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3049, pruned_loss=0.07451, over 4267994.76 frames. ], batch size: 549, lr: 5.13e-03, grad_scale: 32.0
2023-06-22 16:52:29,417 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=18.81 vs. limit=15.0
2023-06-22 16:52:30,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.08 vs. limit=15.0
2023-06-22 16:53:15,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=994218.0, ans=0.2
2023-06-22 16:53:45,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.879e+02 3.252e+02 3.649e+02 5.858e+02, threshold=6.504e+02, percent-clipped=0.0
2023-06-22 16:53:56,119 INFO [train.py:996] (3/4) Epoch 6, batch 13250, loss[loss=0.2264, simple_loss=0.2957, pruned_loss=0.0785, over 21822.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.305, pruned_loss=0.077, over 4277365.44 frames. ], batch size: 107, lr: 5.13e-03, grad_scale: 32.0
2023-06-22 16:54:01,660 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0
2023-06-22 16:54:25,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=994398.0, ans=0.1
2023-06-22 16:54:43,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=994458.0, ans=0.125
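A consistency check worth knowing when reading these records: in every loss[...] and tot_loss[...] triple above, loss equals 0.5 * simple_loss + pruned_loss to within rounding (e.g. 0.5 * 0.3318 + 0.07553 = 0.2414). So the headline loss is a scaled-down simple (non-pruned) transducer loss added to the pruned RNN-T loss; the figures below are taken directly from records in this section.

    # Verify: loss = 0.5 * simple_loss + pruned_loss for logged values.
    def combined(simple_loss, pruned_loss, simple_scale=0.5):
        return simple_scale * simple_loss + pruned_loss

    for simple, pruned, logged in [
        (0.3318, 0.07553, 0.2414),   # Epoch 6, batch 10750, loss[...]
        (0.3017, 0.07466, 0.2255),   # same record, tot_loss[...]
        (0.2872, 0.06913, 0.2127),   # Epoch 6, batch 10900, loss[...]
    ]:
        print(f"{combined(simple, pruned):.4f} vs. logged {logged}")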