2023-06-17 16:37:24,870 INFO [train.py:1064] (2/4) Training started
2023-06-17 16:37:24,870 INFO [train.py:1074] (2/4) Device: cuda:2
2023-06-17 16:37:27,104 INFO [lexicon.py:168] (2/4) Loading pre-compiled data/lang_char/Linv.pt
2023-06-17 16:37:27,416 INFO [train.py:1085] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '802bf98-dirty', 'icefall-git-date': 'Fri Jun 16 18:26:55 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-7-1218101249-5d97868c7c-v8ngc', 'IP address': '10.177.77.18'}, 'world_size': 4, 'master_port': 12536, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2023-06-17 16:37:27,417 INFO [train.py:1087] (2/4) About to create model
2023-06-17 16:37:27,975 INFO [train.py:1091] (2/4) Number of model parameters: 32327030
2023-06-17 16:37:34,419 INFO [train.py:1106] (2/4) Using DDP
2023-06-17 16:37:34,874 INFO [asr_datamodule.py:390] (2/4) About to get train cuts
2023-06-17 16:37:34,891 INFO [asr_datamodule.py:398] (2/4) About to get dev cuts
2023-06-17 16:37:34,901 INFO [asr_datamodule.py:211] (2/4) About to get Musan cuts
2023-06-17 16:37:37,786 INFO [asr_datamodule.py:216] (2/4) Enable MUSAN
2023-06-17 16:37:37,786 INFO [asr_datamodule.py:239] (2/4) Enable SpecAugment
2023-06-17 16:37:37,787 INFO [asr_datamodule.py:240] (2/4) Time warp factor: 80
2023-06-17 16:37:37,787 INFO
[asr_datamodule.py:250] (2/4) Num frame mask: 10 2023-06-17 16:37:37,787 INFO [asr_datamodule.py:263] (2/4) About to create train dataset 2023-06-17 16:37:37,788 INFO [asr_datamodule.py:289] (2/4) Using DynamicBucketingSampler. 2023-06-17 16:37:42,116 INFO [asr_datamodule.py:305] (2/4) About to create train dataloader 2023-06-17 16:37:42,117 INFO [asr_datamodule.py:336] (2/4) About to create dev dataset 2023-06-17 16:37:42,798 INFO [asr_datamodule.py:354] (2/4) About to create dev dataloader 2023-06-17 16:39:51,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.79 vs. limit=5.0 2023-06-17 16:39:59,891 INFO [train.py:996] (2/4) Epoch 1, batch 0, loss[loss=10.29, simple_loss=9.352, pruned_loss=9.387, over 21828.00 frames. ], tot_loss[loss=10.29, simple_loss=9.352, pruned_loss=9.387, over 21828.00 frames. ], batch size: 118, lr: 2.25e-02, grad_scale: 1.0 2023-06-17 16:39:59,892 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-17 16:40:52,890 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=10.49, simple_loss=9.517, pruned_loss=9.679, over 1796401.00 frames. 2023-06-17 16:40:52,891 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 22254MB 2023-06-17 16:41:02,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=0.0, ans=0.5 2023-06-17 16:41:11,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=7.545 2023-06-17 16:41:12,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=60.0, ans=0.5 2023-06-17 16:41:15,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60.0, ans=0.2994 2023-06-17 16:41:28,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=255.46 vs. limit=7.59 2023-06-17 16:41:39,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=255.61 vs. limit=7.635 2023-06-17 16:41:40,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=255.48 vs. limit=7.635 2023-06-17 16:42:13,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=5.045 2023-06-17 16:42:31,418 INFO [train.py:996] (2/4) Epoch 1, batch 50, loss[loss=1.333, simple_loss=1.195, pruned_loss=1.252, over 21290.00 frames. ], tot_loss[loss=4.055, simple_loss=3.754, pruned_loss=2.973, over 950839.14 frames. ], batch size: 159, lr: 2.48e-02, grad_scale: 0.5 2023-06-17 16:42:53,960 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=5.016e-03 2023-06-17 16:43:02,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=31.28 vs. limit=5.075 2023-06-17 16:43:06,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=10.34 vs. 
limit=3.045 2023-06-17 16:43:11,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=216.09 vs. limit=7.635 2023-06-17 16:43:19,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=360.0, ans=0.483125 2023-06-17 16:43:29,619 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=193.37 vs. limit=7.6575 2023-06-17 16:43:46,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=75.25 vs. limit=7.815 2023-06-17 16:43:48,207 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=92.93 vs. limit=7.815 2023-06-17 16:43:54,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=480.0, ans=0.182 2023-06-17 16:43:54,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=480.0, ans=0.44 2023-06-17 16:44:27,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=4.192 2023-06-17 16:44:36,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=540.0, ans=0.4746875 2023-06-17 16:44:41,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=540.0, ans=0.4746875 2023-06-17 16:44:45,952 INFO [train.py:996] (2/4) Epoch 1, batch 100, loss[loss=1.272, simple_loss=1.11, pruned_loss=1.3, over 21566.00 frames. ], tot_loss[loss=2.544, simple_loss=2.322, pruned_loss=2.064, over 1685255.44 frames. ], batch size: 414, lr: 2.70e-02, grad_scale: 1.0 2023-06-17 16:44:49,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 2.605e+02 7.361e+02 5.108e+03 2.907e+04, threshold=1.472e+03, percent-clipped=0.0 2023-06-17 16:44:57,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=600.0, ans=0.294 2023-06-17 16:45:00,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=660.0, ans=0.4690625 2023-06-17 16:45:00,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.92 vs. limit=5.33 2023-06-17 16:45:26,055 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=22.18 vs. limit=5.33 2023-06-17 16:45:41,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.36 vs. limit=8.04 2023-06-17 16:45:42,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.78 vs. 
limit=7.77 2023-06-17 16:45:45,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=720.0, ans=0.41000000000000003 2023-06-17 16:45:49,619 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=71.36 vs. limit=7.77 2023-06-17 16:46:24,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=34.56 vs. limit=5.39 2023-06-17 16:46:28,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=196.45 vs. limit=7.7925 2023-06-17 16:46:53,216 INFO [train.py:996] (2/4) Epoch 1, batch 150, loss[loss=0.8708, simple_loss=0.7444, pruned_loss=0.92, over 21824.00 frames. ], tot_loss[loss=1.982, simple_loss=1.784, pruned_loss=1.727, over 2264191.98 frames. ], batch size: 98, lr: 2.93e-02, grad_scale: 1.0 2023-06-17 16:48:18,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=55.79 vs. limit=7.905 2023-06-17 16:48:41,345 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=66.22 vs. limit=7.9275 2023-06-17 16:48:48,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=3.171 2023-06-17 16:48:50,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=60.59 vs. limit=7.9275 2023-06-17 16:48:58,450 INFO [train.py:996] (2/4) Epoch 1, batch 200, loss[loss=1.048, simple_loss=0.8995, pruned_loss=1.018, over 21789.00 frames. ], tot_loss[loss=1.669, simple_loss=1.488, pruned_loss=1.499, over 2692148.99 frames. ], batch size: 316, lr: 3.15e-02, grad_scale: 2.0 2023-06-17 16:49:01,468 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.811e+01 1.173e+02 1.419e+02 1.881e+02 2.743e+02, threshold=2.839e+02, percent-clipped=0.0 2023-06-17 16:49:08,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=4.48 2023-06-17 16:49:29,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1260.0, ans=0.15275 2023-06-17 16:49:32,988 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=67.80 vs. limit=7.9725 2023-06-17 16:49:37,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=79.07 vs. limit=7.9725 2023-06-17 16:50:04,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=73.64 vs. limit=7.9725 2023-06-17 16:50:05,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=8.445 2023-06-17 16:50:10,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=60.86 vs. 
limit=7.995 2023-06-17 16:50:17,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=27.88 vs. limit=8.49 2023-06-17 16:50:41,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.64 vs. limit=8.0175 2023-06-17 16:50:58,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=67.74 vs. limit=8.0175 2023-06-17 16:50:59,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1380.0, ans=0.4353125 2023-06-17 16:51:13,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=167.57 vs. limit=8.04 2023-06-17 16:51:18,149 INFO [train.py:996] (2/4) Epoch 1, batch 250, loss[loss=0.9894, simple_loss=0.8495, pruned_loss=0.907, over 21918.00 frames. ], tot_loss[loss=1.467, simple_loss=1.297, pruned_loss=1.331, over 3036000.99 frames. ], batch size: 316, lr: 3.38e-02, grad_scale: 2.0 2023-06-17 16:51:24,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1500.0, ans=0.14375 2023-06-17 16:51:27,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=186.44 vs. limit=8.0625 2023-06-17 16:51:30,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=18.32 vs. limit=8.0625 2023-06-17 16:52:17,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1620.0, ans=0.4240625 2023-06-17 16:52:17,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=28.68 vs. limit=8.1075 2023-06-17 16:53:12,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=4.696 2023-06-17 16:53:18,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1740.0, ans=0.13474999999999998 2023-06-17 16:53:23,937 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=21.37 vs. limit=8.1525 2023-06-17 16:53:28,113 INFO [train.py:996] (2/4) Epoch 1, batch 300, loss[loss=1.052, simple_loss=0.8937, pruned_loss=0.9598, over 21736.00 frames. ], tot_loss[loss=1.317, simple_loss=1.157, pruned_loss=1.197, over 3305845.15 frames. ], batch size: 298, lr: 3.60e-02, grad_scale: 4.0 2023-06-17 16:53:31,432 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.248e+01 1.102e+02 1.349e+02 1.694e+02 3.595e+02, threshold=2.697e+02, percent-clipped=2.0 2023-06-17 16:55:16,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.59 vs. limit=8.2425 2023-06-17 16:55:26,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=23.46 vs. 
limit=8.265 2023-06-17 16:55:32,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=8.265 2023-06-17 16:55:38,321 INFO [train.py:996] (2/4) Epoch 1, batch 350, loss[loss=0.8611, simple_loss=0.727, pruned_loss=0.7673, over 21471.00 frames. ], tot_loss[loss=1.202, simple_loss=1.048, pruned_loss=1.089, over 3522900.31 frames. ], batch size: 144, lr: 3.83e-02, grad_scale: 4.0 2023-06-17 16:55:43,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2100.0, ans=0.12125 2023-06-17 16:55:44,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=135.34 vs. limit=8.2875 2023-06-17 16:56:19,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.81 vs. limit=8.31 2023-06-17 16:56:26,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2220.0, ans=0.3959375 2023-06-17 16:56:38,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2220.0, ans=0.3959375 2023-06-17 16:57:13,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=8.355 2023-06-17 16:57:42,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2340.0, ans=0.2766 2023-06-17 16:57:43,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2340.0, ans=0.3903125 2023-06-17 16:57:46,853 INFO [train.py:996] (2/4) Epoch 1, batch 400, loss[loss=0.7842, simple_loss=0.6594, pruned_loss=0.6806, over 21549.00 frames. ], tot_loss[loss=1.114, simple_loss=0.9649, pruned_loss=1.003, over 3683232.16 frames. ], batch size: 263, lr: 4.05e-02, grad_scale: 8.0 2023-06-17 16:57:50,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.350e+01 1.224e+02 1.536e+02 2.025e+02 4.442e+02, threshold=3.072e+02, percent-clipped=8.0 2023-06-17 16:57:52,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2400.0, ans=0.046 2023-06-17 16:57:59,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=2400.0, ans=8.4 2023-06-17 16:58:00,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=8.4 2023-06-17 16:58:01,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2460.0, ans=0.3846875 2023-06-17 16:59:07,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=23.45 vs. limit=8.4675 2023-06-17 16:59:50,786 INFO [train.py:996] (2/4) Epoch 1, batch 450, loss[loss=0.7708, simple_loss=0.6506, pruned_loss=0.6389, over 21161.00 frames. ], tot_loss[loss=1.056, simple_loss=0.9094, pruned_loss=0.9414, over 3811267.58 frames. 
], batch size: 548, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 17:00:18,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=9.57 2023-06-17 17:00:27,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2760.0, ans=0.370625 2023-06-17 17:00:28,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.04 vs. limit=8.535 2023-06-17 17:00:41,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2820.0, ans=0.24230000000000002 2023-06-17 17:01:03,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2820.0, ans=0.3678125 2023-06-17 17:01:03,810 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=33.29 vs. limit=8.557500000000001 2023-06-17 17:01:20,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2880.0, ans=0.365 2023-06-17 17:01:41,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2940.0, ans=0.08975 2023-06-17 17:01:45,396 INFO [train.py:996] (2/4) Epoch 1, batch 500, loss[loss=0.7539, simple_loss=0.634, pruned_loss=0.6123, over 21643.00 frames. ], tot_loss[loss=1.029, simple_loss=0.8811, pruned_loss=0.9026, over 3905401.21 frames. ], batch size: 264, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:01:48,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.420e+01 1.754e+02 2.624e+02 3.522e+02 8.349e+02, threshold=5.248e+02, percent-clipped=35.0 2023-06-17 17:03:03,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=8.67 2023-06-17 17:03:05,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.27 vs. limit=8.67 2023-06-17 17:03:11,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.70 vs. limit=8.67 2023-06-17 17:03:19,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=9.84 2023-06-17 17:03:28,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.15 vs. limit=8.692499999999999 2023-06-17 17:03:31,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=8.692499999999999 2023-06-17 17:03:57,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=3240.0, ans=0.5 2023-06-17 17:03:59,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. 
limit=3.495 2023-06-17 17:04:00,020 INFO [train.py:996] (2/4) Epoch 1, batch 550, loss[loss=0.8588, simple_loss=0.737, pruned_loss=0.6416, over 21377.00 frames. ], tot_loss[loss=0.9992, simple_loss=0.8531, pruned_loss=0.8574, over 3989449.83 frames. ], batch size: 194, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:04:30,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=9.975 2023-06-17 17:04:51,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=5.344 2023-06-17 17:05:11,918 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=1.789e+01 2023-06-17 17:05:15,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=3420.0, ans=0.33968750000000003 2023-06-17 17:05:22,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=3420.0, ans=0.33968750000000003 2023-06-17 17:05:30,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=4.696 2023-06-17 17:05:57,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=3600.0, ans=0.33125 2023-06-17 17:05:58,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=5.4399999999999995 2023-06-17 17:05:59,260 INFO [train.py:996] (2/4) Epoch 1, batch 600, loss[loss=0.7339, simple_loss=0.643, pruned_loss=0.5061, over 21629.00 frames. ], tot_loss[loss=0.9678, simple_loss=0.8262, pruned_loss=0.8085, over 4054637.90 frames. ], batch size: 263, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:06:03,306 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 3.057e+02 4.199e+02 5.888e+02 1.512e+03, threshold=8.399e+02, percent-clipped=32.0 2023-06-17 17:06:29,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=3600.0, ans=0.33125 2023-06-17 17:07:15,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=8.895 2023-06-17 17:07:51,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=3840.0, ans=0.07600000000000001 2023-06-17 17:07:59,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=10.379999999999999 2023-06-17 17:08:07,408 INFO [train.py:996] (2/4) Epoch 1, batch 650, loss[loss=0.9194, simple_loss=0.7893, pruned_loss=0.6578, over 21682.00 frames. ], tot_loss[loss=0.9299, simple_loss=0.7952, pruned_loss=0.7548, over 4107411.35 frames. 
], batch size: 414, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:08:36,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=3960.0, ans=0.05149999999999999 2023-06-17 17:09:12,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=4020.0, ans=0.04991666666666667 2023-06-17 17:09:33,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=4020.0, ans=0.31156249999999996 2023-06-17 17:09:54,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=4080.0, ans=0.30874999999999997 2023-06-17 17:09:58,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=4140.0, ans=0.3059375 2023-06-17 17:10:10,330 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.48 vs. limit=9.0525 2023-06-17 17:10:13,191 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.19 vs. limit=6.05 2023-06-17 17:10:13,699 INFO [train.py:996] (2/4) Epoch 1, batch 700, loss[loss=0.7065, simple_loss=0.6161, pruned_loss=0.4774, over 21738.00 frames. ], tot_loss[loss=0.888, simple_loss=0.7615, pruned_loss=0.7, over 4152764.16 frames. ], batch size: 112, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:10:14,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=4200.0, ans=6.05 2023-06-17 17:10:16,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 4.036e+02 7.786e+02 1.089e+03 2.394e+03, threshold=1.557e+03, percent-clipped=44.0 2023-06-17 17:10:42,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4200.0, ans=0.258 2023-06-17 17:11:02,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=4260.0, ans=0.3003125 2023-06-17 17:11:03,216 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=9.0975 2023-06-17 17:12:22,075 INFO [train.py:996] (2/4) Epoch 1, batch 750, loss[loss=0.7576, simple_loss=0.6685, pruned_loss=0.4899, over 21792.00 frames. ], tot_loss[loss=0.8457, simple_loss=0.7274, pruned_loss=0.6481, over 4177964.81 frames. ], batch size: 351, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:12:26,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4500.0, ans=0.255 2023-06-17 17:12:56,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4560.0, ans=0.2544 2023-06-17 17:13:17,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=10.92 2023-06-17 17:13:20,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.01 vs. 
limit=10.965 2023-06-17 17:13:21,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=4620.0, ans=0.2834375 2023-06-17 17:13:56,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=5.872 2023-06-17 17:14:01,345 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=9.254999999999999 2023-06-17 17:14:11,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4740.0, ans=0.2778125 2023-06-17 17:14:30,923 INFO [train.py:996] (2/4) Epoch 1, batch 800, loss[loss=0.6722, simple_loss=0.599, pruned_loss=0.4196, over 21004.00 frames. ], tot_loss[loss=0.8024, simple_loss=0.693, pruned_loss=0.5974, over 4206043.49 frames. ], batch size: 608, lr: 4.49e-02, grad_scale: 16.0 2023-06-17 17:14:31,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=4800.0, ans=0.00982608695652174 2023-06-17 17:14:31,887 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=9.3 2023-06-17 17:14:33,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 4.647e+02 7.147e+02 1.104e+03 3.003e+03, threshold=1.429e+03, percent-clipped=10.0 2023-06-17 17:14:54,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=4860.0, ans=0.04949747468305833 2023-06-17 17:16:00,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=4980.0, ans=6.245 2023-06-17 17:16:09,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=9.3675 2023-06-17 17:16:10,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=4980.0, ans=0.00978695652173913 2023-06-17 17:16:12,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=11.235 2023-06-17 17:16:20,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=5040.0, ans=0.04566666666666667 2023-06-17 17:16:20,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=5040.0, ans=0.07 2023-06-17 17:16:32,227 INFO [train.py:996] (2/4) Epoch 1, batch 850, loss[loss=0.6208, simple_loss=0.5486, pruned_loss=0.3912, over 21887.00 frames. ], tot_loss[loss=0.7662, simple_loss=0.6643, pruned_loss=0.555, over 4224896.92 frames. ], batch size: 107, lr: 4.49e-02, grad_scale: 4.0 2023-06-17 17:16:34,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. 
limit=6.275 2023-06-17 17:16:38,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=5100.0, ans=0.8009999999999999 2023-06-17 17:16:59,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=3.774 2023-06-17 17:18:01,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=5280.0, ans=0.009721739130434783 2023-06-17 17:18:01,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=5280.0, ans=0.2525 2023-06-17 17:18:02,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.32 vs. limit=9.48 2023-06-17 17:18:49,064 INFO [train.py:996] (2/4) Epoch 1, batch 900, loss[loss=0.5175, simple_loss=0.4723, pruned_loss=0.2998, over 21291.00 frames. ], tot_loss[loss=0.7307, simple_loss=0.6358, pruned_loss=0.5159, over 4242225.15 frames. ], batch size: 159, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:18:49,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=9.525 2023-06-17 17:18:55,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 5.086e+02 7.748e+02 1.151e+03 3.891e+03, threshold=1.550e+03, percent-clipped=18.0 2023-06-17 17:19:28,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=5520.0, ans=0.7068000000000001 2023-06-17 17:19:31,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=6.208 2023-06-17 17:19:42,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.43 vs. limit=6.38 2023-06-17 17:20:03,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5580.0, ans=0.24419999999999997 2023-06-17 17:20:30,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=9.615 2023-06-17 17:20:56,017 INFO [train.py:996] (2/4) Epoch 1, batch 950, loss[loss=0.6536, simple_loss=0.5806, pruned_loss=0.3998, over 21500.00 frames. ], tot_loss[loss=0.7008, simple_loss=0.6131, pruned_loss=0.4813, over 4255981.37 frames. 
], batch size: 131, lr: 4.48e-02, grad_scale: 4.0 2023-06-17 17:21:25,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=5700.0, ans=0.23281249999999998 2023-06-17 17:21:33,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=5760.0, ans=0.04266666666666667 2023-06-17 17:22:01,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=5820.0, ans=0.2271875 2023-06-17 17:22:06,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=5820.0, ans=0.04241666666666667 2023-06-17 17:22:17,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=5820.0, ans=0.04241666666666667 2023-06-17 17:22:35,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=5940.0, ans=0.04191666666666667 2023-06-17 17:22:52,028 INFO [train.py:996] (2/4) Epoch 1, batch 1000, loss[loss=0.6722, simple_loss=0.5974, pruned_loss=0.4074, over 21439.00 frames. ], tot_loss[loss=0.6806, simple_loss=0.5976, pruned_loss=0.4563, over 4268252.06 frames. ], batch size: 508, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:23:13,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 4.662e+02 7.435e+02 1.273e+03 3.855e+03, threshold=1.487e+03, percent-clipped=17.0 2023-06-17 17:23:44,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=9.7725 2023-06-17 17:24:38,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=6180.0, ans=0.00952608695652174 2023-06-17 17:25:17,337 INFO [train.py:996] (2/4) Epoch 1, batch 1050, loss[loss=0.7012, simple_loss=0.6219, pruned_loss=0.4236, over 21529.00 frames. ], tot_loss[loss=0.6613, simple_loss=0.5829, pruned_loss=0.4335, over 4272707.89 frames. ], batch size: 471, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:25:40,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=6300.0, ans=0.04041666666666667 2023-06-17 17:25:40,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6300.0, ans=0.237 2023-06-17 17:25:49,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=9.8625 2023-06-17 17:26:11,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=6360.0, ans=0.04016666666666667 2023-06-17 17:26:15,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=6420.0, ans=0.0 2023-06-17 17:27:38,799 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=9.9525 2023-06-17 17:27:45,218 INFO [train.py:996] (2/4) Epoch 1, batch 1100, loss[loss=0.5657, simple_loss=0.5187, pruned_loss=0.3187, over 21261.00 frames. ], tot_loss[loss=0.6392, simple_loss=0.5668, pruned_loss=0.4088, over 4275452.76 frames. 
], batch size: 548, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:28:00,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 3.328e+02 5.726e+02 1.108e+03 4.215e+03, threshold=1.145e+03, percent-clipped=17.0 2023-06-17 17:28:01,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=6600.0, ans=0.03916666666666667 2023-06-17 17:28:07,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=6660.0, ans=0.058375 2023-06-17 17:28:14,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=6660.0, ans=0.1878125 2023-06-17 17:28:44,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=6720.0, ans=9.2 2023-06-17 17:29:28,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=12.585 2023-06-17 17:30:01,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=6840.0, ans=0.3026 2023-06-17 17:30:10,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.16 vs. limit=12.675 2023-06-17 17:30:10,958 INFO [train.py:996] (2/4) Epoch 1, batch 1150, loss[loss=0.5573, simple_loss=0.5177, pruned_loss=0.3052, over 21789.00 frames. ], tot_loss[loss=0.6192, simple_loss=0.5522, pruned_loss=0.3874, over 4274432.88 frames. ], batch size: 316, lr: 4.47e-02, grad_scale: 4.0 2023-06-17 17:30:35,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=6960.0, ans=0.03766666666666667 2023-06-17 17:30:37,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=6960.0, ans=0.17375000000000002 2023-06-17 17:32:20,710 INFO [train.py:996] (2/4) Epoch 1, batch 1200, loss[loss=0.5991, simple_loss=0.5506, pruned_loss=0.3338, over 21809.00 frames. ], tot_loss[loss=0.6064, simple_loss=0.5438, pruned_loss=0.3712, over 4282262.24 frames. ], batch size: 124, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:32:25,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7200.0, ans=0.22799999999999998 2023-06-17 17:32:28,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=7200.0, ans=0.0 2023-06-17 17:32:36,898 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.515e+02 4.923e+02 7.154e+02 1.207e+03 2.545e+03, threshold=1.431e+03, percent-clipped=26.0 2023-06-17 17:34:27,343 INFO [train.py:996] (2/4) Epoch 1, batch 1250, loss[loss=0.4985, simple_loss=0.4649, pruned_loss=0.27, over 21390.00 frames. ], tot_loss[loss=0.6008, simple_loss=0.541, pruned_loss=0.3614, over 4285096.91 frames. 
], batch size: 159, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:34:27,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=7500.0, ans=0.07 2023-06-17 17:36:26,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=7740.0, ans=0.00918695652173913 2023-06-17 17:36:37,122 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:36:45,489 INFO [train.py:996] (2/4) Epoch 1, batch 1300, loss[loss=0.4745, simple_loss=0.4251, pruned_loss=0.2737, over 20850.00 frames. ], tot_loss[loss=0.5907, simple_loss=0.5348, pruned_loss=0.3489, over 4288328.72 frames. ], batch size: 608, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:37:02,068 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 4.002e+02 7.251e+02 1.294e+03 4.242e+03, threshold=1.450e+03, percent-clipped=21.0 2023-06-17 17:37:05,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=7860.0, ans=0.033916666666666664 2023-06-17 17:37:54,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=7920.0, ans=0.6228 2023-06-17 17:38:14,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7980.0, ans=0.2202 2023-06-17 17:38:54,694 INFO [train.py:996] (2/4) Epoch 1, batch 1350, loss[loss=0.4673, simple_loss=0.4363, pruned_loss=0.2519, over 21743.00 frames. ], tot_loss[loss=0.5815, simple_loss=0.5293, pruned_loss=0.3379, over 4295896.94 frames. ], batch size: 247, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:38:55,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=8100.0, ans=0.04949747468305833 2023-06-17 17:39:24,983 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:39:28,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=8100.0, ans=0.03291666666666667 2023-06-17 17:39:39,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=8160.0, ans=0.0 2023-06-17 17:39:41,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=8160.0, ans=0.009095652173913043 2023-06-17 17:41:04,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=8340.0, ans=0.03191666666666667 2023-06-17 17:41:08,246 INFO [train.py:996] (2/4) Epoch 1, batch 1400, loss[loss=0.4662, simple_loss=0.4293, pruned_loss=0.2564, over 14706.00 frames. ], tot_loss[loss=0.5697, simple_loss=0.5207, pruned_loss=0.3266, over 4289894.64 frames. ], batch size: 61, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:41:12,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.20 vs. 
limit=13.8 2023-06-17 17:41:24,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 4.733e+02 7.957e+02 1.163e+03 2.485e+03, threshold=1.591e+03, percent-clipped=13.0 2023-06-17 17:42:10,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=8520.0, ans=0.6018 2023-06-17 17:42:13,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8520.0, ans=0.2148 2023-06-17 17:43:07,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=8640.0, ans=0.05 2023-06-17 17:43:12,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=8640.0, ans=0.125 2023-06-17 17:43:18,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=8640.0, ans=0.008991304347826088 2023-06-17 17:43:23,578 INFO [train.py:996] (2/4) Epoch 1, batch 1450, loss[loss=0.6021, simple_loss=0.5484, pruned_loss=0.3356, over 21815.00 frames. ], tot_loss[loss=0.5637, simple_loss=0.5166, pruned_loss=0.3198, over 4291870.40 frames. ], batch size: 332, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:43:29,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=8700.0, ans=0.125 2023-06-17 17:43:57,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8760.0, ans=0.2124 2023-06-17 17:44:33,587 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:44:42,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=8820.0, ans=0.125 2023-06-17 17:44:46,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=8820.0, ans=0.07 2023-06-17 17:44:49,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.89 vs. limit=9.440000000000001 2023-06-17 17:45:05,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.56 vs. limit=14.16 2023-06-17 17:45:34,294 INFO [train.py:996] (2/4) Epoch 1, batch 1500, loss[loss=0.4904, simple_loss=0.4598, pruned_loss=0.262, over 21853.00 frames. ], tot_loss[loss=0.5542, simple_loss=0.5102, pruned_loss=0.3108, over 4289611.35 frames. 
], batch size: 282, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:46:08,689 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.531e+02 4.868e+02 8.441e+02 1.240e+03 3.321e+03, threshold=1.688e+03, percent-clipped=12.0 2023-06-17 17:46:35,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=9120.0, ans=0.125 2023-06-17 17:47:25,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=9180.0, ans=0.125 2023-06-17 17:47:33,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=9180.0, ans=0.125 2023-06-17 17:47:57,231 INFO [train.py:996] (2/4) Epoch 1, batch 1550, loss[loss=0.512, simple_loss=0.4839, pruned_loss=0.2704, over 21508.00 frames. ], tot_loss[loss=0.5385, simple_loss=0.4998, pruned_loss=0.2973, over 4285835.22 frames. ], batch size: 473, lr: 4.45e-02, grad_scale: 8.0 2023-06-17 17:49:01,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=9420.0, ans=0.02741666666666667 2023-06-17 17:49:33,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9480.0, ans=0.2052 2023-06-17 17:49:42,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=9480.0, ans=0.5682 2023-06-17 17:49:43,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=9480.0, ans=0.04949747468305833 2023-06-17 17:50:26,905 INFO [train.py:996] (2/4) Epoch 1, batch 1600, loss[loss=0.5573, simple_loss=0.5146, pruned_loss=0.3031, over 21369.00 frames. ], tot_loss[loss=0.5326, simple_loss=0.4966, pruned_loss=0.2911, over 4281516.07 frames. ], batch size: 548, lr: 4.45e-02, grad_scale: 16.0 2023-06-17 17:50:28,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=9600.0, ans=0.125 2023-06-17 17:50:30,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=9600.0, ans=0.125 2023-06-17 17:50:36,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.850e+02 5.686e+02 1.025e+03 3.086e+03, threshold=1.137e+03, percent-clipped=9.0 2023-06-17 17:52:43,629 INFO [train.py:996] (2/4) Epoch 1, batch 1650, loss[loss=0.5685, simple_loss=0.5127, pruned_loss=0.3173, over 21845.00 frames. ], tot_loss[loss=0.5258, simple_loss=0.4934, pruned_loss=0.2841, over 4276170.06 frames. ], batch size: 441, lr: 4.45e-02, grad_scale: 16.0 2023-06-17 17:52:44,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=11.2125 2023-06-17 17:53:11,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=9900.0, ans=0.5535000000000001 2023-06-17 17:54:19,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.93 vs. 
limit=11.2575 2023-06-17 17:54:41,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10140.0, ans=0.1986 2023-06-17 17:55:02,456 INFO [train.py:996] (2/4) Epoch 1, batch 1700, loss[loss=0.4692, simple_loss=0.4469, pruned_loss=0.2453, over 21175.00 frames. ], tot_loss[loss=0.5285, simple_loss=0.4973, pruned_loss=0.2836, over 4282042.37 frames. ], batch size: 607, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:55:31,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.673e+02 4.663e+02 7.889e+02 1.170e+03 3.370e+03, threshold=1.578e+03, percent-clipped=25.0 2023-06-17 17:57:02,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=10380.0, ans=11.3925 2023-06-17 17:57:16,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=10440.0, ans=0.5346000000000001 2023-06-17 17:57:16,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=10440.0, ans=0.35660000000000003 2023-06-17 17:57:23,250 INFO [train.py:996] (2/4) Epoch 1, batch 1750, loss[loss=0.3822, simple_loss=0.3972, pruned_loss=0.179, over 21617.00 frames. ], tot_loss[loss=0.5108, simple_loss=0.4867, pruned_loss=0.2696, over 4283177.67 frames. ], batch size: 247, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:58:24,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=11.46 2023-06-17 17:59:47,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=10740.0, ans=0.125 2023-06-17 17:59:55,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=10740.0, ans=0.125 2023-06-17 17:59:58,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=10740.0, ans=0.008534782608695652 2023-06-17 18:00:11,336 INFO [train.py:996] (2/4) Epoch 1, batch 1800, loss[loss=0.4619, simple_loss=0.4862, pruned_loss=0.2138, over 21769.00 frames. ], tot_loss[loss=0.498, simple_loss=0.4779, pruned_loss=0.2603, over 4276300.63 frames. ], batch size: 282, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 18:00:28,807 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 4.588e+02 7.695e+02 1.107e+03 4.356e+03, threshold=1.539e+03, percent-clipped=16.0 2023-06-17 18:01:11,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=10920.0, ans=0.5178 2023-06-17 18:01:17,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.32 vs. 
limit=15.69 2023-06-17 18:01:31,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=10920.0, ans=0.125 2023-06-17 18:01:52,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=10980.0, ans=0.125 2023-06-17 18:01:55,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11040.0, ans=0.1896 2023-06-17 18:02:21,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=11040.0, ans=0.1896 2023-06-17 18:02:36,584 INFO [train.py:996] (2/4) Epoch 1, batch 1850, loss[loss=0.4542, simple_loss=0.449, pruned_loss=0.2282, over 21411.00 frames. ], tot_loss[loss=0.4879, simple_loss=0.4739, pruned_loss=0.2512, over 4271108.02 frames. ], batch size: 194, lr: 4.43e-02, grad_scale: 8.0 2023-06-17 18:02:50,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=11100.0, ans=0.125 2023-06-17 18:04:02,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=11280.0, ans=0.125 2023-06-17 18:04:16,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=11280.0, ans=0.5052000000000001 2023-06-17 18:04:36,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=11340.0, ans=10.67 2023-06-17 18:04:54,552 INFO [train.py:996] (2/4) Epoch 1, batch 1900, loss[loss=0.3763, simple_loss=0.3805, pruned_loss=0.1848, over 21210.00 frames. ], tot_loss[loss=0.4854, simple_loss=0.4723, pruned_loss=0.2493, over 4278487.69 frames. ], batch size: 143, lr: 4.43e-02, grad_scale: 8.0 2023-06-17 18:04:54,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11400.0, ans=0.186 2023-06-17 18:04:57,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.45 vs. limit=4.71 2023-06-17 18:05:06,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=11400.0, ans=0.125 2023-06-17 18:05:12,012 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 4.389e+02 6.592e+02 1.010e+03 2.305e+03, threshold=1.318e+03, percent-clipped=4.0 2023-06-17 18:05:46,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.67 vs. limit=16.14 2023-06-17 18:05:55,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=11520.0, ans=0.01866666666666667 2023-06-17 18:06:39,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=11640.0, ans=0.01816666666666667 2023-06-17 18:06:39,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=11640.0, ans=0.07 2023-06-17 18:07:06,309 INFO [train.py:996] (2/4) Epoch 1, batch 1950, loss[loss=0.4415, simple_loss=0.4412, pruned_loss=0.2204, over 21786.00 frames. 
], tot_loss[loss=0.4772, simple_loss=0.464, pruned_loss=0.2452, over 4262271.29 frames. ], batch size: 351, lr: 4.43e-02, grad_scale: 8.0 2023-06-17 18:07:40,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=11760.0, ans=0.0 2023-06-17 18:08:53,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.70 vs. limit=11.9775 2023-06-17 18:09:07,362 INFO [train.py:996] (2/4) Epoch 1, batch 2000, loss[loss=0.3001, simple_loss=0.3292, pruned_loss=0.1355, over 21764.00 frames. ], tot_loss[loss=0.4636, simple_loss=0.4543, pruned_loss=0.2364, over 4263614.38 frames. ], batch size: 124, lr: 4.42e-02, grad_scale: 16.0 2023-06-17 18:09:37,815 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 4.885e+02 7.905e+02 1.281e+03 2.485e+03, threshold=1.581e+03, percent-clipped=23.0 2023-06-17 18:10:08,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=12060.0, ans=0.125 2023-06-17 18:10:18,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=12120.0, ans=0.125 2023-06-17 18:10:58,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=12.067499999999999 2023-06-17 18:11:09,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=12240.0, ans=0.125 2023-06-17 18:11:39,760 INFO [train.py:996] (2/4) Epoch 1, batch 2050, loss[loss=0.5278, simple_loss=0.5298, pruned_loss=0.2629, over 21604.00 frames. ], tot_loss[loss=0.4636, simple_loss=0.4558, pruned_loss=0.2356, over 4267975.85 frames. ], batch size: 442, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 18:11:46,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=12300.0, ans=0.125 2023-06-17 18:11:50,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=12300.0, ans=0.07 2023-06-17 18:12:41,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.88 vs. limit=12.157499999999999 2023-06-17 18:12:45,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=12420.0, ans=8.105 2023-06-17 18:13:04,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.24 vs. limit=16.86 2023-06-17 18:13:28,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=12540.0, ans=0.125 2023-06-17 18:13:46,881 INFO [train.py:996] (2/4) Epoch 1, batch 2100, loss[loss=0.4516, simple_loss=0.449, pruned_loss=0.2271, over 21729.00 frames. ], tot_loss[loss=0.4658, simple_loss=0.4593, pruned_loss=0.2361, over 4276958.55 frames. 
], batch size: 282, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 18:13:49,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=12600.0, ans=0.05 2023-06-17 18:13:50,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=12600.0, ans=0.124 2023-06-17 18:13:54,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=12600.0, ans=10.0 2023-06-17 18:13:59,749 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.516e+02 5.167e+02 7.622e+02 1.111e+03 2.066e+03, threshold=1.524e+03, percent-clipped=6.0 2023-06-17 18:15:45,566 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:16:06,982 INFO [train.py:996] (2/4) Epoch 1, batch 2150, loss[loss=0.4155, simple_loss=0.3893, pruned_loss=0.2208, over 20056.00 frames. ], tot_loss[loss=0.4638, simple_loss=0.4585, pruned_loss=0.2346, over 4274140.15 frames. ], batch size: 703, lr: 4.41e-02, grad_scale: 8.0 2023-06-17 18:16:15,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=12.3375 2023-06-17 18:17:29,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.72 vs. limit=4.953 2023-06-17 18:17:33,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=13080.0, ans=0.125 2023-06-17 18:17:34,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13080.0, ans=0.1692 2023-06-17 18:18:07,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=13140.0, ans=0.00801304347826087 2023-06-17 18:18:29,032 INFO [train.py:996] (2/4) Epoch 1, batch 2200, loss[loss=0.4092, simple_loss=0.4328, pruned_loss=0.1928, over 21805.00 frames. ], tot_loss[loss=0.4621, simple_loss=0.4605, pruned_loss=0.2318, over 4268522.01 frames. ], batch size: 371, lr: 4.41e-02, grad_scale: 8.0 2023-06-17 18:18:48,045 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 4.529e+02 5.924e+02 1.033e+03 2.265e+03, threshold=1.185e+03, percent-clipped=8.0 2023-06-17 18:18:57,237 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:20:35,189 INFO [train.py:996] (2/4) Epoch 1, batch 2250, loss[loss=0.4106, simple_loss=0.4149, pruned_loss=0.2031, over 21787.00 frames. ], tot_loss[loss=0.4538, simple_loss=0.456, pruned_loss=0.2258, over 4272942.91 frames. 
], batch size: 371, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:21:09,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=13560.0, ans=8.39 2023-06-17 18:21:33,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=13620.0, ans=0.125 2023-06-17 18:22:17,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=13740.0, ans=0.125 2023-06-17 18:22:17,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=13740.0, ans=0.09899494936611666 2023-06-17 18:22:34,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=13800.0, ans=0.125 2023-06-17 18:22:35,734 INFO [train.py:996] (2/4) Epoch 1, batch 2300, loss[loss=0.3531, simple_loss=0.3683, pruned_loss=0.169, over 21274.00 frames. ], tot_loss[loss=0.4469, simple_loss=0.4492, pruned_loss=0.2223, over 4276054.35 frames. ], batch size: 176, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:22:39,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13800.0, ans=0.162 2023-06-17 18:22:54,880 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 4.349e+02 7.156e+02 9.563e+02 2.862e+03, threshold=1.431e+03, percent-clipped=11.0 2023-06-17 18:23:11,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=13920.0, ans=0.125 2023-06-17 18:24:01,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.37 vs. limit=17.985 2023-06-17 18:24:50,683 INFO [train.py:996] (2/4) Epoch 1, batch 2350, loss[loss=0.4198, simple_loss=0.4262, pruned_loss=0.2067, over 21432.00 frames. ], tot_loss[loss=0.4444, simple_loss=0.4476, pruned_loss=0.2206, over 4269768.28 frames. ], batch size: 131, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:24:53,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.32 vs. limit=12.05 2023-06-17 18:24:59,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.88 vs. limit=12.05 2023-06-17 18:25:37,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=14160.0, ans=0.007666666666666669 2023-06-17 18:25:40,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=9.687999999999999 2023-06-17 18:25:56,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.32 vs. limit=12.11 2023-06-17 18:26:22,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=14280.0, ans=0.125 2023-06-17 18:26:42,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.26 vs. 
limit=8.585 2023-06-17 18:27:03,559 INFO [train.py:996] (2/4) Epoch 1, batch 2400, loss[loss=0.4865, simple_loss=0.4885, pruned_loss=0.2423, over 21353.00 frames. ], tot_loss[loss=0.4492, simple_loss=0.4528, pruned_loss=0.2228, over 4269967.17 frames. ], batch size: 143, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:27:24,516 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.741e+02 4.803e+02 6.612e+02 1.169e+03 2.103e+03, threshold=1.322e+03, percent-clipped=15.0 2023-06-17 18:27:58,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=14460.0, ans=0.39390000000000003 2023-06-17 18:28:23,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=14580.0, ans=0.10419999999999999 2023-06-17 18:28:36,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=14580.0, ans=0.125 2023-06-17 18:29:04,436 INFO [train.py:996] (2/4) Epoch 1, batch 2450, loss[loss=0.3882, simple_loss=0.4124, pruned_loss=0.182, over 21502.00 frames. ], tot_loss[loss=0.451, simple_loss=0.4547, pruned_loss=0.2237, over 4268915.69 frames. ], batch size: 230, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:30:47,066 INFO [train.py:996] (2/4) Epoch 1, batch 2500, loss[loss=0.5769, simple_loss=0.5399, pruned_loss=0.3069, over 21765.00 frames. ], tot_loss[loss=0.443, simple_loss=0.4476, pruned_loss=0.2192, over 4268530.78 frames. ], batch size: 441, lr: 4.38e-02, grad_scale: 16.0 2023-06-17 18:30:54,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=15000.0, ans=0.007608695652173913 2023-06-17 18:31:09,192 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.980e+02 5.871e+02 7.826e+02 2.441e+03, threshold=1.174e+03, percent-clipped=5.0 2023-06-17 18:31:10,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=13.1475 2023-06-17 18:32:00,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=15120.0, ans=0.0036666666666666722 2023-06-17 18:32:56,403 INFO [train.py:996] (2/4) Epoch 1, batch 2550, loss[loss=0.4787, simple_loss=0.4686, pruned_loss=0.2444, over 21464.00 frames. ], tot_loss[loss=0.4404, simple_loss=0.4473, pruned_loss=0.2167, over 4261330.01 frames. 
], batch size: 211, lr: 4.38e-02, grad_scale: 16.0 2023-06-17 18:33:13,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=15360.0, ans=0.4304 2023-06-17 18:34:28,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=15480.0, ans=0.002166666666666671 2023-06-17 18:34:36,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=15480.0, ans=0.002166666666666671 2023-06-17 18:34:41,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=15540.0, ans=0.35609999999999997 2023-06-17 18:34:51,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=15540.0, ans=0.1 2023-06-17 18:35:02,570 INFO [train.py:996] (2/4) Epoch 1, batch 2600, loss[loss=0.4459, simple_loss=0.4489, pruned_loss=0.2214, over 21798.00 frames. ], tot_loss[loss=0.443, simple_loss=0.4514, pruned_loss=0.2173, over 4258545.42 frames. ], batch size: 247, lr: 4.37e-02, grad_scale: 16.0 2023-06-17 18:35:08,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=15600.0, ans=0.07 2023-06-17 18:35:13,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15600.0, ans=0.14400000000000002 2023-06-17 18:35:16,864 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 4.497e+02 6.428e+02 1.038e+03 2.322e+03, threshold=1.286e+03, percent-clipped=17.0 2023-06-17 18:36:08,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=13.395 2023-06-17 18:37:03,976 INFO [train.py:996] (2/4) Epoch 1, batch 2650, loss[loss=0.3992, simple_loss=0.4125, pruned_loss=0.1929, over 21524.00 frames. ], tot_loss[loss=0.4409, simple_loss=0.4503, pruned_loss=0.2158, over 4258678.70 frames. ], batch size: 194, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:37:36,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=15960.0, ans=0.125 2023-06-17 18:37:41,988 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.37 vs. limit=9.004999999999999 2023-06-17 18:38:06,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.31 vs. limit=13.5075 2023-06-17 18:38:33,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=13.530000000000001 2023-06-17 18:38:52,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=16140.0, ans=0.125 2023-06-17 18:39:09,546 INFO [train.py:996] (2/4) Epoch 1, batch 2700, loss[loss=0.3841, simple_loss=0.3974, pruned_loss=0.1854, over 21834.00 frames. ], tot_loss[loss=0.4307, simple_loss=0.4419, pruned_loss=0.2098, over 4257628.06 frames. 
], batch size: 118, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:39:10,510 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.29 vs. limit=13.1 2023-06-17 18:39:17,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=13.575 2023-06-17 18:39:28,088 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 4.133e+02 5.896e+02 7.988e+02 2.040e+03, threshold=1.179e+03, percent-clipped=10.0 2023-06-17 18:39:45,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=16260.0, ans=0.125 2023-06-17 18:40:33,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=16380.0, ans=0.125 2023-06-17 18:41:13,165 INFO [train.py:996] (2/4) Epoch 1, batch 2750, loss[loss=0.5148, simple_loss=0.4795, pruned_loss=0.2751, over 21737.00 frames. ], tot_loss[loss=0.4292, simple_loss=0.4414, pruned_loss=0.2086, over 4259007.85 frames. ], batch size: 508, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:42:05,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=16560.0, ans=0.125 2023-06-17 18:43:35,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=16740.0, ans=0.125 2023-06-17 18:43:47,054 INFO [train.py:996] (2/4) Epoch 1, batch 2800, loss[loss=0.4146, simple_loss=0.4249, pruned_loss=0.2022, over 21130.00 frames. ], tot_loss[loss=0.4353, simple_loss=0.4477, pruned_loss=0.2115, over 4260751.64 frames. ], batch size: 143, lr: 4.36e-02, grad_scale: 16.0 2023-06-17 18:43:49,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.31 vs. limit=20.1 2023-06-17 18:44:10,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=16800.0, ans=0.31200000000000006 2023-06-17 18:44:12,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=16800.0, ans=0.07 2023-06-17 18:44:18,557 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 4.703e+02 6.814e+02 1.223e+03 2.130e+03, threshold=1.363e+03, percent-clipped=25.0 2023-06-17 18:45:21,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=10.792 2023-06-17 18:45:50,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=17040.0, ans=0.125 2023-06-17 18:46:05,394 INFO [train.py:996] (2/4) Epoch 1, batch 2850, loss[loss=0.487, simple_loss=0.4811, pruned_loss=0.2465, over 21436.00 frames. ], tot_loss[loss=0.4336, simple_loss=0.4452, pruned_loss=0.211, over 4259394.85 frames. ], batch size: 507, lr: 4.35e-02, grad_scale: 16.0 2023-06-17 18:47:26,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17280.0, ans=0.1272 2023-06-17 18:48:05,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. 
limit=14.002500000000001 2023-06-17 18:48:20,745 INFO [train.py:996] (2/4) Epoch 1, batch 2900, loss[loss=0.4931, simple_loss=0.5161, pruned_loss=0.2351, over 20829.00 frames. ], tot_loss[loss=0.4302, simple_loss=0.4417, pruned_loss=0.2094, over 4259397.35 frames. ], batch size: 607, lr: 4.35e-02, grad_scale: 16.0 2023-06-17 18:48:24,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17400.0, ans=0.126 2023-06-17 18:48:48,692 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 4.392e+02 5.988e+02 8.416e+02 1.775e+03, threshold=1.198e+03, percent-clipped=6.0 2023-06-17 18:48:55,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=17460.0, ans=0.007073913043478261 2023-06-17 18:49:43,361 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=14.07 2023-06-17 18:49:50,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=17580.0, ans=10.0 2023-06-17 18:50:24,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17640.0, ans=0.12360000000000002 2023-06-17 18:50:57,610 INFO [train.py:996] (2/4) Epoch 1, batch 2950, loss[loss=0.3981, simple_loss=0.4486, pruned_loss=0.1738, over 21799.00 frames. ], tot_loss[loss=0.4291, simple_loss=0.4425, pruned_loss=0.2078, over 4272902.60 frames. ], batch size: 298, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:51:00,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=17700.0, ans=0.125 2023-06-17 18:51:02,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.50 vs. limit=20.775 2023-06-17 18:51:27,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=14.16 2023-06-17 18:51:51,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=17820.0, ans=0.0069956521739130435 2023-06-17 18:51:57,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17820.0, ans=0.12180000000000002 2023-06-17 18:51:57,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=17820.0, ans=0.125 2023-06-17 18:52:35,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=17940.0, ans=0.125 2023-06-17 18:53:14,233 INFO [train.py:996] (2/4) Epoch 1, batch 3000, loss[loss=0.4868, simple_loss=0.4923, pruned_loss=0.2406, over 21537.00 frames. ], tot_loss[loss=0.4325, simple_loss=0.447, pruned_loss=0.209, over 4277182.50 frames. 
], batch size: 441, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:53:14,234 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-17 18:54:01,786 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.4115, 4.4216, 4.3520, 4.4050], device='cuda:2') 2023-06-17 18:54:05,137 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3426, simple_loss=0.4236, pruned_loss=0.1308, over 1796401.00 frames. 2023-06-17 18:54:05,138 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-17 18:54:20,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18000.0, ans=0.12000000000000002 2023-06-17 18:54:33,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 4.493e+02 5.938e+02 7.860e+02 2.320e+03, threshold=1.188e+03, percent-clipped=8.0 2023-06-17 18:55:24,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=18180.0, ans=0.26370000000000005 2023-06-17 18:55:57,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.71 vs. limit=14.34 2023-06-17 18:56:05,091 INFO [train.py:996] (2/4) Epoch 1, batch 3050, loss[loss=0.3337, simple_loss=0.3764, pruned_loss=0.1455, over 21625.00 frames. ], tot_loss[loss=0.4276, simple_loss=0.4448, pruned_loss=0.2052, over 4280703.36 frames. ], batch size: 230, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:56:44,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=18360.0, ans=21.27 2023-06-17 18:56:45,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=18360.0, ans=0.125 2023-06-17 18:57:14,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=14.407499999999999 2023-06-17 18:57:33,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.28 vs. limit=14.43 2023-06-17 18:57:46,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=14.43 2023-06-17 18:58:29,010 INFO [train.py:996] (2/4) Epoch 1, batch 3100, loss[loss=0.3779, simple_loss=0.3998, pruned_loss=0.1781, over 21455.00 frames. ], tot_loss[loss=0.4238, simple_loss=0.4428, pruned_loss=0.2024, over 4279730.11 frames. ], batch size: 211, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:58:44,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. 
limit=14.475 2023-06-17 18:58:50,767 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.527e+02 3.860e+02 4.806e+02 7.218e+02 1.901e+03, threshold=9.611e+02, percent-clipped=6.0 2023-06-17 18:59:29,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18720.0, ans=0.11280000000000001 2023-06-17 18:59:59,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=18780.0, ans=0.24270000000000003 2023-06-17 19:00:14,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=18840.0, ans=0.125 2023-06-17 19:00:33,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=18840.0, ans=0.00677391304347826 2023-06-17 19:00:55,770 INFO [train.py:996] (2/4) Epoch 1, batch 3150, loss[loss=0.5482, simple_loss=0.5737, pruned_loss=0.2613, over 21260.00 frames. ], tot_loss[loss=0.4272, simple_loss=0.4452, pruned_loss=0.2046, over 4282480.29 frames. ], batch size: 548, lr: 4.32e-02, grad_scale: 8.0 2023-06-17 19:03:25,394 INFO [train.py:996] (2/4) Epoch 1, batch 3200, loss[loss=0.4096, simple_loss=0.4069, pruned_loss=0.2061, over 20075.00 frames. ], tot_loss[loss=0.4223, simple_loss=0.4419, pruned_loss=0.2014, over 4277774.42 frames. ], batch size: 707, lr: 4.32e-02, grad_scale: 16.0 2023-06-17 19:03:25,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=19200.0, ans=0.125 2023-06-17 19:03:28,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19200.0, ans=0.10800000000000001 2023-06-17 19:04:02,305 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 3.900e+02 5.632e+02 8.444e+02 2.494e+03, threshold=1.126e+03, percent-clipped=20.0 2023-06-17 19:05:03,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=14.7675 2023-06-17 19:05:18,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=19440.0, ans=0.1056 2023-06-17 19:05:52,669 INFO [train.py:996] (2/4) Epoch 1, batch 3250, loss[loss=0.3782, simple_loss=0.3818, pruned_loss=0.1873, over 21363.00 frames. ], tot_loss[loss=0.4228, simple_loss=0.4408, pruned_loss=0.2024, over 4277487.87 frames. ], batch size: 194, lr: 4.31e-02, grad_scale: 16.0 2023-06-17 19:06:19,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=19560.0, ans=0.0 2023-06-17 19:07:59,356 INFO [train.py:996] (2/4) Epoch 1, batch 3300, loss[loss=0.5037, simple_loss=0.4933, pruned_loss=0.2571, over 21771.00 frames. ], tot_loss[loss=0.4204, simple_loss=0.4377, pruned_loss=0.2016, over 4272532.15 frames. 
], batch size: 441, lr: 4.31e-02, grad_scale: 16.0 2023-06-17 19:08:06,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=19800.0, ans=0.035 2023-06-17 19:08:08,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=19800.0, ans=0.497 2023-06-17 19:08:20,954 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.802e+02 5.443e+02 8.160e+02 1.939e+03, threshold=1.089e+03, percent-clipped=11.0 2023-06-17 19:09:08,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=19920.0, ans=0.125 2023-06-17 19:09:25,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=14.9925 2023-06-17 19:09:27,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=19980.0, ans=0.125 2023-06-17 19:09:35,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.24 vs. limit=14.99 2023-06-17 19:10:20,566 INFO [train.py:996] (2/4) Epoch 1, batch 3350, loss[loss=0.412, simple_loss=0.4192, pruned_loss=0.2024, over 21052.00 frames. ], tot_loss[loss=0.4195, simple_loss=0.4395, pruned_loss=0.1998, over 4276697.61 frames. ], batch size: 607, lr: 4.30e-02, grad_scale: 8.0 2023-06-17 19:10:22,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=20100.0, ans=0.2 2023-06-17 19:10:24,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-17 19:10:45,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2023-06-17 19:11:12,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=20160.0, ans=0.125 2023-06-17 19:11:30,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=20220.0, ans=0.125 2023-06-17 19:11:31,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=20220.0, ans=0.125 2023-06-17 19:12:23,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=20340.0, ans=0.125 2023-06-17 19:12:43,831 INFO [train.py:996] (2/4) Epoch 1, batch 3400, loss[loss=0.3979, simple_loss=0.4224, pruned_loss=0.1867, over 21774.00 frames. ], tot_loss[loss=0.4205, simple_loss=0.4392, pruned_loss=0.201, over 4278276.90 frames. ], batch size: 351, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 19:13:03,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=20460.0, ans=0.125 2023-06-17 19:13:07,263 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 4.511e+02 6.532e+02 8.905e+02 1.651e+03, threshold=1.306e+03, percent-clipped=8.0 2023-06-17 19:13:50,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.36 vs. 
limit=6.0 2023-06-17 19:13:56,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=20580.0, ans=0.006395652173913044 2023-06-17 19:14:58,844 INFO [train.py:996] (2/4) Epoch 1, batch 3450, loss[loss=0.416, simple_loss=0.4181, pruned_loss=0.2069, over 21423.00 frames. ], tot_loss[loss=0.417, simple_loss=0.4349, pruned_loss=0.1996, over 4280145.15 frames. ], batch size: 389, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 19:15:53,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=20760.0, ans=0.125 2023-06-17 19:16:13,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20820.0, ans=0.1 2023-06-17 19:17:20,361 INFO [train.py:996] (2/4) Epoch 1, batch 3500, loss[loss=0.4726, simple_loss=0.4751, pruned_loss=0.235, over 21481.00 frames. ], tot_loss[loss=0.4271, simple_loss=0.4448, pruned_loss=0.2047, over 4275020.74 frames. ], batch size: 211, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 19:17:34,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=21000.0, ans=0.125 2023-06-17 19:17:58,121 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.542e+02 4.045e+02 5.374e+02 7.279e+02 2.253e+03, threshold=1.075e+03, percent-clipped=5.0 2023-06-17 19:19:34,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=21240.0, ans=0.125 2023-06-17 19:19:45,179 INFO [train.py:996] (2/4) Epoch 1, batch 3550, loss[loss=0.4315, simple_loss=0.43, pruned_loss=0.2165, over 21543.00 frames. ], tot_loss[loss=0.4283, simple_loss=0.4463, pruned_loss=0.2051, over 4281565.26 frames. ], batch size: 441, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 19:20:04,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=21300.0, ans=0.0 2023-06-17 19:20:17,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=21300.0, ans=0.5 2023-06-17 19:20:20,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-17 19:20:22,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.54 vs. 
limit=6.0 2023-06-17 19:20:27,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=21360.0, ans=0.0 2023-06-17 19:20:45,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21420.0, ans=0.1 2023-06-17 19:20:48,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21420.0, ans=0.1 2023-06-17 19:21:39,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=21480.0, ans=0.2 2023-06-17 19:21:55,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=21540.0, ans=0.125 2023-06-17 19:21:59,191 INFO [train.py:996] (2/4) Epoch 1, batch 3600, loss[loss=0.4524, simple_loss=0.448, pruned_loss=0.2284, over 21599.00 frames. ], tot_loss[loss=0.4234, simple_loss=0.44, pruned_loss=0.2034, over 4276500.75 frames. ], batch size: 415, lr: 4.27e-02, grad_scale: 16.0 2023-06-17 19:22:43,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 3.884e+02 5.320e+02 7.505e+02 1.580e+03, threshold=1.064e+03, percent-clipped=11.0 2023-06-17 19:22:45,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=21660.0, ans=0.125 2023-06-17 19:22:48,958 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.60 vs. limit=15.0 2023-06-17 19:22:50,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-17 19:24:43,922 INFO [train.py:996] (2/4) Epoch 1, batch 3650, loss[loss=0.384, simple_loss=0.4229, pruned_loss=0.1726, over 21835.00 frames. ], tot_loss[loss=0.4244, simple_loss=0.442, pruned_loss=0.2034, over 4264266.94 frames. ], batch size: 371, lr: 4.27e-02, grad_scale: 16.0 2023-06-17 19:25:38,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=22020.0, ans=0.125 2023-06-17 19:26:53,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=22140.0, ans=0.125 2023-06-17 19:26:58,879 INFO [train.py:996] (2/4) Epoch 1, batch 3700, loss[loss=0.4589, simple_loss=0.4687, pruned_loss=0.2245, over 21861.00 frames. ], tot_loss[loss=0.42, simple_loss=0.4389, pruned_loss=0.2005, over 4272965.51 frames. ], batch size: 414, lr: 4.26e-02, grad_scale: 16.0 2023-06-17 19:27:30,078 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 4.188e+02 5.889e+02 8.889e+02 2.124e+03, threshold=1.178e+03, percent-clipped=16.0 2023-06-17 19:27:39,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=22260.0, ans=0.0 2023-06-17 19:29:19,303 INFO [train.py:996] (2/4) Epoch 1, batch 3750, loss[loss=0.4494, simple_loss=0.4551, pruned_loss=0.2218, over 21738.00 frames. ], tot_loss[loss=0.414, simple_loss=0.4337, pruned_loss=0.1971, over 4280855.26 frames. 
], batch size: 441, lr: 4.26e-02, grad_scale: 16.0 2023-06-17 19:29:31,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22500.0, ans=0.1 2023-06-17 19:30:11,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=22560.0, ans=0.2 2023-06-17 19:30:30,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.66 vs. limit=22.5 2023-06-17 19:30:34,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22620.0, ans=0.1 2023-06-17 19:31:47,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.54 vs. limit=15.0 2023-06-17 19:31:54,519 INFO [train.py:996] (2/4) Epoch 1, batch 3800, loss[loss=0.4774, simple_loss=0.4906, pruned_loss=0.2321, over 21836.00 frames. ], tot_loss[loss=0.4093, simple_loss=0.4307, pruned_loss=0.1939, over 4277618.38 frames. ], batch size: 118, lr: 4.25e-02, grad_scale: 16.0 2023-06-17 19:32:10,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=22860.0, ans=0.2 2023-06-17 19:32:11,770 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.353e+02 4.457e+02 7.554e+02 3.391e+03, threshold=8.914e+02, percent-clipped=13.0 2023-06-17 19:33:17,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=22980.0, ans=0.2 2023-06-17 19:33:58,069 INFO [train.py:996] (2/4) Epoch 1, batch 3850, loss[loss=0.412, simple_loss=0.43, pruned_loss=0.197, over 20997.00 frames. ], tot_loss[loss=0.4094, simple_loss=0.4294, pruned_loss=0.1947, over 4259620.97 frames. ], batch size: 608, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 19:33:58,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=23100.0, ans=0.125 2023-06-17 19:35:05,843 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:35:41,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23340.0, ans=0.1 2023-06-17 19:36:07,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=23340.0, ans=0.015 2023-06-17 19:36:14,346 INFO [train.py:996] (2/4) Epoch 1, batch 3900, loss[loss=0.3872, simple_loss=0.4114, pruned_loss=0.1815, over 21590.00 frames. ], tot_loss[loss=0.4064, simple_loss=0.4261, pruned_loss=0.1934, over 4268244.77 frames. 
], batch size: 263, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 19:36:29,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=23400.0, ans=0.0 2023-06-17 19:36:30,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=23400.0, ans=0.125 2023-06-17 19:36:32,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=23400.0, ans=0.2 2023-06-17 19:36:38,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.460e+02 4.927e+02 7.077e+02 1.688e+03, threshold=9.853e+02, percent-clipped=16.0 2023-06-17 19:37:56,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=23580.0, ans=0.0 2023-06-17 19:37:57,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=23580.0, ans=0.125 2023-06-17 19:38:00,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=23580.0, ans=0.125 2023-06-17 19:38:38,972 INFO [train.py:996] (2/4) Epoch 1, batch 3950, loss[loss=0.3245, simple_loss=0.3614, pruned_loss=0.1438, over 21815.00 frames. ], tot_loss[loss=0.4022, simple_loss=0.424, pruned_loss=0.1902, over 4271602.06 frames. ], batch size: 124, lr: 4.23e-02, grad_scale: 8.0 2023-06-17 19:38:56,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=23760.0, ans=0.125 2023-06-17 19:39:21,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-17 19:41:01,645 INFO [train.py:996] (2/4) Epoch 1, batch 4000, loss[loss=0.3103, simple_loss=0.3423, pruned_loss=0.1391, over 21839.00 frames. ], tot_loss[loss=0.3935, simple_loss=0.4167, pruned_loss=0.1852, over 4266067.71 frames. ], batch size: 98, lr: 4.23e-02, grad_scale: 16.0 2023-06-17 19:41:16,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=24060.0, ans=0.125 2023-06-17 19:41:35,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 3.734e+02 4.906e+02 6.607e+02 1.436e+03, threshold=9.812e+02, percent-clipped=4.0 2023-06-17 19:41:58,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24120.0, ans=0.1 2023-06-17 19:42:06,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=24120.0, ans=0.07 2023-06-17 19:42:13,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=24180.0, ans=0.05 2023-06-17 19:43:23,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=24240.0, ans=0.125 2023-06-17 19:43:26,360 INFO [train.py:996] (2/4) Epoch 1, batch 4050, loss[loss=0.3492, simple_loss=0.3779, pruned_loss=0.1603, over 21378.00 frames. ], tot_loss[loss=0.3907, simple_loss=0.4157, pruned_loss=0.1828, over 4258305.71 frames. 
], batch size: 131, lr: 4.22e-02, grad_scale: 4.0 2023-06-17 19:43:32,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=24300.0, ans=0.005586956521739131 2023-06-17 19:43:33,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=24300.0, ans=0.125 2023-06-17 19:44:35,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=24420.0, ans=0.125 2023-06-17 19:44:59,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=24480.0, ans=0.0055478260869565215 2023-06-17 19:45:31,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=24540.0, ans=0.005534782608695652 2023-06-17 19:45:42,498 INFO [train.py:996] (2/4) Epoch 1, batch 4100, loss[loss=0.3976, simple_loss=0.4344, pruned_loss=0.1804, over 21686.00 frames. ], tot_loss[loss=0.3912, simple_loss=0.4165, pruned_loss=0.1829, over 4266525.44 frames. ], batch size: 389, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 19:46:02,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24600.0, ans=0.1 2023-06-17 19:46:04,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=24600.0, ans=0.04949747468305833 2023-06-17 19:46:34,092 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.802e+02 5.077e+02 7.572e+02 1.841e+03, threshold=1.015e+03, percent-clipped=11.0 2023-06-17 19:47:52,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24840.0, ans=0.1 2023-06-17 19:48:02,817 INFO [train.py:996] (2/4) Epoch 1, batch 4150, loss[loss=0.2821, simple_loss=0.3525, pruned_loss=0.1058, over 21209.00 frames. ], tot_loss[loss=0.3814, simple_loss=0.4131, pruned_loss=0.1749, over 4270827.19 frames. ], batch size: 176, lr: 4.21e-02, grad_scale: 8.0 2023-06-17 19:49:31,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25080.0, ans=0.1 2023-06-17 19:49:33,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25080.0, ans=0.1 2023-06-17 19:50:32,236 INFO [train.py:996] (2/4) Epoch 1, batch 4200, loss[loss=0.4576, simple_loss=0.503, pruned_loss=0.2061, over 21597.00 frames. ], tot_loss[loss=0.3807, simple_loss=0.4124, pruned_loss=0.1745, over 4265608.67 frames. ], batch size: 389, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:50:53,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=25200.0, ans=0.0 2023-06-17 19:51:03,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=25200.0, ans=0.0 2023-06-17 19:51:08,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.72 vs. 
limit=22.5 2023-06-17 19:51:18,828 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.271e+02 4.382e+02 6.312e+02 1.234e+03, threshold=8.764e+02, percent-clipped=8.0 2023-06-17 19:52:10,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=25380.0, ans=0.0053521739130434785 2023-06-17 19:52:30,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=25440.0, ans=0.125 2023-06-17 19:52:41,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=25440.0, ans=0.0 2023-06-17 19:52:50,448 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:53:04,744 INFO [train.py:996] (2/4) Epoch 1, batch 4250, loss[loss=0.5771, simple_loss=0.5525, pruned_loss=0.3009, over 21331.00 frames. ], tot_loss[loss=0.3889, simple_loss=0.4203, pruned_loss=0.1787, over 4261999.07 frames. ], batch size: 507, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:53:30,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=25500.0, ans=0.0 2023-06-17 19:53:34,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=25560.0, ans=0.125 2023-06-17 19:53:45,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=25560.0, ans=0.125 2023-06-17 19:54:17,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=25620.0, ans=0.125 2023-06-17 19:54:33,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-17 19:55:33,832 INFO [train.py:996] (2/4) Epoch 1, batch 4300, loss[loss=0.4296, simple_loss=0.4326, pruned_loss=0.2133, over 21544.00 frames. ], tot_loss[loss=0.3983, simple_loss=0.429, pruned_loss=0.1838, over 4266832.90 frames. ], batch size: 548, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:55:38,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=25800.0, ans=0.125 2023-06-17 19:55:52,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=25800.0, ans=0.2 2023-06-17 19:56:30,573 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 4.361e+02 6.749e+02 9.023e+02 1.594e+03, threshold=1.350e+03, percent-clipped=28.0 2023-06-17 19:56:37,162 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:56:48,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=25920.0, ans=0.0 2023-06-17 19:57:29,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=25980.0, ans=0.125 2023-06-17 19:57:53,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26040.0, ans=0.1 2023-06-17 19:58:04,456 INFO [train.py:996] (2/4) Epoch 1, batch 4350, loss[loss=0.3508, simple_loss=0.3738, pruned_loss=0.1639, over 21628.00 frames. 
], tot_loss[loss=0.3942, simple_loss=0.4252, pruned_loss=0.1816, over 4272130.52 frames. ], batch size: 282, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:58:30,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. limit=6.0 2023-06-17 19:58:40,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=26220.0, ans=0.2 2023-06-17 19:58:42,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.02 vs. limit=22.5 2023-06-17 19:59:15,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26220.0, ans=0.1 2023-06-17 20:00:06,830 INFO [train.py:996] (2/4) Epoch 1, batch 4400, loss[loss=0.3455, simple_loss=0.3873, pruned_loss=0.1519, over 21695.00 frames. ], tot_loss[loss=0.3937, simple_loss=0.4228, pruned_loss=0.1823, over 4265558.76 frames. ], batch size: 247, lr: 4.18e-02, grad_scale: 16.0 2023-06-17 20:00:07,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=26400.0, ans=0.125 2023-06-17 20:00:53,224 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 3.739e+02 5.540e+02 6.939e+02 1.405e+03, threshold=1.108e+03, percent-clipped=1.0 2023-06-17 20:01:35,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=26520.0, ans=0.0 2023-06-17 20:02:32,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.12 vs. limit=15.0 2023-06-17 20:02:35,561 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=12.0 2023-06-17 20:02:38,870 INFO [train.py:996] (2/4) Epoch 1, batch 4450, loss[loss=0.539, simple_loss=0.5472, pruned_loss=0.2654, over 21530.00 frames. ], tot_loss[loss=0.3978, simple_loss=0.4298, pruned_loss=0.1829, over 4266312.50 frames. ], batch size: 471, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 20:02:50,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=26700.0, ans=10.0 2023-06-17 20:04:08,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=26880.0, ans=0.125 2023-06-17 20:04:43,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=26940.0, ans=0.0 2023-06-17 20:04:46,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=27000.0, ans=0.125 2023-06-17 20:04:47,886 INFO [train.py:996] (2/4) Epoch 1, batch 4500, loss[loss=0.3643, simple_loss=0.4095, pruned_loss=0.1596, over 21211.00 frames. ], tot_loss[loss=0.4008, simple_loss=0.4313, pruned_loss=0.1852, over 4275388.88 frames. 
], batch size: 159, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 20:04:51,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=27000.0, ans=0.125 2023-06-17 20:05:17,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=27060.0, ans=0.004986956521739131 2023-06-17 20:05:36,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=27060.0, ans=0.125 2023-06-17 20:05:40,586 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.441e+02 4.481e+02 5.907e+02 7.861e+02 1.389e+03, threshold=1.181e+03, percent-clipped=9.0 2023-06-17 20:07:03,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=27240.0, ans=0.004947826086956522 2023-06-17 20:07:14,396 INFO [train.py:996] (2/4) Epoch 1, batch 4550, loss[loss=0.5116, simple_loss=0.5114, pruned_loss=0.2559, over 21826.00 frames. ], tot_loss[loss=0.4024, simple_loss=0.4344, pruned_loss=0.1852, over 4278467.39 frames. ], batch size: 441, lr: 4.16e-02, grad_scale: 4.0 2023-06-17 20:08:04,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-17 20:09:37,114 INFO [train.py:996] (2/4) Epoch 1, batch 4600, loss[loss=0.3232, simple_loss=0.377, pruned_loss=0.1347, over 21622.00 frames. ], tot_loss[loss=0.4039, simple_loss=0.436, pruned_loss=0.1858, over 4279695.14 frames. ], batch size: 263, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 20:10:27,797 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.841e+02 4.647e+02 5.664e+02 1.586e+03, threshold=9.294e+02, percent-clipped=2.0 2023-06-17 20:10:28,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=27660.0, ans=0.0 2023-06-17 20:11:15,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27780.0, ans=0.1 2023-06-17 20:11:15,855 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2023-06-17 20:11:19,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=27780.0, ans=0.125 2023-06-17 20:12:02,333 INFO [train.py:996] (2/4) Epoch 1, batch 4650, loss[loss=0.3766, simple_loss=0.3784, pruned_loss=0.1874, over 20313.00 frames. ], tot_loss[loss=0.3916, simple_loss=0.4242, pruned_loss=0.1795, over 4276061.82 frames. 
], batch size: 703, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 20:12:04,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=27900.0, ans=0.125 2023-06-17 20:13:33,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=28020.0, ans=0.125 2023-06-17 20:13:51,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=28080.0, ans=0.004765217391304348 2023-06-17 20:13:53,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=28080.0, ans=0.125 2023-06-17 20:14:00,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=28140.0, ans=0.0 2023-06-17 20:14:13,956 INFO [train.py:996] (2/4) Epoch 1, batch 4700, loss[loss=0.3324, simple_loss=0.3554, pruned_loss=0.1547, over 21465.00 frames. ], tot_loss[loss=0.382, simple_loss=0.4137, pruned_loss=0.1752, over 4270987.33 frames. ], batch size: 195, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 20:14:34,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=28200.0, ans=0.004739130434782609 2023-06-17 20:14:52,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=28260.0, ans=0.125 2023-06-17 20:15:16,494 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 4.151e+02 4.936e+02 6.766e+02 1.742e+03, threshold=9.871e+02, percent-clipped=9.0 2023-06-17 20:15:38,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=28320.0, ans=0.004713043478260869 2023-06-17 20:16:19,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=28440.0, ans=0.2 2023-06-17 20:16:31,464 INFO [train.py:996] (2/4) Epoch 1, batch 4750, loss[loss=0.4205, simple_loss=0.4261, pruned_loss=0.2074, over 21865.00 frames. ], tot_loss[loss=0.3794, simple_loss=0.4081, pruned_loss=0.1753, over 4264906.68 frames. ], batch size: 371, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 20:17:47,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=28620.0, ans=0.125 2023-06-17 20:17:57,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=28620.0, ans=10.0 2023-06-17 20:18:52,087 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:19:00,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=28740.0, ans=0.0 2023-06-17 20:19:08,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2023-06-17 20:19:08,691 INFO [train.py:996] (2/4) Epoch 1, batch 4800, loss[loss=0.3479, simple_loss=0.3807, pruned_loss=0.1575, over 21635.00 frames. ], tot_loss[loss=0.3825, simple_loss=0.4106, pruned_loss=0.1772, over 4265324.30 frames. 
], batch size: 247, lr: 4.13e-02, grad_scale: 16.0 2023-06-17 20:19:33,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 4.275e+02 5.086e+02 6.755e+02 1.816e+03, threshold=1.017e+03, percent-clipped=8.0 2023-06-17 20:20:23,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=28980.0, ans=0.0 2023-06-17 20:21:04,165 INFO [train.py:996] (2/4) Epoch 1, batch 4850, loss[loss=0.3598, simple_loss=0.3909, pruned_loss=0.1643, over 21434.00 frames. ], tot_loss[loss=0.3819, simple_loss=0.4089, pruned_loss=0.1775, over 4266950.23 frames. ], batch size: 211, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 20:21:24,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2023-06-17 20:21:42,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=29160.0, ans=6.0 2023-06-17 20:21:46,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29160.0, ans=0.1 2023-06-17 20:22:40,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.60 vs. limit=15.0 2023-06-17 20:22:44,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=29280.0, ans=0.125 2023-06-17 20:22:45,921 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:23:32,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29340.0, ans=0.1 2023-06-17 20:23:37,867 INFO [train.py:996] (2/4) Epoch 1, batch 4900, loss[loss=0.3654, simple_loss=0.4152, pruned_loss=0.1578, over 21332.00 frames. ], tot_loss[loss=0.3861, simple_loss=0.4132, pruned_loss=0.1795, over 4272428.24 frames. ], batch size: 176, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 20:24:01,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=29460.0, ans=0.125 2023-06-17 20:24:04,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=29460.0, ans=0.125 2023-06-17 20:24:08,468 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.531e+02 3.410e+02 4.347e+02 5.276e+02 1.356e+03, threshold=8.693e+02, percent-clipped=2.0 2023-06-17 20:25:52,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.93 vs. limit=22.5 2023-06-17 20:25:57,160 INFO [train.py:996] (2/4) Epoch 1, batch 4950, loss[loss=0.3231, simple_loss=0.3849, pruned_loss=0.1307, over 21284.00 frames. ], tot_loss[loss=0.3844, simple_loss=0.4161, pruned_loss=0.1764, over 4274641.80 frames. 
], batch size: 176, lr: 4.11e-02, grad_scale: 16.0 2023-06-17 20:26:02,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=29700.0, ans=0.125 2023-06-17 20:26:03,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29700.0, ans=0.1 2023-06-17 20:26:03,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=29700.0, ans=0.0 2023-06-17 20:27:10,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=29880.0, ans=0.1 2023-06-17 20:27:56,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-17 20:28:07,655 INFO [train.py:996] (2/4) Epoch 1, batch 5000, loss[loss=0.2895, simple_loss=0.3615, pruned_loss=0.1088, over 21476.00 frames. ], tot_loss[loss=0.3794, simple_loss=0.4145, pruned_loss=0.1722, over 4273389.33 frames. ], batch size: 194, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 20:28:41,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30060.0, ans=0.1 2023-06-17 20:28:42,515 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.507e+02 4.413e+02 5.456e+02 1.135e+03, threshold=8.826e+02, percent-clipped=2.0 2023-06-17 20:29:18,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30120.0, ans=0.1 2023-06-17 20:29:35,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=30180.0, ans=0.0 2023-06-17 20:29:36,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30180.0, ans=0.1 2023-06-17 20:30:11,510 INFO [train.py:996] (2/4) Epoch 1, batch 5050, loss[loss=0.3905, simple_loss=0.4178, pruned_loss=0.1816, over 21346.00 frames. ], tot_loss[loss=0.3805, simple_loss=0.4144, pruned_loss=0.1733, over 4269650.33 frames. 
], batch size: 159, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 20:30:13,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=30300.0, ans=0.125 2023-06-17 20:30:35,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=30300.0, ans=0.125 2023-06-17 20:30:38,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=30300.0, ans=0.125 2023-06-17 20:30:38,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=30300.0, ans=0.125 2023-06-17 20:31:01,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30360.0, ans=0.1 2023-06-17 20:31:03,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=30360.0, ans=0.2 2023-06-17 20:31:31,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30480.0, ans=0.1 2023-06-17 20:31:40,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30480.0, ans=0.1 2023-06-17 20:32:24,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=30540.0, ans=0.2 2023-06-17 20:32:32,733 INFO [train.py:996] (2/4) Epoch 1, batch 5100, loss[loss=0.4567, simple_loss=0.5408, pruned_loss=0.1863, over 19669.00 frames. ], tot_loss[loss=0.3842, simple_loss=0.4176, pruned_loss=0.1755, over 4272306.31 frames. ], batch size: 702, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 20:32:57,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=30600.0, ans=0.0 2023-06-17 20:33:25,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.491e+02 3.812e+02 4.644e+02 6.399e+02 1.305e+03, threshold=9.287e+02, percent-clipped=10.0 2023-06-17 20:34:31,283 INFO [train.py:996] (2/4) Epoch 1, batch 5150, loss[loss=0.4575, simple_loss=0.5316, pruned_loss=0.1917, over 19788.00 frames. ], tot_loss[loss=0.3843, simple_loss=0.417, pruned_loss=0.1758, over 4278742.27 frames. ], batch size: 702, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 20:34:54,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=30960.0, ans=0.0 2023-06-17 20:35:39,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=31020.0, ans=0.125 2023-06-17 20:36:41,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=31140.0, ans=0.2 2023-06-17 20:36:41,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-17 20:36:52,078 INFO [train.py:996] (2/4) Epoch 1, batch 5200, loss[loss=0.3088, simple_loss=0.3517, pruned_loss=0.133, over 21287.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.4131, pruned_loss=0.1734, over 4267712.91 frames. 
], batch size: 176, lr: 4.08e-02, grad_scale: 32.0 2023-06-17 20:37:49,920 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.764e+02 3.911e+02 4.920e+02 6.306e+02 1.130e+03, threshold=9.840e+02, percent-clipped=5.0 2023-06-17 20:37:50,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=31260.0, ans=0.0 2023-06-17 20:37:55,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.15 vs. limit=10.0 2023-06-17 20:38:54,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=31440.0, ans=0.125 2023-06-17 20:39:01,686 INFO [train.py:996] (2/4) Epoch 1, batch 5250, loss[loss=0.3866, simple_loss=0.4522, pruned_loss=0.1605, over 21726.00 frames. ], tot_loss[loss=0.3806, simple_loss=0.4173, pruned_loss=0.172, over 4273513.94 frames. ], batch size: 351, lr: 4.07e-02, grad_scale: 32.0 2023-06-17 20:39:02,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31500.0, ans=0.1 2023-06-17 20:40:00,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31560.0, ans=0.1 2023-06-17 20:40:06,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=31560.0, ans=0.125 2023-06-17 20:40:20,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=31620.0, ans=0.2 2023-06-17 20:40:23,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31620.0, ans=0.1 2023-06-17 20:41:23,878 INFO [train.py:996] (2/4) Epoch 1, batch 5300, loss[loss=0.3832, simple_loss=0.4033, pruned_loss=0.1815, over 21895.00 frames. ], tot_loss[loss=0.3823, simple_loss=0.4173, pruned_loss=0.1736, over 4279908.35 frames. ], batch size: 298, lr: 4.07e-02, grad_scale: 32.0 2023-06-17 20:41:28,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=31800.0, ans=0.125 2023-06-17 20:42:02,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.98 vs. limit=22.5 2023-06-17 20:42:26,440 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.571e+02 4.372e+02 6.573e+02 1.564e+03, threshold=8.743e+02, percent-clipped=7.0 2023-06-17 20:43:10,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=31980.0, ans=10.0 2023-06-17 20:43:47,264 INFO [train.py:996] (2/4) Epoch 1, batch 5350, loss[loss=0.4458, simple_loss=0.5198, pruned_loss=0.1859, over 19588.00 frames. ], tot_loss[loss=0.3837, simple_loss=0.4167, pruned_loss=0.1754, over 4282332.53 frames. 
], batch size: 702, lr: 4.06e-02, grad_scale: 32.0 2023-06-17 20:43:49,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=32100.0, ans=0.0 2023-06-17 20:44:48,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32220.0, ans=0.1 2023-06-17 20:44:54,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-17 20:45:26,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=32280.0, ans=0.125 2023-06-17 20:45:27,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=32280.0, ans=0.125 2023-06-17 20:45:38,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-17 20:46:12,772 INFO [train.py:996] (2/4) Epoch 1, batch 5400, loss[loss=0.3645, simple_loss=0.3944, pruned_loss=0.1672, over 21049.00 frames. ], tot_loss[loss=0.3842, simple_loss=0.4155, pruned_loss=0.1764, over 4279504.00 frames. ], batch size: 607, lr: 4.05e-02, grad_scale: 32.0 2023-06-17 20:46:39,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=32400.0, ans=0.125 2023-06-17 20:46:57,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 3.975e+02 4.705e+02 6.164e+02 1.546e+03, threshold=9.411e+02, percent-clipped=5.0 2023-06-17 20:47:31,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=32580.0, ans=0.125 2023-06-17 20:47:44,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=32640.0, ans=0.2 2023-06-17 20:47:47,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=32640.0, ans=0.125 2023-06-17 20:48:16,844 INFO [train.py:996] (2/4) Epoch 1, batch 5450, loss[loss=0.2957, simple_loss=0.3528, pruned_loss=0.1193, over 21453.00 frames. ], tot_loss[loss=0.3813, simple_loss=0.4143, pruned_loss=0.1741, over 4282522.94 frames. ], batch size: 212, lr: 4.05e-02, grad_scale: 32.0 2023-06-17 20:48:48,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=32760.0, ans=0.003747826086956521 2023-06-17 20:49:14,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=32820.0, ans=0.125 2023-06-17 20:49:41,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=32880.0, ans=0.0037217391304347827 2023-06-17 20:50:37,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=32940.0, ans=0.09899494936611666 2023-06-17 20:50:39,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=33000.0, ans=0.125 2023-06-17 20:50:39,974 INFO [train.py:996] (2/4) Epoch 1, batch 5500, loss[loss=0.3927, simple_loss=0.474, pruned_loss=0.1557, over 21769.00 frames. 
], tot_loss[loss=0.3782, simple_loss=0.418, pruned_loss=0.1691, over 4286821.07 frames. ], batch size: 351, lr: 4.04e-02, grad_scale: 32.0 2023-06-17 20:51:16,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=33060.0, ans=0.0 2023-06-17 20:51:30,002 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.468e+02 4.497e+02 5.642e+02 1.011e+03, threshold=8.995e+02, percent-clipped=2.0 2023-06-17 20:52:12,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=33180.0, ans=0.125 2023-06-17 20:52:24,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=33180.0, ans=0.003656521739130435 2023-06-17 20:53:14,078 INFO [train.py:996] (2/4) Epoch 1, batch 5550, loss[loss=0.2548, simple_loss=0.3103, pruned_loss=0.09961, over 21735.00 frames. ], tot_loss[loss=0.3686, simple_loss=0.413, pruned_loss=0.1621, over 4290253.28 frames. ], batch size: 124, lr: 4.03e-02, grad_scale: 32.0 2023-06-17 20:53:38,115 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:54:05,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-17 20:54:11,579 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=8.0 2023-06-17 20:55:05,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=33540.0, ans=0.125 2023-06-17 20:55:14,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=33540.0, ans=0.2 2023-06-17 20:55:30,607 INFO [train.py:996] (2/4) Epoch 1, batch 5600, loss[loss=0.2863, simple_loss=0.3515, pruned_loss=0.1106, over 21281.00 frames. ], tot_loss[loss=0.3632, simple_loss=0.41, pruned_loss=0.1582, over 4286837.42 frames. ], batch size: 176, lr: 4.03e-02, grad_scale: 32.0 2023-06-17 20:56:05,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.459e+02 4.877e+02 6.636e+02 1.371e+03, threshold=9.753e+02, percent-clipped=8.0 2023-06-17 20:56:21,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=33720.0, ans=0.125 2023-06-17 20:56:44,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-17 20:57:02,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33780.0, ans=0.1 2023-06-17 20:57:24,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=33840.0, ans=0.125 2023-06-17 20:57:49,982 INFO [train.py:996] (2/4) Epoch 1, batch 5650, loss[loss=0.3967, simple_loss=0.4157, pruned_loss=0.1889, over 21895.00 frames. ], tot_loss[loss=0.366, simple_loss=0.4109, pruned_loss=0.1606, over 4282477.91 frames. 
], batch size: 316, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 20:57:50,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=33900.0, ans=0.125 2023-06-17 21:00:01,363 INFO [train.py:996] (2/4) Epoch 1, batch 5700, loss[loss=0.4059, simple_loss=0.4468, pruned_loss=0.1824, over 21573.00 frames. ], tot_loss[loss=0.369, simple_loss=0.4108, pruned_loss=0.1636, over 4278834.52 frames. ], batch size: 441, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 21:00:40,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=34260.0, ans=0.2 2023-06-17 21:01:02,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.613e+02 3.406e+02 4.056e+02 5.524e+02 1.397e+03, threshold=8.113e+02, percent-clipped=5.0 2023-06-17 21:01:10,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=34320.0, ans=0.025 2023-06-17 21:01:49,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=34380.0, ans=0.125 2023-06-17 21:02:14,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=34440.0, ans=0.02 2023-06-17 21:02:15,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=34440.0, ans=0.125 2023-06-17 21:02:26,272 INFO [train.py:996] (2/4) Epoch 1, batch 5750, loss[loss=0.3162, simple_loss=0.3833, pruned_loss=0.1245, over 21727.00 frames. ], tot_loss[loss=0.3631, simple_loss=0.4066, pruned_loss=0.1598, over 4279494.99 frames. ], batch size: 351, lr: 4.01e-02, grad_scale: 32.0 2023-06-17 21:03:01,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=34560.0, ans=0.09899494936611666 2023-06-17 21:04:13,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-17 21:04:53,013 INFO [train.py:996] (2/4) Epoch 1, batch 5800, loss[loss=0.3882, simple_loss=0.409, pruned_loss=0.1837, over 21208.00 frames. ], tot_loss[loss=0.3595, simple_loss=0.4053, pruned_loss=0.1568, over 4286212.36 frames. ], batch size: 607, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 21:05:11,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=34860.0, ans=0.0032913043478260866 2023-06-17 21:05:33,095 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.587e+02 3.596e+02 4.280e+02 5.879e+02 1.064e+03, threshold=8.560e+02, percent-clipped=6.0 2023-06-17 21:06:04,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=34920.0, ans=0.0 2023-06-17 21:06:09,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.51 vs. limit=10.0 2023-06-17 21:07:15,162 INFO [train.py:996] (2/4) Epoch 1, batch 5850, loss[loss=0.2182, simple_loss=0.2916, pruned_loss=0.07236, over 21746.00 frames. ], tot_loss[loss=0.3492, simple_loss=0.4004, pruned_loss=0.149, over 4284701.06 frames. 
], batch size: 124, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 21:07:18,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=35100.0, ans=0.125 2023-06-17 21:07:38,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=35160.0, ans=10.0 2023-06-17 21:08:15,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=35220.0, ans=0.125 2023-06-17 21:08:35,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=35280.0, ans=0.02 2023-06-17 21:09:01,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-17 21:09:28,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=35400.0, ans=0.125 2023-06-17 21:09:29,299 INFO [train.py:996] (2/4) Epoch 1, batch 5900, loss[loss=0.4402, simple_loss=0.4399, pruned_loss=0.2203, over 21749.00 frames. ], tot_loss[loss=0.3335, simple_loss=0.3882, pruned_loss=0.1394, over 4286899.13 frames. ], batch size: 441, lr: 3.99e-02, grad_scale: 32.0 2023-06-17 21:09:41,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.13 vs. limit=8.0 2023-06-17 21:10:06,012 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.990e+02 3.626e+02 5.703e+02 1.926e+03, threshold=7.252e+02, percent-clipped=11.0 2023-06-17 21:10:31,765 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=15.0 2023-06-17 21:10:32,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=35580.0, ans=0.2 2023-06-17 21:10:34,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=35580.0, ans=0.125 2023-06-17 21:10:37,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=12.0 2023-06-17 21:10:44,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=23.98 vs. limit=15.0 2023-06-17 21:11:09,788 INFO [train.py:996] (2/4) Epoch 1, batch 5950, loss[loss=0.3558, simple_loss=0.3859, pruned_loss=0.1628, over 21753.00 frames. ], tot_loss[loss=0.3409, simple_loss=0.3896, pruned_loss=0.1461, over 4284780.50 frames. ], batch size: 112, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 21:11:14,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=35700.0, ans=0.125 2023-06-17 21:11:45,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=35760.0, ans=0.0030956521739130428 2023-06-17 21:12:08,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=35820.0, ans=0.0030826086956521745 2023-06-17 21:12:43,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2023-06-17 21:13:16,688 INFO [train.py:996] (2/4) Epoch 1, batch 6000, loss[loss=0.3291, simple_loss=0.3517, pruned_loss=0.1533, over 21606.00 frames. ], tot_loss[loss=0.346, simple_loss=0.3886, pruned_loss=0.1517, over 4285204.50 frames. ], batch size: 247, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 21:13:16,690 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-17 21:14:09,298 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3443, simple_loss=0.428, pruned_loss=0.1303, over 1796401.00 frames. 2023-06-17 21:14:09,300 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-17 21:14:40,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 3.654e+02 4.651e+02 5.653e+02 9.533e+02, threshold=9.302e+02, percent-clipped=10.0 2023-06-17 21:14:58,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=36120.0, ans=0.125 2023-06-17 21:15:28,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=36180.0, ans=0.025 2023-06-17 21:15:43,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=36240.0, ans=0.125 2023-06-17 21:16:08,863 INFO [train.py:996] (2/4) Epoch 1, batch 6050, loss[loss=0.3082, simple_loss=0.3397, pruned_loss=0.1383, over 21564.00 frames. ], tot_loss[loss=0.3458, simple_loss=0.384, pruned_loss=0.1538, over 4276595.15 frames. ], batch size: 213, lr: 3.97e-02, grad_scale: 32.0 2023-06-17 21:16:19,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-17 21:17:01,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-17 21:17:07,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36420.0, ans=0.1 2023-06-17 21:17:09,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-17 21:17:46,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-17 21:17:51,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.57 vs. limit=22.5 2023-06-17 21:17:51,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=36540.0, ans=0.125 2023-06-17 21:18:23,272 INFO [train.py:996] (2/4) Epoch 1, batch 6100, loss[loss=0.4285, simple_loss=0.4408, pruned_loss=0.2081, over 21914.00 frames. ], tot_loss[loss=0.3453, simple_loss=0.3841, pruned_loss=0.1532, over 4270685.66 frames. 
], batch size: 124, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 21:19:00,229 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.419e+02 4.285e+02 5.583e+02 1.372e+03, threshold=8.569e+02, percent-clipped=6.0 2023-06-17 21:19:02,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=36720.0, ans=0.125 2023-06-17 21:19:03,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=36720.0, ans=0.04949747468305833 2023-06-17 21:19:04,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=36720.0, ans=0.0 2023-06-17 21:19:31,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36780.0, ans=0.1 2023-06-17 21:20:17,380 INFO [train.py:996] (2/4) Epoch 1, batch 6150, loss[loss=0.3159, simple_loss=0.3719, pruned_loss=0.13, over 21770.00 frames. ], tot_loss[loss=0.3503, simple_loss=0.387, pruned_loss=0.1567, over 4270233.56 frames. ], batch size: 282, lr: 3.96e-02, grad_scale: 16.0 2023-06-17 21:20:35,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=36900.0, ans=0.0 2023-06-17 21:20:38,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36900.0, ans=0.1 2023-06-17 21:20:56,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=36960.0, ans=0.95 2023-06-17 21:20:57,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=36960.0, ans=0.2 2023-06-17 21:21:07,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=37020.0, ans=0.1 2023-06-17 21:21:46,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=15.0 2023-06-17 21:21:51,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=37080.0, ans=0.002808695652173913 2023-06-17 21:22:41,972 INFO [train.py:996] (2/4) Epoch 1, batch 6200, loss[loss=0.4974, simple_loss=0.5272, pruned_loss=0.2338, over 21573.00 frames. ], tot_loss[loss=0.3512, simple_loss=0.389, pruned_loss=0.1567, over 4265888.11 frames. 
], batch size: 471, lr: 3.95e-02, grad_scale: 16.0 2023-06-17 21:22:42,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=37200.0, ans=0.125 2023-06-17 21:22:44,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=37200.0, ans=0.0 2023-06-17 21:22:45,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37200.0, ans=0.1 2023-06-17 21:22:47,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=37200.0, ans=15.0 2023-06-17 21:23:13,964 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.238e+02 4.206e+02 5.371e+02 1.012e+03, threshold=8.413e+02, percent-clipped=2.0 2023-06-17 21:23:24,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=37320.0, ans=0.0 2023-06-17 21:24:54,229 INFO [train.py:996] (2/4) Epoch 1, batch 6250, loss[loss=0.324, simple_loss=0.4042, pruned_loss=0.1219, over 21765.00 frames. ], tot_loss[loss=0.3519, simple_loss=0.3919, pruned_loss=0.156, over 4266220.93 frames. ], batch size: 332, lr: 3.94e-02, grad_scale: 16.0 2023-06-17 21:25:08,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=37500.0, ans=0.125 2023-06-17 21:25:26,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=37560.0, ans=0.125 2023-06-17 21:26:46,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37740.0, ans=0.1 2023-06-17 21:27:20,506 INFO [train.py:996] (2/4) Epoch 1, batch 6300, loss[loss=0.3784, simple_loss=0.4052, pruned_loss=0.1758, over 21867.00 frames. ], tot_loss[loss=0.3553, simple_loss=0.3983, pruned_loss=0.1561, over 4266332.48 frames. ], batch size: 124, lr: 3.94e-02, grad_scale: 16.0 2023-06-17 21:28:01,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=37860.0, ans=0.0026391304347826083 2023-06-17 21:28:10,417 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.757e+02 4.825e+02 6.859e+02 1.465e+03, threshold=9.649e+02, percent-clipped=15.0 2023-06-17 21:28:15,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=37920.0, ans=0.125 2023-06-17 21:28:25,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=37920.0, ans=0.0 2023-06-17 21:29:22,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.04 vs. limit=22.5 2023-06-17 21:29:27,517 INFO [train.py:996] (2/4) Epoch 1, batch 6350, loss[loss=0.4083, simple_loss=0.4295, pruned_loss=0.1936, over 21334.00 frames. ], tot_loss[loss=0.3691, simple_loss=0.4076, pruned_loss=0.1653, over 4271392.19 frames. 
], batch size: 176, lr: 3.93e-02, grad_scale: 16.0 2023-06-17 21:29:35,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=38100.0, ans=0.125 2023-06-17 21:30:48,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=38220.0, ans=15.0 2023-06-17 21:31:46,187 INFO [train.py:996] (2/4) Epoch 1, batch 6400, loss[loss=0.384, simple_loss=0.4153, pruned_loss=0.1763, over 21358.00 frames. ], tot_loss[loss=0.3803, simple_loss=0.4163, pruned_loss=0.1722, over 4270587.82 frames. ], batch size: 159, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 21:32:35,825 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.736e+02 4.493e+02 6.013e+02 1.011e+03, threshold=8.985e+02, percent-clipped=1.0 2023-06-17 21:34:03,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=38700.0, ans=0.125 2023-06-17 21:34:04,666 INFO [train.py:996] (2/4) Epoch 1, batch 6450, loss[loss=0.3087, simple_loss=0.3648, pruned_loss=0.1263, over 21819.00 frames. ], tot_loss[loss=0.3743, simple_loss=0.4139, pruned_loss=0.1674, over 4272514.55 frames. ], batch size: 118, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 21:34:25,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=38760.0, ans=0.07 2023-06-17 21:34:33,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=38760.0, ans=0.125 2023-06-17 21:35:07,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=38820.0, ans=0.125 2023-06-17 21:35:47,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=38940.0, ans=0.2 2023-06-17 21:35:52,421 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:36:06,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=39000.0, ans=0.125 2023-06-17 21:36:12,924 INFO [train.py:996] (2/4) Epoch 1, batch 6500, loss[loss=0.3204, simple_loss=0.357, pruned_loss=0.1419, over 21770.00 frames. ], tot_loss[loss=0.3689, simple_loss=0.4059, pruned_loss=0.1659, over 4265756.49 frames. ], batch size: 102, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 21:36:45,255 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-17 21:36:59,799 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.413e+02 4.587e+02 6.044e+02 1.414e+03, threshold=9.175e+02, percent-clipped=8.0 2023-06-17 21:38:25,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-17 21:38:35,543 INFO [train.py:996] (2/4) Epoch 1, batch 6550, loss[loss=0.3579, simple_loss=0.3994, pruned_loss=0.1582, over 21807.00 frames. ], tot_loss[loss=0.367, simple_loss=0.4057, pruned_loss=0.1641, over 4271587.74 frames. 
], batch size: 332, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 21:39:43,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=39480.0, ans=0.125 2023-06-17 21:40:32,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=39540.0, ans=0.125 2023-06-17 21:40:35,058 INFO [train.py:996] (2/4) Epoch 1, batch 6600, loss[loss=0.3909, simple_loss=0.3934, pruned_loss=0.1942, over 21379.00 frames. ], tot_loss[loss=0.3652, simple_loss=0.4022, pruned_loss=0.1641, over 4268276.64 frames. ], batch size: 508, lr: 3.90e-02, grad_scale: 32.0 2023-06-17 21:40:47,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-17 21:41:23,086 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.711e+02 4.295e+02 5.281e+02 1.119e+03, threshold=8.590e+02, percent-clipped=2.0 2023-06-17 21:41:44,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.38 vs. limit=15.0 2023-06-17 21:42:27,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=39840.0, ans=0.125 2023-06-17 21:42:34,154 INFO [train.py:996] (2/4) Epoch 1, batch 6650, loss[loss=0.3402, simple_loss=0.3812, pruned_loss=0.1496, over 21847.00 frames. ], tot_loss[loss=0.3566, simple_loss=0.3944, pruned_loss=0.1594, over 4272096.48 frames. ], batch size: 373, lr: 3.89e-02, grad_scale: 32.0 2023-06-17 21:42:40,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=39900.0, ans=0.035 2023-06-17 21:42:54,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39900.0, ans=0.1 2023-06-17 21:43:50,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=40020.0, ans=0.125 2023-06-17 21:44:49,224 INFO [train.py:996] (2/4) Epoch 1, batch 6700, loss[loss=0.2953, simple_loss=0.3215, pruned_loss=0.1346, over 21427.00 frames. ], tot_loss[loss=0.3529, simple_loss=0.389, pruned_loss=0.1584, over 4273956.54 frames. 
], batch size: 212, lr: 3.89e-02, grad_scale: 32.0 2023-06-17 21:45:11,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=40200.0, ans=0.0021304347826086954 2023-06-17 21:45:40,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=40260.0, ans=0.125 2023-06-17 21:45:41,586 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 3.533e+02 4.521e+02 6.052e+02 1.154e+03, threshold=9.041e+02, percent-clipped=5.0 2023-06-17 21:45:58,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=40320.0, ans=0.5 2023-06-17 21:46:01,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=40320.0, ans=0.0 2023-06-17 21:46:12,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=40380.0, ans=0.0 2023-06-17 21:46:27,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=40380.0, ans=0.025 2023-06-17 21:46:42,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.13 vs. limit=12.0 2023-06-17 21:47:11,741 INFO [train.py:996] (2/4) Epoch 1, batch 6750, loss[loss=0.3209, simple_loss=0.3551, pruned_loss=0.1433, over 15228.00 frames. ], tot_loss[loss=0.3507, simple_loss=0.3865, pruned_loss=0.1575, over 4265954.85 frames. ], batch size: 60, lr: 3.88e-02, grad_scale: 32.0 2023-06-17 21:47:17,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=15.0 2023-06-17 21:47:17,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.28 vs. limit=22.5 2023-06-17 21:47:49,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-17 21:48:18,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2023-06-17 21:48:38,716 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:48:48,990 INFO [train.py:996] (2/4) Epoch 1, batch 6800, loss[loss=0.4177, simple_loss=0.4029, pruned_loss=0.2163, over 21415.00 frames. ], tot_loss[loss=0.3546, simple_loss=0.3875, pruned_loss=0.1609, over 4266997.56 frames. 
], batch size: 508, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 21:48:49,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=40800.0, ans=0.002 2023-06-17 21:49:29,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=40860.0, ans=0.125 2023-06-17 21:49:37,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=40860.0, ans=0.125 2023-06-17 21:49:38,276 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 3.387e+02 4.190e+02 5.566e+02 1.112e+03, threshold=8.380e+02, percent-clipped=6.0 2023-06-17 21:49:51,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=40920.0, ans=0.2 2023-06-17 21:50:09,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=40980.0, ans=0.0019608695652173908 2023-06-17 21:50:36,590 INFO [train.py:996] (2/4) Epoch 1, batch 6850, loss[loss=0.3658, simple_loss=0.3835, pruned_loss=0.174, over 21507.00 frames. ], tot_loss[loss=0.3535, simple_loss=0.3838, pruned_loss=0.1616, over 4271584.56 frames. ], batch size: 194, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 21:50:54,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41160.0, ans=0.1 2023-06-17 21:50:56,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=41160.0, ans=0.125 2023-06-17 21:51:23,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=41220.0, ans=0.1 2023-06-17 21:51:23,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=41220.0, ans=0.001908695652173914 2023-06-17 21:51:23,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=41220.0, ans=0.001908695652173914 2023-06-17 21:52:08,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41280.0, ans=0.1 2023-06-17 21:52:30,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=41340.0, ans=0.125 2023-06-17 21:52:33,797 INFO [train.py:996] (2/4) Epoch 1, batch 6900, loss[loss=0.328, simple_loss=0.3701, pruned_loss=0.143, over 21336.00 frames. ], tot_loss[loss=0.355, simple_loss=0.3847, pruned_loss=0.1627, over 4273243.47 frames. ], batch size: 159, lr: 3.86e-02, grad_scale: 32.0 2023-06-17 21:52:34,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=41400.0, ans=0.125 2023-06-17 21:53:31,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 3.488e+02 4.127e+02 5.234e+02 1.332e+03, threshold=8.254e+02, percent-clipped=6.0 2023-06-17 21:53:59,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=41520.0, ans=0.0018434782608695664 2023-06-17 21:54:54,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. 
limit=15.0 2023-06-17 21:54:58,877 INFO [train.py:996] (2/4) Epoch 1, batch 6950, loss[loss=0.3517, simple_loss=0.3943, pruned_loss=0.1546, over 21092.00 frames. ], tot_loss[loss=0.3493, simple_loss=0.3845, pruned_loss=0.1571, over 4269113.64 frames. ], batch size: 607, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 21:55:28,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=41700.0, ans=0.025 2023-06-17 21:55:38,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=41760.0, ans=0.125 2023-06-17 21:56:05,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41820.0, ans=0.1 2023-06-17 21:56:24,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=41820.0, ans=0.0017782608695652187 2023-06-17 21:57:23,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-17 21:57:23,692 INFO [train.py:996] (2/4) Epoch 1, batch 7000, loss[loss=0.3721, simple_loss=0.3865, pruned_loss=0.1788, over 21440.00 frames. ], tot_loss[loss=0.3583, simple_loss=0.3904, pruned_loss=0.1632, over 4272419.05 frames. ], batch size: 389, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 21:57:28,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=42000.0, ans=0.0017391304347826094 2023-06-17 21:58:00,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 4.182e+02 5.502e+02 7.553e+02 1.291e+03, threshold=1.100e+03, percent-clipped=22.0 2023-06-17 21:59:34,221 INFO [train.py:996] (2/4) Epoch 1, batch 7050, loss[loss=0.3168, simple_loss=0.3681, pruned_loss=0.1328, over 21626.00 frames. ], tot_loss[loss=0.3563, simple_loss=0.39, pruned_loss=0.1613, over 4264026.22 frames. ], batch size: 230, lr: 3.84e-02, grad_scale: 32.0 2023-06-17 22:00:17,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=12.0 2023-06-17 22:00:56,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42420.0, ans=0.1 2023-06-17 22:01:20,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=42480.0, ans=0.0 2023-06-17 22:01:32,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=42540.0, ans=0.125 2023-06-17 22:01:42,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=42540.0, ans=0.0 2023-06-17 22:01:55,179 INFO [train.py:996] (2/4) Epoch 1, batch 7100, loss[loss=0.3884, simple_loss=0.4181, pruned_loss=0.1794, over 21669.00 frames. ], tot_loss[loss=0.3629, simple_loss=0.3985, pruned_loss=0.1636, over 4259896.41 frames. 
], batch size: 441, lr: 3.83e-02, grad_scale: 32.0 2023-06-17 22:02:53,219 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 3.347e+02 4.129e+02 5.601e+02 1.207e+03, threshold=8.258e+02, percent-clipped=3.0 2023-06-17 22:03:05,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=42720.0, ans=0.125 2023-06-17 22:03:10,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0 2023-06-17 22:03:12,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-17 22:03:44,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-17 22:04:10,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=42900.0, ans=0.125 2023-06-17 22:04:11,658 INFO [train.py:996] (2/4) Epoch 1, batch 7150, loss[loss=0.4821, simple_loss=0.4797, pruned_loss=0.2423, over 21418.00 frames. ], tot_loss[loss=0.358, simple_loss=0.3962, pruned_loss=0.1599, over 4252046.85 frames. ], batch size: 471, lr: 3.83e-02, grad_scale: 32.0 2023-06-17 22:04:49,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=42960.0, ans=0.0015304347826086955 2023-06-17 22:05:04,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=43020.0, ans=0.0 2023-06-17 22:06:10,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43140.0, ans=0.1 2023-06-17 22:06:33,813 INFO [train.py:996] (2/4) Epoch 1, batch 7200, loss[loss=0.4071, simple_loss=0.4336, pruned_loss=0.1903, over 21381.00 frames. ], tot_loss[loss=0.3613, simple_loss=0.3974, pruned_loss=0.1626, over 4257125.38 frames. ], batch size: 549, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 22:07:00,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.12 vs. limit=22.5 2023-06-17 22:07:00,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=43260.0, ans=0.2 2023-06-17 22:07:18,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 3.282e+02 4.227e+02 5.672e+02 1.166e+03, threshold=8.454e+02, percent-clipped=6.0 2023-06-17 22:07:44,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-17 22:07:45,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.11 vs. limit=10.0 2023-06-17 22:08:49,366 INFO [train.py:996] (2/4) Epoch 1, batch 7250, loss[loss=0.3072, simple_loss=0.3395, pruned_loss=0.1374, over 21616.00 frames. ], tot_loss[loss=0.3586, simple_loss=0.3921, pruned_loss=0.1626, over 4266009.40 frames. 
], batch size: 282, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 22:08:55,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=43500.0, ans=0.0014130434782608694 2023-06-17 22:10:27,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=43740.0, ans=0.0 2023-06-17 22:10:40,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43740.0, ans=0.1 2023-06-17 22:10:42,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-17 22:10:47,943 INFO [train.py:996] (2/4) Epoch 1, batch 7300, loss[loss=0.2985, simple_loss=0.3299, pruned_loss=0.1335, over 21675.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3844, pruned_loss=0.1601, over 4253806.89 frames. ], batch size: 333, lr: 3.81e-02, grad_scale: 32.0 2023-06-17 22:11:37,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=43860.0, ans=0.125 2023-06-17 22:11:39,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=43860.0, ans=0.0 2023-06-17 22:11:54,765 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.416e+02 3.511e+02 4.213e+02 5.491e+02 1.019e+03, threshold=8.426e+02, percent-clipped=2.0 2023-06-17 22:12:31,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=43980.0, ans=0.0 2023-06-17 22:12:59,905 INFO [train.py:996] (2/4) Epoch 1, batch 7350, loss[loss=0.531, simple_loss=0.5034, pruned_loss=0.2793, over 21336.00 frames. ], tot_loss[loss=0.3515, simple_loss=0.382, pruned_loss=0.1605, over 4250178.88 frames. ], batch size: 507, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 22:14:28,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=44280.0, ans=0.125 2023-06-17 22:14:55,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=44340.0, ans=0.2 2023-06-17 22:14:59,895 INFO [train.py:996] (2/4) Epoch 1, batch 7400, loss[loss=0.3557, simple_loss=0.3721, pruned_loss=0.1696, over 21218.00 frames. ], tot_loss[loss=0.3601, simple_loss=0.3902, pruned_loss=0.165, over 4256730.24 frames. ], batch size: 176, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 22:15:47,316 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.591e+02 3.927e+02 4.881e+02 6.398e+02 1.158e+03, threshold=9.762e+02, percent-clipped=7.0 2023-06-17 22:15:47,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=44460.0, ans=0.0 2023-06-17 22:17:03,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=44700.0, ans=0.125 2023-06-17 22:17:04,094 INFO [train.py:996] (2/4) Epoch 1, batch 7450, loss[loss=0.3256, simple_loss=0.3513, pruned_loss=0.15, over 21323.00 frames. ], tot_loss[loss=0.3589, simple_loss=0.3887, pruned_loss=0.1646, over 4251201.77 frames. 
], batch size: 131, lr: 3.79e-02, grad_scale: 32.0 2023-06-17 22:18:18,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44820.0, ans=0.1 2023-06-17 22:18:40,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=44880.0, ans=0.125 2023-06-17 22:19:09,757 INFO [train.py:996] (2/4) Epoch 1, batch 7500, loss[loss=0.3583, simple_loss=0.431, pruned_loss=0.1428, over 21703.00 frames. ], tot_loss[loss=0.3615, simple_loss=0.3927, pruned_loss=0.1652, over 4256730.54 frames. ], batch size: 298, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 22:20:07,708 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.630e+02 4.462e+02 5.942e+02 1.492e+03, threshold=8.924e+02, percent-clipped=4.0 2023-06-17 22:20:31,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=45180.0, ans=0.2 2023-06-17 22:20:59,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=45240.0, ans=0.125 2023-06-17 22:21:05,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=45300.0, ans=0.125 2023-06-17 22:21:06,586 INFO [train.py:996] (2/4) Epoch 1, batch 7550, loss[loss=0.3862, simple_loss=0.3974, pruned_loss=0.1875, over 21119.00 frames. ], tot_loss[loss=0.3608, simple_loss=0.3985, pruned_loss=0.1616, over 4265314.55 frames. ], batch size: 608, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 22:21:30,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=45360.0, ans=0.0 2023-06-17 22:21:56,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.73 vs. limit=15.0 2023-06-17 22:22:02,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=45420.0, ans=0.0 2023-06-17 22:22:03,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.29 vs. limit=22.5 2023-06-17 22:22:23,922 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.99 vs. limit=15.0 2023-06-17 22:22:39,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=45540.0, ans=10.0 2023-06-17 22:22:42,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=45600.0, ans=0.125 2023-06-17 22:22:43,402 INFO [train.py:996] (2/4) Epoch 1, batch 7600, loss[loss=0.3616, simple_loss=0.3922, pruned_loss=0.1655, over 21787.00 frames. ], tot_loss[loss=0.3574, simple_loss=0.3968, pruned_loss=0.1589, over 4269717.27 frames. ], batch size: 247, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 22:23:02,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. 
limit=15.0 2023-06-17 22:23:15,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.356e+02 4.120e+02 5.350e+02 1.313e+03, threshold=8.240e+02, percent-clipped=1.0 2023-06-17 22:24:12,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=22.5 2023-06-17 22:24:44,510 INFO [train.py:996] (2/4) Epoch 1, batch 7650, loss[loss=0.3452, simple_loss=0.3739, pruned_loss=0.1583, over 21604.00 frames. ], tot_loss[loss=0.3599, simple_loss=0.3962, pruned_loss=0.1618, over 4275861.45 frames. ], batch size: 212, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 22:24:48,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=45900.0, ans=10.0 2023-06-17 22:24:59,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=45900.0, ans=0.125 2023-06-17 22:25:11,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=45960.0, ans=0.0008782608695652172 2023-06-17 22:26:10,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=46080.0, ans=0.2 2023-06-17 22:26:11,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=46080.0, ans=0.2 2023-06-17 22:26:35,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=46140.0, ans=0.025 2023-06-17 22:27:08,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.29 vs. limit=15.0 2023-06-17 22:27:09,906 INFO [train.py:996] (2/4) Epoch 1, batch 7700, loss[loss=0.3885, simple_loss=0.4205, pruned_loss=0.1783, over 21555.00 frames. ], tot_loss[loss=0.3672, simple_loss=0.4013, pruned_loss=0.1665, over 4279332.00 frames. ], batch size: 230, lr: 3.76e-02, grad_scale: 32.0 2023-06-17 22:28:07,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.716e+02 4.576e+02 5.497e+02 9.392e+02, threshold=9.152e+02, percent-clipped=2.0 2023-06-17 22:28:15,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=46320.0, ans=0.0 2023-06-17 22:28:26,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=46380.0, ans=0.125 2023-06-17 22:28:51,549 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:29:09,552 INFO [train.py:996] (2/4) Epoch 1, batch 7750, loss[loss=0.452, simple_loss=0.504, pruned_loss=0.2, over 21768.00 frames. ], tot_loss[loss=0.3734, simple_loss=0.4087, pruned_loss=0.169, over 4273540.09 frames. 
], batch size: 332, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 22:30:07,576 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:30:09,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=46620.0, ans=0.125 2023-06-17 22:30:15,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=46620.0, ans=0.125 2023-06-17 22:30:58,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=46740.0, ans=0.125 2023-06-17 22:31:14,894 INFO [train.py:996] (2/4) Epoch 1, batch 7800, loss[loss=0.3245, simple_loss=0.3695, pruned_loss=0.1398, over 21783.00 frames. ], tot_loss[loss=0.3739, simple_loss=0.4108, pruned_loss=0.1685, over 4273567.53 frames. ], batch size: 282, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 22:31:45,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=46800.0, ans=0.125 2023-06-17 22:32:11,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.644e+02 4.604e+02 5.812e+02 1.073e+03, threshold=9.208e+02, percent-clipped=1.0 2023-06-17 22:33:10,956 INFO [train.py:996] (2/4) Epoch 1, batch 7850, loss[loss=0.2975, simple_loss=0.3376, pruned_loss=0.1287, over 21532.00 frames. ], tot_loss[loss=0.3676, simple_loss=0.4027, pruned_loss=0.1662, over 4263027.21 frames. ], batch size: 230, lr: 3.74e-02, grad_scale: 32.0 2023-06-17 22:33:38,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-17 22:33:53,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=47220.0, ans=0.07 2023-06-17 22:33:59,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=47220.0, ans=0.025 2023-06-17 22:34:50,992 INFO [train.py:996] (2/4) Epoch 1, batch 7900, loss[loss=0.2986, simple_loss=0.3312, pruned_loss=0.133, over 21114.00 frames. ], tot_loss[loss=0.3631, simple_loss=0.3976, pruned_loss=0.1643, over 4266158.29 frames. ], batch size: 143, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 22:35:25,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=47460.0, ans=0.07 2023-06-17 22:35:28,925 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.540e+02 3.542e+02 4.364e+02 5.493e+02 1.086e+03, threshold=8.728e+02, percent-clipped=7.0 2023-06-17 22:36:30,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=47580.0, ans=0.0 2023-06-17 22:36:58,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-06-17 22:37:15,570 INFO [train.py:996] (2/4) Epoch 1, batch 7950, loss[loss=0.4319, simple_loss=0.4587, pruned_loss=0.2026, over 21321.00 frames. ], tot_loss[loss=0.3631, simple_loss=0.4011, pruned_loss=0.1626, over 4268372.98 frames. 
], batch size: 548, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 22:37:20,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=47700.0, ans=0.125 2023-06-17 22:37:21,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.95 vs. limit=10.0 2023-06-17 22:37:34,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.45 vs. limit=5.0 2023-06-17 22:37:43,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=47760.0, ans=0.0004869565217391295 2023-06-17 22:38:18,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-17 22:38:19,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-17 22:39:38,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=48000.0, ans=0.0 2023-06-17 22:39:38,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=48000.0, ans=0.125 2023-06-17 22:39:42,793 INFO [train.py:996] (2/4) Epoch 1, batch 8000, loss[loss=0.4008, simple_loss=0.4467, pruned_loss=0.1774, over 21869.00 frames. ], tot_loss[loss=0.3739, simple_loss=0.4088, pruned_loss=0.1695, over 4268126.90 frames. ], batch size: 372, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 22:39:55,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48000.0, ans=0.1 2023-06-17 22:40:17,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 4.071e+02 5.066e+02 6.362e+02 1.546e+03, threshold=1.013e+03, percent-clipped=8.0 2023-06-17 22:41:04,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=48180.0, ans=0.5 2023-06-17 22:41:51,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=48240.0, ans=0.125 2023-06-17 22:42:17,229 INFO [train.py:996] (2/4) Epoch 1, batch 8050, loss[loss=0.4112, simple_loss=0.4737, pruned_loss=0.1743, over 21240.00 frames. ], tot_loss[loss=0.3769, simple_loss=0.4124, pruned_loss=0.1708, over 4264126.83 frames. ], batch size: 548, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 22:42:22,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=48300.0, ans=0.0 2023-06-17 22:42:33,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=48360.0, ans=0.0003565217391304342 2023-06-17 22:44:07,447 INFO [train.py:996] (2/4) Epoch 1, batch 8100, loss[loss=0.3451, simple_loss=0.3751, pruned_loss=0.1575, over 21830.00 frames. ], tot_loss[loss=0.3726, simple_loss=0.4087, pruned_loss=0.1683, over 4264919.37 frames. 
], batch size: 282, lr: 3.71e-02, grad_scale: 32.0 2023-06-17 22:44:35,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=48660.0, ans=0.035 2023-06-17 22:45:13,361 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 3.716e+02 4.690e+02 6.352e+02 1.621e+03, threshold=9.381e+02, percent-clipped=6.0 2023-06-17 22:45:15,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=48720.0, ans=0.04949747468305833 2023-06-17 22:45:36,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=48780.0, ans=0.0002652173913043482 2023-06-17 22:45:39,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=48780.0, ans=0.07 2023-06-17 22:45:52,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=48780.0, ans=0.125 2023-06-17 22:46:10,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=48840.0, ans=0.0 2023-06-17 22:46:41,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=48840.0, ans=0.125 2023-06-17 22:46:45,263 INFO [train.py:996] (2/4) Epoch 1, batch 8150, loss[loss=0.3351, simple_loss=0.42, pruned_loss=0.1251, over 21713.00 frames. ], tot_loss[loss=0.3764, simple_loss=0.4161, pruned_loss=0.1684, over 4274683.50 frames. ], batch size: 351, lr: 3.70e-02, grad_scale: 32.0 2023-06-17 22:46:57,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=48900.0, ans=0.125 2023-06-17 22:47:44,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49020.0, ans=0.1 2023-06-17 22:47:51,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.31 vs. limit=15.0 2023-06-17 22:48:34,256 INFO [train.py:996] (2/4) Epoch 1, batch 8200, loss[loss=0.3968, simple_loss=0.3945, pruned_loss=0.1995, over 21466.00 frames. ], tot_loss[loss=0.3684, simple_loss=0.4077, pruned_loss=0.1645, over 4262383.47 frames. ], batch size: 441, lr: 3.70e-02, grad_scale: 32.0 2023-06-17 22:49:19,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.379e+02 3.766e+02 4.530e+02 5.768e+02 1.043e+03, threshold=9.060e+02, percent-clipped=2.0 2023-06-17 22:49:29,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=49320.0, ans=0.125 2023-06-17 22:49:43,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-17 22:50:11,969 INFO [train.py:996] (2/4) Epoch 1, batch 8250, loss[loss=0.3715, simple_loss=0.4241, pruned_loss=0.1595, over 21826.00 frames. ], tot_loss[loss=0.3672, simple_loss=0.4059, pruned_loss=0.1642, over 4264210.38 frames. 
], batch size: 371, lr: 3.69e-02, grad_scale: 32.0 2023-06-17 22:50:12,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=49500.0, ans=0.125 2023-06-17 22:50:24,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=49500.0, ans=0.125 2023-06-17 22:50:30,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=49500.0, ans=0.04949747468305833 2023-06-17 22:50:38,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49560.0, ans=0.1 2023-06-17 22:51:06,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=49620.0, ans=0.125 2023-06-17 22:51:08,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.01 vs. limit=22.5 2023-06-17 22:51:55,146 INFO [train.py:996] (2/4) Epoch 1, batch 8300, loss[loss=0.2831, simple_loss=0.3353, pruned_loss=0.1155, over 21791.00 frames. ], tot_loss[loss=0.3612, simple_loss=0.4019, pruned_loss=0.1602, over 4262477.06 frames. ], batch size: 124, lr: 3.68e-02, grad_scale: 32.0 2023-06-17 22:52:34,297 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 3.435e+02 4.224e+02 5.428e+02 8.537e+02, threshold=8.449e+02, percent-clipped=0.0 2023-06-17 22:52:45,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=49920.0, ans=0.0 2023-06-17 22:53:07,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=49980.0, ans=0.125 2023-06-17 22:53:32,139 INFO [train.py:996] (2/4) Epoch 1, batch 8350, loss[loss=0.3496, simple_loss=0.344, pruned_loss=0.1776, over 20111.00 frames. ], tot_loss[loss=0.3552, simple_loss=0.3976, pruned_loss=0.1564, over 4267999.78 frames. ], batch size: 704, lr: 3.68e-02, grad_scale: 32.0 2023-06-17 22:55:15,325 INFO [train.py:996] (2/4) Epoch 1, batch 8400, loss[loss=0.3605, simple_loss=0.4087, pruned_loss=0.1561, over 21682.00 frames. ], tot_loss[loss=0.3525, simple_loss=0.3971, pruned_loss=0.1539, over 4273943.98 frames. ], batch size: 441, lr: 3.67e-02, grad_scale: 32.0 2023-06-17 22:55:41,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=50400.0, ans=0.125 2023-06-17 22:56:05,635 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 4.068e+02 5.029e+02 6.288e+02 1.067e+03, threshold=1.006e+03, percent-clipped=6.0 2023-06-17 22:56:15,883 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:56:20,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-17 22:56:30,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=50580.0, ans=0.125 2023-06-17 22:56:56,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=50640.0, ans=0.0 2023-06-17 22:57:03,215 INFO [train.py:996] (2/4) Epoch 1, batch 8450, loss[loss=0.3669, simple_loss=0.3899, pruned_loss=0.172, over 21771.00 frames. 
], tot_loss[loss=0.3551, simple_loss=0.3976, pruned_loss=0.1563, over 4283109.85 frames. ], batch size: 298, lr: 3.67e-02, grad_scale: 32.0 2023-06-17 22:57:08,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=50700.0, ans=0.05 2023-06-17 22:57:22,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=50760.0, ans=15.0 2023-06-17 22:57:40,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=50760.0, ans=0.2 2023-06-17 22:58:14,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=50880.0, ans=0.0 2023-06-17 22:58:24,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-17 22:58:47,892 INFO [train.py:996] (2/4) Epoch 1, batch 8500, loss[loss=0.4304, simple_loss=0.4421, pruned_loss=0.2094, over 21509.00 frames. ], tot_loss[loss=0.3544, simple_loss=0.3934, pruned_loss=0.1577, over 4279204.86 frames. ], batch size: 389, lr: 3.66e-02, grad_scale: 32.0 2023-06-17 22:59:10,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51060.0, ans=0.1 2023-06-17 22:59:27,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.628e+02 4.473e+02 5.603e+02 9.273e+02, threshold=8.945e+02, percent-clipped=0.0 2023-06-17 22:59:32,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=51120.0, ans=0.0 2023-06-17 22:59:47,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=51120.0, ans=0.125 2023-06-17 23:00:05,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=51240.0, ans=0.125 2023-06-17 23:00:20,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=51240.0, ans=0.125 2023-06-17 23:00:31,209 INFO [train.py:996] (2/4) Epoch 1, batch 8550, loss[loss=0.3221, simple_loss=0.3775, pruned_loss=0.1334, over 21427.00 frames. ], tot_loss[loss=0.3615, simple_loss=0.3995, pruned_loss=0.1617, over 4271211.74 frames. ], batch size: 211, lr: 3.65e-02, grad_scale: 32.0 2023-06-17 23:00:38,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.34 vs. limit=22.5 2023-06-17 23:01:40,723 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-17 23:02:15,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=51540.0, ans=0.125 2023-06-17 23:02:29,146 INFO [train.py:996] (2/4) Epoch 1, batch 8600, loss[loss=0.4316, simple_loss=0.4581, pruned_loss=0.2025, over 21882.00 frames. ], tot_loss[loss=0.3654, simple_loss=0.4054, pruned_loss=0.1627, over 4272964.96 frames. 
], batch size: 371, lr: 3.65e-02, grad_scale: 32.0 2023-06-17 23:02:54,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=51660.0, ans=0.0 2023-06-17 23:03:10,040 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.952e+02 4.835e+02 6.888e+02 1.478e+03, threshold=9.670e+02, percent-clipped=13.0 2023-06-17 23:04:09,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=51840.0, ans=0.125 2023-06-17 23:04:29,692 INFO [train.py:996] (2/4) Epoch 1, batch 8650, loss[loss=0.2488, simple_loss=0.3244, pruned_loss=0.08663, over 21714.00 frames. ], tot_loss[loss=0.3676, simple_loss=0.4099, pruned_loss=0.1627, over 4272702.14 frames. ], batch size: 282, lr: 3.64e-02, grad_scale: 32.0 2023-06-17 23:04:31,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=51900.0, ans=0.0 2023-06-17 23:04:42,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2023-06-17 23:04:47,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51900.0, ans=0.1 2023-06-17 23:04:48,850 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:05:06,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. limit=10.0 2023-06-17 23:05:34,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52020.0, ans=0.1 2023-06-17 23:06:08,470 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-17 23:06:19,681 INFO [train.py:996] (2/4) Epoch 1, batch 8700, loss[loss=0.2782, simple_loss=0.3332, pruned_loss=0.1116, over 21630.00 frames. ], tot_loss[loss=0.3551, simple_loss=0.3983, pruned_loss=0.156, over 4272757.88 frames. ], batch size: 263, lr: 3.64e-02, grad_scale: 32.0 2023-06-17 23:06:59,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 3.304e+02 4.220e+02 5.103e+02 8.221e+02, threshold=8.441e+02, percent-clipped=0.0 2023-06-17 23:07:00,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=52320.0, ans=0.125 2023-06-17 23:07:22,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.20 vs. limit=22.5 2023-06-17 23:07:42,215 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:07:54,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=52440.0, ans=0.125 2023-06-17 23:07:58,792 INFO [train.py:996] (2/4) Epoch 1, batch 8750, loss[loss=0.4602, simple_loss=0.521, pruned_loss=0.1997, over 20870.00 frames. ], tot_loss[loss=0.3552, simple_loss=0.3944, pruned_loss=0.158, over 4278634.18 frames. 
], batch size: 607, lr: 3.63e-02, grad_scale: 32.0 2023-06-17 23:08:00,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=52500.0, ans=0.2 2023-06-17 23:08:07,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=52500.0, ans=0.0 2023-06-17 23:08:13,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=52500.0, ans=0.0 2023-06-17 23:09:54,980 INFO [train.py:996] (2/4) Epoch 1, batch 8800, loss[loss=0.5086, simple_loss=0.5344, pruned_loss=0.2414, over 21483.00 frames. ], tot_loss[loss=0.3661, simple_loss=0.405, pruned_loss=0.1636, over 4282530.95 frames. ], batch size: 471, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 23:10:43,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=52860.0, ans=0.0 2023-06-17 23:10:45,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 4.248e+02 5.624e+02 7.492e+02 1.328e+03, threshold=1.125e+03, percent-clipped=14.0 2023-06-17 23:11:55,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=53040.0, ans=0.5 2023-06-17 23:11:59,497 INFO [train.py:996] (2/4) Epoch 1, batch 8850, loss[loss=0.3288, simple_loss=0.3781, pruned_loss=0.1397, over 21446.00 frames. ], tot_loss[loss=0.3737, simple_loss=0.4139, pruned_loss=0.1668, over 4285690.82 frames. ], batch size: 211, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 23:12:16,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.12 vs. limit=5.0 2023-06-17 23:13:38,033 INFO [train.py:996] (2/4) Epoch 1, batch 8900, loss[loss=0.3131, simple_loss=0.3488, pruned_loss=0.1387, over 21201.00 frames. ], tot_loss[loss=0.3696, simple_loss=0.409, pruned_loss=0.1651, over 4286169.56 frames. ], batch size: 176, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 23:13:40,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=22.5 2023-06-17 23:14:12,502 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 3.765e+02 4.627e+02 5.813e+02 9.173e+02, threshold=9.253e+02, percent-clipped=0.0 2023-06-17 23:14:38,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=53520.0, ans=0.125 2023-06-17 23:14:40,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-06-17 23:15:28,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=53640.0, ans=0.125 2023-06-17 23:15:39,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53640.0, ans=0.1 2023-06-17 23:15:46,980 INFO [train.py:996] (2/4) Epoch 1, batch 8950, loss[loss=0.3021, simple_loss=0.3383, pruned_loss=0.1329, over 21332.00 frames. ], tot_loss[loss=0.3676, simple_loss=0.409, pruned_loss=0.1631, over 4277957.27 frames. 
], batch size: 131, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 23:16:55,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=53820.0, ans=0.125 2023-06-17 23:17:14,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=53880.0, ans=0.0 2023-06-17 23:17:19,603 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2023-06-17 23:17:29,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=53940.0, ans=0.125 2023-06-17 23:17:50,561 INFO [train.py:996] (2/4) Epoch 1, batch 9000, loss[loss=0.3312, simple_loss=0.3763, pruned_loss=0.1431, over 21708.00 frames. ], tot_loss[loss=0.3608, simple_loss=0.3992, pruned_loss=0.1612, over 4279596.45 frames. ], batch size: 282, lr: 3.60e-02, grad_scale: 32.0 2023-06-17 23:17:50,561 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-17 23:18:38,704 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.5803, 5.1616, 5.4952, 5.1435], device='cuda:2') 2023-06-17 23:18:41,570 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3222, simple_loss=0.4116, pruned_loss=0.1164, over 1796401.00 frames. 2023-06-17 23:18:41,580 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-17 23:19:27,224 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.702e+02 3.716e+02 4.512e+02 5.914e+02 1.006e+03, threshold=9.023e+02, percent-clipped=2.0 2023-06-17 23:20:23,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54240.0, ans=0.1 2023-06-17 23:20:25,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-17 23:20:43,149 INFO [train.py:996] (2/4) Epoch 1, batch 9050, loss[loss=0.4114, simple_loss=0.4394, pruned_loss=0.1917, over 21607.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.3968, pruned_loss=0.1561, over 4274885.23 frames. ], batch size: 389, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 23:21:08,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-17 23:21:29,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-17 23:21:31,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54420.0, ans=0.1 2023-06-17 23:22:10,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-17 23:22:43,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=54540.0, ans=0.0 2023-06-17 23:22:56,139 INFO [train.py:996] (2/4) Epoch 1, batch 9100, loss[loss=0.3202, simple_loss=0.3922, pruned_loss=0.1241, over 21783.00 frames. ], tot_loss[loss=0.3626, simple_loss=0.4044, pruned_loss=0.1604, over 4266226.17 frames. 
], batch size: 282, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 23:23:30,795 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.462e+02 4.314e+02 5.762e+02 1.601e+03, threshold=8.627e+02, percent-clipped=9.0 2023-06-17 23:23:32,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=54720.0, ans=0.125 2023-06-17 23:24:33,672 INFO [train.py:996] (2/4) Epoch 1, batch 9150, loss[loss=0.5027, simple_loss=0.517, pruned_loss=0.2441, over 21545.00 frames. ], tot_loss[loss=0.3588, simple_loss=0.4053, pruned_loss=0.1562, over 4269835.37 frames. ], batch size: 508, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 23:25:18,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.06 vs. limit=15.0 2023-06-17 23:25:40,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=55020.0, ans=0.0 2023-06-17 23:25:41,656 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:26:17,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.56 vs. limit=10.0 2023-06-17 23:26:47,974 INFO [train.py:996] (2/4) Epoch 1, batch 9200, loss[loss=0.3442, simple_loss=0.3958, pruned_loss=0.1463, over 21390.00 frames. ], tot_loss[loss=0.3551, simple_loss=0.4037, pruned_loss=0.1532, over 4272378.07 frames. ], batch size: 194, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 23:26:52,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=55200.0, ans=0.0 2023-06-17 23:27:24,791 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.410e+02 3.827e+02 4.968e+02 6.967e+02 1.252e+03, threshold=9.935e+02, percent-clipped=13.0 2023-06-17 23:28:47,259 INFO [train.py:996] (2/4) Epoch 1, batch 9250, loss[loss=0.3988, simple_loss=0.3917, pruned_loss=0.203, over 21302.00 frames. ], tot_loss[loss=0.366, simple_loss=0.4099, pruned_loss=0.1611, over 4273841.66 frames. ], batch size: 507, lr: 3.57e-02, grad_scale: 16.0 2023-06-17 23:28:49,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=22.5 2023-06-17 23:29:22,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=55620.0, ans=0.0 2023-06-17 23:29:28,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=55620.0, ans=0.0 2023-06-17 23:29:30,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=55620.0, ans=0.0 2023-06-17 23:29:35,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=12.0 2023-06-17 23:29:43,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55680.0, ans=0.1 2023-06-17 23:30:15,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. 
limit=15.0 2023-06-17 23:30:25,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=55800.0, ans=0.025 2023-06-17 23:30:26,449 INFO [train.py:996] (2/4) Epoch 1, batch 9300, loss[loss=0.3341, simple_loss=0.3872, pruned_loss=0.1405, over 21719.00 frames. ], tot_loss[loss=0.3646, simple_loss=0.4054, pruned_loss=0.162, over 4270189.87 frames. ], batch size: 282, lr: 3.57e-02, grad_scale: 16.0 2023-06-17 23:30:56,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=55860.0, ans=0.125 2023-06-17 23:31:10,768 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.819e+02 4.444e+02 5.472e+02 6.937e+02 1.249e+03, threshold=1.094e+03, percent-clipped=7.0 2023-06-17 23:31:15,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=55920.0, ans=0.95 2023-06-17 23:32:32,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=56040.0, ans=0.125 2023-06-17 23:32:35,230 INFO [train.py:996] (2/4) Epoch 1, batch 9350, loss[loss=0.3977, simple_loss=0.4361, pruned_loss=0.1796, over 21863.00 frames. ], tot_loss[loss=0.3702, simple_loss=0.4131, pruned_loss=0.1637, over 4274015.65 frames. ], batch size: 118, lr: 3.56e-02, grad_scale: 16.0 2023-06-17 23:33:18,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0 2023-06-17 23:34:32,809 INFO [train.py:996] (2/4) Epoch 1, batch 9400, loss[loss=0.3432, simple_loss=0.3825, pruned_loss=0.152, over 21498.00 frames. ], tot_loss[loss=0.3726, simple_loss=0.4156, pruned_loss=0.1648, over 4276256.70 frames. ], batch size: 389, lr: 3.55e-02, grad_scale: 16.0 2023-06-17 23:34:35,199 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.87 vs. limit=10.0 2023-06-17 23:35:10,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-17 23:35:11,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=56460.0, ans=0.125 2023-06-17 23:35:26,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.466e+02 4.550e+02 6.027e+02 9.606e+02, threshold=9.099e+02, percent-clipped=0.0 2023-06-17 23:35:39,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=56520.0, ans=0.0 2023-06-17 23:35:47,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=56580.0, ans=0.0 2023-06-17 23:36:23,003 INFO [train.py:996] (2/4) Epoch 1, batch 9450, loss[loss=0.3622, simple_loss=0.388, pruned_loss=0.1683, over 21988.00 frames. ], tot_loss[loss=0.3674, simple_loss=0.4073, pruned_loss=0.1638, over 4271659.39 frames. 
], batch size: 103, lr: 3.55e-02, grad_scale: 16.0 2023-06-17 23:36:30,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56700.0, ans=0.1 2023-06-17 23:37:36,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=56880.0, ans=0.0 2023-06-17 23:38:00,024 INFO [train.py:996] (2/4) Epoch 1, batch 9500, loss[loss=0.3272, simple_loss=0.3723, pruned_loss=0.141, over 21417.00 frames. ], tot_loss[loss=0.3585, simple_loss=0.3969, pruned_loss=0.16, over 4268050.36 frames. ], batch size: 194, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 23:38:09,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=57000.0, ans=0.125 2023-06-17 23:38:37,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=57060.0, ans=0.025 2023-06-17 23:38:50,369 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 3.365e+02 4.382e+02 5.694e+02 1.167e+03, threshold=8.764e+02, percent-clipped=2.0 2023-06-17 23:39:03,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-17 23:39:27,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=57180.0, ans=0.125 2023-06-17 23:39:56,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=57240.0, ans=0.0 2023-06-17 23:40:02,009 INFO [train.py:996] (2/4) Epoch 1, batch 9550, loss[loss=0.4106, simple_loss=0.4607, pruned_loss=0.1803, over 21770.00 frames. ], tot_loss[loss=0.3647, simple_loss=0.4019, pruned_loss=0.1638, over 4265898.42 frames. ], batch size: 332, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 23:40:07,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57300.0, ans=0.1 2023-06-17 23:40:30,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57360.0, ans=0.1 2023-06-17 23:41:46,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=57540.0, ans=0.0 2023-06-17 23:41:54,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.12 vs. limit=22.5 2023-06-17 23:41:55,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57540.0, ans=0.1 2023-06-17 23:41:59,561 INFO [train.py:996] (2/4) Epoch 1, batch 9600, loss[loss=0.3577, simple_loss=0.3874, pruned_loss=0.164, over 21533.00 frames. ], tot_loss[loss=0.3686, simple_loss=0.4057, pruned_loss=0.1657, over 4269216.76 frames. ], batch size: 548, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 23:42:37,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.67 vs. limit=15.0 2023-06-17 23:42:47,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.94 vs. 
limit=15.0 2023-06-17 23:42:54,717 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.734e+02 4.517e+02 6.077e+02 1.128e+03, threshold=9.035e+02, percent-clipped=4.0 2023-06-17 23:43:07,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57720.0, ans=0.1 2023-06-17 23:43:21,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57780.0, ans=0.1 2023-06-17 23:43:34,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=57840.0, ans=0.0 2023-06-17 23:43:54,540 INFO [train.py:996] (2/4) Epoch 1, batch 9650, loss[loss=0.356, simple_loss=0.3926, pruned_loss=0.1597, over 21942.00 frames. ], tot_loss[loss=0.3645, simple_loss=0.4018, pruned_loss=0.1636, over 4277107.11 frames. ], batch size: 316, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 23:45:28,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=58080.0, ans=0.125 2023-06-17 23:45:40,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=58140.0, ans=0.125 2023-06-17 23:45:54,824 INFO [train.py:996] (2/4) Epoch 1, batch 9700, loss[loss=0.3494, simple_loss=0.3676, pruned_loss=0.1656, over 20215.00 frames. ], tot_loss[loss=0.3664, simple_loss=0.405, pruned_loss=0.1639, over 4276558.96 frames. ], batch size: 703, lr: 3.52e-02, grad_scale: 32.0 2023-06-17 23:46:08,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-17 23:46:40,977 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.561e+02 4.162e+02 5.499e+02 1.221e+03, threshold=8.324e+02, percent-clipped=3.0 2023-06-17 23:47:06,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-17 23:47:19,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=58380.0, ans=0.0 2023-06-17 23:47:26,002 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:47:54,054 INFO [train.py:996] (2/4) Epoch 1, batch 9750, loss[loss=0.3431, simple_loss=0.3756, pruned_loss=0.1553, over 21238.00 frames. ], tot_loss[loss=0.3588, simple_loss=0.3958, pruned_loss=0.1609, over 4264051.90 frames. ], batch size: 159, lr: 3.51e-02, grad_scale: 32.0 2023-06-17 23:47:58,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=58500.0, ans=0.125 2023-06-17 23:48:21,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. 
limit=10.0 2023-06-17 23:48:43,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=58620.0, ans=0.0 2023-06-17 23:48:53,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58680.0, ans=0.1 2023-06-17 23:49:07,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=58680.0, ans=0.0 2023-06-17 23:49:43,190 INFO [train.py:996] (2/4) Epoch 1, batch 9800, loss[loss=0.3454, simple_loss=0.383, pruned_loss=0.1539, over 21887.00 frames. ], tot_loss[loss=0.3597, simple_loss=0.397, pruned_loss=0.1612, over 4257592.33 frames. ], batch size: 107, lr: 3.51e-02, grad_scale: 32.0 2023-06-17 23:50:48,190 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 3.805e+02 4.466e+02 5.878e+02 9.239e+02, threshold=8.932e+02, percent-clipped=2.0 2023-06-17 23:51:44,814 INFO [train.py:996] (2/4) Epoch 1, batch 9850, loss[loss=0.3511, simple_loss=0.3824, pruned_loss=0.1598, over 21914.00 frames. ], tot_loss[loss=0.3602, simple_loss=0.3972, pruned_loss=0.1616, over 4249885.67 frames. ], batch size: 316, lr: 3.50e-02, grad_scale: 32.0 2023-06-17 23:51:49,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=59100.0, ans=0.0 2023-06-17 23:52:10,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=59100.0, ans=0.125 2023-06-17 23:52:54,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=59280.0, ans=0.125 2023-06-17 23:53:42,102 INFO [train.py:996] (2/4) Epoch 1, batch 9900, loss[loss=0.3694, simple_loss=0.3735, pruned_loss=0.1826, over 21339.00 frames. ], tot_loss[loss=0.3556, simple_loss=0.3911, pruned_loss=0.1601, over 4248691.45 frames. ], batch size: 473, lr: 3.50e-02, grad_scale: 32.0 2023-06-17 23:54:05,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=59460.0, ans=0.2 2023-06-17 23:54:18,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=59460.0, ans=0.0 2023-06-17 23:54:25,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=59460.0, ans=0.07 2023-06-17 23:54:37,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.479e+02 4.439e+02 5.469e+02 9.869e+02, threshold=8.878e+02, percent-clipped=4.0 2023-06-17 23:54:49,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=59520.0, ans=0.2 2023-06-17 23:55:19,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=59640.0, ans=0.0 2023-06-17 23:55:27,580 INFO [train.py:996] (2/4) Epoch 1, batch 9950, loss[loss=0.393, simple_loss=0.4029, pruned_loss=0.1915, over 21569.00 frames. ], tot_loss[loss=0.3625, simple_loss=0.3959, pruned_loss=0.1646, over 4258216.33 frames. ], batch size: 415, lr: 3.49e-02, grad_scale: 32.0 2023-06-17 23:57:03,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.43 vs. 
limit=15.0 2023-06-17 23:57:04,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=59940.0, ans=0.125 2023-06-17 23:57:26,286 INFO [train.py:996] (2/4) Epoch 1, batch 10000, loss[loss=0.2999, simple_loss=0.3445, pruned_loss=0.1276, over 21266.00 frames. ], tot_loss[loss=0.3547, simple_loss=0.3889, pruned_loss=0.1603, over 4256407.49 frames. ], batch size: 176, lr: 3.49e-02, grad_scale: 32.0 2023-06-17 23:57:57,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=60060.0, ans=0.0 2023-06-17 23:58:18,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=60060.0, ans=0.0 2023-06-17 23:58:22,343 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.612e+02 3.683e+02 4.306e+02 5.455e+02 9.094e+02, threshold=8.612e+02, percent-clipped=1.0 2023-06-17 23:58:22,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=60120.0, ans=0.0 2023-06-17 23:58:51,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=60180.0, ans=0.0 2023-06-17 23:59:40,480 INFO [train.py:996] (2/4) Epoch 1, batch 10050, loss[loss=0.3177, simple_loss=0.3608, pruned_loss=0.1373, over 21703.00 frames. ], tot_loss[loss=0.3542, simple_loss=0.3893, pruned_loss=0.1596, over 4266044.99 frames. ], batch size: 298, lr: 3.48e-02, grad_scale: 32.0 2023-06-18 00:00:02,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=60300.0, ans=0.125 2023-06-18 00:00:04,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=60300.0, ans=0.125 2023-06-18 00:00:49,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=60420.0, ans=0.04949747468305833 2023-06-18 00:01:16,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=60480.0, ans=0.2 2023-06-18 00:01:34,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=60540.0, ans=0.125 2023-06-18 00:01:58,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=60540.0, ans=0.0 2023-06-18 00:02:00,513 INFO [train.py:996] (2/4) Epoch 1, batch 10100, loss[loss=0.3824, simple_loss=0.4198, pruned_loss=0.1725, over 21796.00 frames. ], tot_loss[loss=0.3503, simple_loss=0.3878, pruned_loss=0.1564, over 4267965.43 frames. ], batch size: 332, lr: 3.47e-02, grad_scale: 32.0 2023-06-18 00:02:15,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60600.0, ans=0.1 2023-06-18 00:02:24,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.75 vs. 
limit=12.0 2023-06-18 00:02:27,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=60660.0, ans=0.0 2023-06-18 00:02:45,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 3.604e+02 4.387e+02 5.436e+02 8.331e+02, threshold=8.774e+02, percent-clipped=0.0 2023-06-18 00:02:50,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=60720.0, ans=0.125 2023-06-18 00:02:52,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60720.0, ans=0.1 2023-06-18 00:04:07,997 INFO [train.py:996] (2/4) Epoch 1, batch 10150, loss[loss=0.3599, simple_loss=0.3838, pruned_loss=0.168, over 21831.00 frames. ], tot_loss[loss=0.3565, simple_loss=0.3937, pruned_loss=0.1596, over 4262687.78 frames. ], batch size: 107, lr: 3.47e-02, grad_scale: 32.0 2023-06-18 00:04:59,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=61080.0, ans=0.125 2023-06-18 00:05:21,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61140.0, ans=0.1 2023-06-18 00:05:33,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=15.0 2023-06-18 00:05:46,108 INFO [train.py:996] (2/4) Epoch 1, batch 10200, loss[loss=0.2655, simple_loss=0.3346, pruned_loss=0.09827, over 21626.00 frames. ], tot_loss[loss=0.35, simple_loss=0.3907, pruned_loss=0.1546, over 4266043.40 frames. ], batch size: 247, lr: 3.46e-02, grad_scale: 32.0 2023-06-18 00:06:08,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=61260.0, ans=0.125 2023-06-18 00:06:21,936 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 3.332e+02 4.219e+02 5.598e+02 1.332e+03, threshold=8.438e+02, percent-clipped=6.0 2023-06-18 00:07:15,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0 2023-06-18 00:07:23,934 INFO [train.py:996] (2/4) Epoch 1, batch 10250, loss[loss=0.4513, simple_loss=0.4646, pruned_loss=0.219, over 21354.00 frames. ], tot_loss[loss=0.3344, simple_loss=0.3813, pruned_loss=0.1437, over 4268490.18 frames. ], batch size: 507, lr: 3.46e-02, grad_scale: 16.0 2023-06-18 00:08:29,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=61680.0, ans=15.0 2023-06-18 00:08:35,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-18 00:08:53,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=61740.0, ans=0.125 2023-06-18 00:09:09,890 INFO [train.py:996] (2/4) Epoch 1, batch 10300, loss[loss=0.3966, simple_loss=0.4568, pruned_loss=0.1682, over 21276.00 frames. ], tot_loss[loss=0.3425, simple_loss=0.3889, pruned_loss=0.148, over 4268019.19 frames. 
], batch size: 549, lr: 3.45e-02, grad_scale: 16.0 2023-06-18 00:09:18,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-18 00:09:22,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=61800.0, ans=0.2 2023-06-18 00:09:46,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=61800.0, ans=0.125 2023-06-18 00:10:20,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=61860.0, ans=0.125 2023-06-18 00:10:24,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 3.585e+02 4.803e+02 7.199e+02 1.796e+03, threshold=9.605e+02, percent-clipped=14.0 2023-06-18 00:10:46,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=61980.0, ans=0.125 2023-06-18 00:11:02,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=61980.0, ans=0.2 2023-06-18 00:11:07,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62040.0, ans=0.1 2023-06-18 00:11:20,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-18 00:11:27,151 INFO [train.py:996] (2/4) Epoch 1, batch 10350, loss[loss=0.3239, simple_loss=0.3721, pruned_loss=0.1378, over 21649.00 frames. ], tot_loss[loss=0.3429, simple_loss=0.3897, pruned_loss=0.1481, over 4272558.12 frames. ], batch size: 351, lr: 3.45e-02, grad_scale: 16.0 2023-06-18 00:11:27,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=62100.0, ans=0.125 2023-06-18 00:11:29,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=62100.0, ans=0.125 2023-06-18 00:11:35,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=62100.0, ans=0.2 2023-06-18 00:11:37,351 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:12:13,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=62220.0, ans=0.125 2023-06-18 00:12:39,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=62280.0, ans=0.0 2023-06-18 00:12:43,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-18 00:13:08,617 INFO [train.py:996] (2/4) Epoch 1, batch 10400, loss[loss=0.409, simple_loss=0.4381, pruned_loss=0.1899, over 21441.00 frames. ], tot_loss[loss=0.3318, simple_loss=0.3777, pruned_loss=0.1429, over 4270062.00 frames. 
], batch size: 507, lr: 3.44e-02, grad_scale: 32.0 2023-06-18 00:13:48,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=62520.0, ans=0.0 2023-06-18 00:13:51,649 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.668e+02 4.396e+02 5.299e+02 1.049e+03, threshold=8.792e+02, percent-clipped=2.0 2023-06-18 00:13:52,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=62520.0, ans=0.2 2023-06-18 00:14:34,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=62640.0, ans=0.015 2023-06-18 00:14:45,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=62640.0, ans=0.2 2023-06-18 00:14:48,082 INFO [train.py:996] (2/4) Epoch 1, batch 10450, loss[loss=0.3526, simple_loss=0.3953, pruned_loss=0.155, over 21413.00 frames. ], tot_loss[loss=0.3401, simple_loss=0.3841, pruned_loss=0.148, over 4274890.82 frames. ], batch size: 211, lr: 3.44e-02, grad_scale: 32.0 2023-06-18 00:14:48,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62700.0, ans=0.1 2023-06-18 00:15:36,230 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.84 vs. limit=22.5 2023-06-18 00:15:42,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=62820.0, ans=0.04949747468305833 2023-06-18 00:15:44,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-06-18 00:15:46,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=22.5 2023-06-18 00:16:09,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=62880.0, ans=0.125 2023-06-18 00:16:56,099 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:16:59,888 INFO [train.py:996] (2/4) Epoch 1, batch 10500, loss[loss=0.3115, simple_loss=0.3361, pruned_loss=0.1435, over 21411.00 frames. ], tot_loss[loss=0.3426, simple_loss=0.3865, pruned_loss=0.1493, over 4265090.73 frames. ], batch size: 194, lr: 3.43e-02, grad_scale: 16.0 2023-06-18 00:17:48,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=63060.0, ans=0.0 2023-06-18 00:17:54,108 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.435e+02 3.581e+02 4.532e+02 5.617e+02 1.542e+03, threshold=9.064e+02, percent-clipped=4.0 2023-06-18 00:18:11,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0 2023-06-18 00:18:15,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=63180.0, ans=0.125 2023-06-18 00:18:40,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.27 vs. 
limit=15.0 2023-06-18 00:18:42,291 INFO [train.py:996] (2/4) Epoch 1, batch 10550, loss[loss=0.3303, simple_loss=0.3617, pruned_loss=0.1494, over 21806.00 frames. ], tot_loss[loss=0.3412, simple_loss=0.3821, pruned_loss=0.1502, over 4264279.03 frames. ], batch size: 352, lr: 3.43e-02, grad_scale: 16.0 2023-06-18 00:18:42,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=63300.0, ans=0.125 2023-06-18 00:18:43,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.39 vs. limit=22.5 2023-06-18 00:19:32,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=63420.0, ans=0.125 2023-06-18 00:19:59,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=63480.0, ans=0.125 2023-06-18 00:20:41,290 INFO [train.py:996] (2/4) Epoch 1, batch 10600, loss[loss=0.3207, simple_loss=0.3877, pruned_loss=0.1269, over 21748.00 frames. ], tot_loss[loss=0.335, simple_loss=0.3754, pruned_loss=0.1473, over 4251407.13 frames. ], batch size: 332, lr: 3.42e-02, grad_scale: 16.0 2023-06-18 00:21:42,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.306e+02 3.242e+02 3.917e+02 4.950e+02 7.459e+02, threshold=7.834e+02, percent-clipped=0.0 2023-06-18 00:21:53,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=63780.0, ans=0.125 2023-06-18 00:22:23,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-06-18 00:23:04,591 INFO [train.py:996] (2/4) Epoch 1, batch 10650, loss[loss=0.3978, simple_loss=0.4395, pruned_loss=0.178, over 19890.00 frames. ], tot_loss[loss=0.3353, simple_loss=0.3779, pruned_loss=0.1464, over 4239137.51 frames. ], batch size: 702, lr: 3.41e-02, grad_scale: 16.0 2023-06-18 00:23:23,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63960.0, ans=0.1 2023-06-18 00:24:27,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-18 00:25:10,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=64140.0, ans=0.0 2023-06-18 00:25:11,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=64140.0, ans=0.04949747468305833 2023-06-18 00:25:14,492 INFO [train.py:996] (2/4) Epoch 1, batch 10700, loss[loss=0.2841, simple_loss=0.329, pruned_loss=0.1197, over 21555.00 frames. ], tot_loss[loss=0.336, simple_loss=0.3779, pruned_loss=0.1471, over 4249347.13 frames. 
], batch size: 263, lr: 3.41e-02, grad_scale: 16.0 2023-06-18 00:25:16,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=64200.0, ans=0.2 2023-06-18 00:25:25,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=64200.0, ans=0.125 2023-06-18 00:25:27,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=64200.0, ans=0.2 2023-06-18 00:25:51,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=64320.0, ans=6.0 2023-06-18 00:25:53,674 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.366e+02 3.609e+02 4.403e+02 5.419e+02 8.654e+02, threshold=8.805e+02, percent-clipped=2.0 2023-06-18 00:25:58,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=64320.0, ans=0.0 2023-06-18 00:27:12,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=64440.0, ans=0.0 2023-06-18 00:27:12,465 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:27:17,067 INFO [train.py:996] (2/4) Epoch 1, batch 10750, loss[loss=0.351, simple_loss=0.4239, pruned_loss=0.139, over 21873.00 frames. ], tot_loss[loss=0.3488, simple_loss=0.3903, pruned_loss=0.1536, over 4249137.01 frames. ], batch size: 316, lr: 3.40e-02, grad_scale: 16.0 2023-06-18 00:28:39,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=64680.0, ans=0.0 2023-06-18 00:29:02,449 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:29:06,561 INFO [train.py:996] (2/4) Epoch 1, batch 10800, loss[loss=0.4119, simple_loss=0.4412, pruned_loss=0.1913, over 21694.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3955, pruned_loss=0.1546, over 4264074.94 frames. ], batch size: 351, lr: 3.40e-02, grad_scale: 32.0 2023-06-18 00:29:24,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=64800.0, ans=0.125 2023-06-18 00:29:40,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64860.0, ans=0.1 2023-06-18 00:30:01,343 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.483e+02 3.969e+02 4.979e+02 7.720e+02, threshold=7.938e+02, percent-clipped=0.0 2023-06-18 00:30:59,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=65040.0, ans=0.125 2023-06-18 00:31:13,348 INFO [train.py:996] (2/4) Epoch 1, batch 10850, loss[loss=0.3546, simple_loss=0.388, pruned_loss=0.1606, over 21561.00 frames. ], tot_loss[loss=0.3557, simple_loss=0.3983, pruned_loss=0.1566, over 4260330.89 frames. ], batch size: 441, lr: 3.39e-02, grad_scale: 32.0 2023-06-18 00:31:17,170 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.11 vs. 
limit=15.0 2023-06-18 00:31:32,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=65160.0, ans=10.0 2023-06-18 00:31:51,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=65220.0, ans=0.125 2023-06-18 00:32:12,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=65220.0, ans=0.0 2023-06-18 00:32:50,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=65340.0, ans=0.125 2023-06-18 00:33:12,049 INFO [train.py:996] (2/4) Epoch 1, batch 10900, loss[loss=0.3, simple_loss=0.351, pruned_loss=0.1245, over 21175.00 frames. ], tot_loss[loss=0.3479, simple_loss=0.3902, pruned_loss=0.1528, over 4269055.22 frames. ], batch size: 143, lr: 3.39e-02, grad_scale: 32.0 2023-06-18 00:33:24,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=65400.0, ans=0.2 2023-06-18 00:33:34,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=65460.0, ans=0.0 2023-06-18 00:33:51,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.204e+02 4.085e+02 4.812e+02 7.317e+02, threshold=8.170e+02, percent-clipped=0.0 2023-06-18 00:34:20,214 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:34:27,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=65580.0, ans=0.125 2023-06-18 00:34:29,192 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.058e-01 2023-06-18 00:34:35,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=65640.0, ans=0.025 2023-06-18 00:34:50,800 INFO [train.py:996] (2/4) Epoch 1, batch 10950, loss[loss=0.3968, simple_loss=0.4298, pruned_loss=0.182, over 20656.00 frames. ], tot_loss[loss=0.3428, simple_loss=0.385, pruned_loss=0.1503, over 4264052.21 frames. ], batch size: 607, lr: 3.38e-02, grad_scale: 32.0 2023-06-18 00:35:11,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65760.0, ans=0.1 2023-06-18 00:36:45,678 INFO [train.py:996] (2/4) Epoch 1, batch 11000, loss[loss=0.3645, simple_loss=0.3918, pruned_loss=0.1686, over 21939.00 frames. ], tot_loss[loss=0.3453, simple_loss=0.3853, pruned_loss=0.1527, over 4267600.11 frames. ], batch size: 316, lr: 3.38e-02, grad_scale: 32.0 2023-06-18 00:37:00,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=66060.0, ans=0.05 2023-06-18 00:37:35,146 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.527e+02 3.687e+02 4.482e+02 5.428e+02 9.093e+02, threshold=8.964e+02, percent-clipped=3.0 2023-06-18 00:38:02,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=66180.0, ans=0.1 2023-06-18 00:38:49,265 INFO [train.py:996] (2/4) Epoch 1, batch 11050, loss[loss=0.3351, simple_loss=0.3703, pruned_loss=0.1499, over 20006.00 frames. ], tot_loss[loss=0.3454, simple_loss=0.3834, pruned_loss=0.1537, over 4261578.01 frames. 
], batch size: 703, lr: 3.37e-02, grad_scale: 32.0 2023-06-18 00:38:49,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=66300.0, ans=0.0 2023-06-18 00:39:34,111 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.82 vs. limit=15.0 2023-06-18 00:39:52,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=66420.0, ans=0.0 2023-06-18 00:39:53,609 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:40:15,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=66480.0, ans=0.0 2023-06-18 00:40:16,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=66480.0, ans=0.125 2023-06-18 00:40:44,262 INFO [train.py:996] (2/4) Epoch 1, batch 11100, loss[loss=0.3379, simple_loss=0.372, pruned_loss=0.1519, over 21727.00 frames. ], tot_loss[loss=0.3436, simple_loss=0.3799, pruned_loss=0.1536, over 4261926.43 frames. ], batch size: 351, lr: 3.37e-02, grad_scale: 32.0 2023-06-18 00:40:56,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=66600.0, ans=0.125 2023-06-18 00:41:14,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=66660.0, ans=0.125 2023-06-18 00:41:18,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=66660.0, ans=0.125 2023-06-18 00:41:22,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=66660.0, ans=0.04949747468305833 2023-06-18 00:41:39,688 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.585e+02 3.202e+02 3.919e+02 4.833e+02 8.145e+02, threshold=7.838e+02, percent-clipped=0.0 2023-06-18 00:42:39,304 INFO [train.py:996] (2/4) Epoch 1, batch 11150, loss[loss=0.331, simple_loss=0.3363, pruned_loss=0.1629, over 20310.00 frames. ], tot_loss[loss=0.3399, simple_loss=0.3768, pruned_loss=0.1515, over 4254857.44 frames. ], batch size: 703, lr: 3.36e-02, grad_scale: 32.0 2023-06-18 00:42:41,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=66900.0, ans=0.0 2023-06-18 00:42:44,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=66900.0, ans=0.125 2023-06-18 00:43:27,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=67020.0, ans=0.07 2023-06-18 00:43:47,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67080.0, ans=0.1 2023-06-18 00:43:50,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67080.0, ans=0.1 2023-06-18 00:44:08,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.05 vs. 
limit=6.0 2023-06-18 00:44:28,575 INFO [train.py:996] (2/4) Epoch 1, batch 11200, loss[loss=0.294, simple_loss=0.3326, pruned_loss=0.1278, over 15496.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3748, pruned_loss=0.1503, over 4237845.79 frames. ], batch size: 61, lr: 3.36e-02, grad_scale: 32.0 2023-06-18 00:44:39,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=67200.0, ans=0.2 2023-06-18 00:45:07,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.477e+02 4.166e+02 5.140e+02 9.115e+02, threshold=8.331e+02, percent-clipped=3.0 2023-06-18 00:45:25,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=67380.0, ans=0.0 2023-06-18 00:45:27,606 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-18 00:45:28,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=67380.0, ans=0.125 2023-06-18 00:45:38,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=67380.0, ans=0.125 2023-06-18 00:46:05,994 INFO [train.py:996] (2/4) Epoch 1, batch 11250, loss[loss=0.304, simple_loss=0.3713, pruned_loss=0.1183, over 21463.00 frames. ], tot_loss[loss=0.3378, simple_loss=0.3744, pruned_loss=0.1506, over 4234815.83 frames. ], batch size: 131, lr: 3.35e-02, grad_scale: 32.0 2023-06-18 00:47:18,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=67680.0, ans=0.0 2023-06-18 00:47:49,278 INFO [train.py:996] (2/4) Epoch 1, batch 11300, loss[loss=0.3697, simple_loss=0.4156, pruned_loss=0.1619, over 21772.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.376, pruned_loss=0.1508, over 4241424.34 frames. ], batch size: 414, lr: 3.35e-02, grad_scale: 32.0 2023-06-18 00:49:00,213 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.206e+02 3.828e+02 4.812e+02 8.998e+02, threshold=7.656e+02, percent-clipped=1.0 2023-06-18 00:49:00,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=67920.0, ans=0.2 2023-06-18 00:49:26,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=67980.0, ans=0.125 2023-06-18 00:49:44,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=68040.0, ans=0.02 2023-06-18 00:50:09,194 INFO [train.py:996] (2/4) Epoch 1, batch 11350, loss[loss=0.4052, simple_loss=0.4658, pruned_loss=0.1723, over 20817.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3806, pruned_loss=0.1515, over 4251550.85 frames. ], batch size: 608, lr: 3.34e-02, grad_scale: 16.0 2023-06-18 00:50:13,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68100.0, ans=0.1 2023-06-18 00:50:49,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=68220.0, ans=0.2 2023-06-18 00:52:07,602 INFO [train.py:996] (2/4) Epoch 1, batch 11400, loss[loss=0.3721, simple_loss=0.4209, pruned_loss=0.1616, over 21862.00 frames. 
], tot_loss[loss=0.3512, simple_loss=0.39, pruned_loss=0.1562, over 4262261.42 frames. ], batch size: 372, lr: 3.34e-02, grad_scale: 16.0 2023-06-18 00:52:09,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=68400.0, ans=0.0 2023-06-18 00:52:53,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-18 00:53:03,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=68460.0, ans=0.125 2023-06-18 00:53:11,480 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:53:16,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.896e+02 4.865e+02 6.013e+02 1.206e+03, threshold=9.731e+02, percent-clipped=12.0 2023-06-18 00:53:27,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=68520.0, ans=0.125 2023-06-18 00:54:10,462 INFO [train.py:996] (2/4) Epoch 1, batch 11450, loss[loss=0.4961, simple_loss=0.4975, pruned_loss=0.2473, over 21352.00 frames. ], tot_loss[loss=0.3495, simple_loss=0.3898, pruned_loss=0.1546, over 4246851.23 frames. ], batch size: 508, lr: 3.33e-02, grad_scale: 16.0 2023-06-18 00:54:24,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=68700.0, ans=0.125 2023-06-18 00:54:45,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=68760.0, ans=0.0 2023-06-18 00:54:46,006 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.03 vs. limit=15.0 2023-06-18 00:55:26,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=68820.0, ans=0.0 2023-06-18 00:55:35,445 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:55:39,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.64 vs. limit=22.5 2023-06-18 00:56:02,444 INFO [train.py:996] (2/4) Epoch 1, batch 11500, loss[loss=0.3938, simple_loss=0.437, pruned_loss=0.1753, over 19982.00 frames. ], tot_loss[loss=0.352, simple_loss=0.3931, pruned_loss=0.1555, over 4248250.64 frames. ], batch size: 703, lr: 3.33e-02, grad_scale: 16.0 2023-06-18 00:56:24,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=69000.0, ans=0.125 2023-06-18 00:57:20,751 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.435e+02 4.128e+02 5.314e+02 9.134e+02, threshold=8.255e+02, percent-clipped=0.0 2023-06-18 00:58:13,017 INFO [train.py:996] (2/4) Epoch 1, batch 11550, loss[loss=0.5287, simple_loss=0.5729, pruned_loss=0.2422, over 21488.00 frames. ], tot_loss[loss=0.3522, simple_loss=0.3976, pruned_loss=0.1534, over 4262294.74 frames. 
], batch size: 471, lr: 3.32e-02, grad_scale: 16.0 2023-06-18 00:58:22,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=69300.0, ans=0.125 2023-06-18 00:59:12,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-18 01:00:05,194 INFO [train.py:996] (2/4) Epoch 1, batch 11600, loss[loss=0.418, simple_loss=0.5007, pruned_loss=0.1676, over 21261.00 frames. ], tot_loss[loss=0.3639, simple_loss=0.4154, pruned_loss=0.1562, over 4263254.47 frames. ], batch size: 549, lr: 3.32e-02, grad_scale: 32.0 2023-06-18 01:00:32,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=69660.0, ans=0.05 2023-06-18 01:00:40,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-18 01:00:41,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=69660.0, ans=0.125 2023-06-18 01:00:48,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-18 01:00:50,978 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.524e+02 3.816e+02 4.671e+02 6.267e+02 1.056e+03, threshold=9.343e+02, percent-clipped=9.0 2023-06-18 01:00:51,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=69720.0, ans=0.125 2023-06-18 01:01:11,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-18 01:01:57,958 INFO [train.py:996] (2/4) Epoch 1, batch 11650, loss[loss=0.4316, simple_loss=0.4487, pruned_loss=0.2073, over 21337.00 frames. ], tot_loss[loss=0.3652, simple_loss=0.4194, pruned_loss=0.1555, over 4260964.50 frames. ], batch size: 471, lr: 3.31e-02, grad_scale: 32.0 2023-06-18 01:02:22,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=69960.0, ans=0.125 2023-06-18 01:02:33,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=69960.0, ans=0.0 2023-06-18 01:02:36,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=69960.0, ans=0.125 2023-06-18 01:02:57,172 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.10 vs. limit=15.0 2023-06-18 01:03:37,388 INFO [train.py:996] (2/4) Epoch 1, batch 11700, loss[loss=0.319, simple_loss=0.3528, pruned_loss=0.1426, over 21731.00 frames. ], tot_loss[loss=0.3578, simple_loss=0.408, pruned_loss=0.1538, over 4261242.90 frames. 
], batch size: 317, lr: 3.31e-02, grad_scale: 32.0 2023-06-18 01:04:17,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=70320.0, ans=0.125 2023-06-18 01:04:23,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.615e+02 4.235e+02 5.124e+02 6.443e+02 9.190e+02, threshold=1.025e+03, percent-clipped=0.0 2023-06-18 01:04:23,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=70320.0, ans=0.125 2023-06-18 01:04:45,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-18 01:05:08,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=70440.0, ans=0.0 2023-06-18 01:05:13,839 INFO [train.py:996] (2/4) Epoch 1, batch 11750, loss[loss=0.4377, simple_loss=0.4449, pruned_loss=0.2152, over 21675.00 frames. ], tot_loss[loss=0.3526, simple_loss=0.3976, pruned_loss=0.1538, over 4268467.55 frames. ], batch size: 441, lr: 3.30e-02, grad_scale: 32.0 2023-06-18 01:05:25,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70500.0, ans=0.125 2023-06-18 01:06:06,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2023-06-18 01:06:18,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=70680.0, ans=0.05 2023-06-18 01:06:49,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=70740.0, ans=0.0 2023-06-18 01:06:56,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=70740.0, ans=0.125 2023-06-18 01:07:21,212 INFO [train.py:996] (2/4) Epoch 1, batch 11800, loss[loss=0.3405, simple_loss=0.4241, pruned_loss=0.1285, over 21771.00 frames. ], tot_loss[loss=0.3569, simple_loss=0.3995, pruned_loss=0.1572, over 4268529.18 frames. ], batch size: 282, lr: 3.30e-02, grad_scale: 32.0 2023-06-18 01:07:31,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=70800.0, ans=0.125 2023-06-18 01:08:07,088 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:08:23,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 3.432e+02 4.026e+02 5.092e+02 8.722e+02, threshold=8.051e+02, percent-clipped=0.0 2023-06-18 01:08:48,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.98 vs. limit=6.0 2023-06-18 01:08:49,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70980.0, ans=0.1 2023-06-18 01:09:32,037 INFO [train.py:996] (2/4) Epoch 1, batch 11850, loss[loss=0.423, simple_loss=0.4327, pruned_loss=0.2067, over 20099.00 frames. ], tot_loss[loss=0.3562, simple_loss=0.4005, pruned_loss=0.1559, over 4272427.47 frames. 
], batch size: 707, lr: 3.29e-02, grad_scale: 32.0 2023-06-18 01:09:35,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=71100.0, ans=0.125 2023-06-18 01:10:55,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=71340.0, ans=0.02 2023-06-18 01:11:21,898 INFO [train.py:996] (2/4) Epoch 1, batch 11900, loss[loss=0.3419, simple_loss=0.4035, pruned_loss=0.1401, over 21674.00 frames. ], tot_loss[loss=0.3519, simple_loss=0.3992, pruned_loss=0.1523, over 4263333.64 frames. ], batch size: 414, lr: 3.29e-02, grad_scale: 16.0 2023-06-18 01:11:22,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71400.0, ans=0.1 2023-06-18 01:11:40,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71460.0, ans=0.1 2023-06-18 01:12:19,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=71520.0, ans=0.125 2023-06-18 01:12:27,626 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.389e+02 4.230e+02 4.928e+02 7.939e+02, threshold=8.459e+02, percent-clipped=0.0 2023-06-18 01:12:57,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=71580.0, ans=0.125 2023-06-18 01:13:11,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=71580.0, ans=0.125 2023-06-18 01:13:35,574 INFO [train.py:996] (2/4) Epoch 1, batch 11950, loss[loss=0.2606, simple_loss=0.3422, pruned_loss=0.08952, over 21578.00 frames. ], tot_loss[loss=0.3431, simple_loss=0.3952, pruned_loss=0.1455, over 4272876.13 frames. ], batch size: 230, lr: 3.28e-02, grad_scale: 16.0 2023-06-18 01:13:45,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=71700.0, ans=0.125 2023-06-18 01:14:09,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=71760.0, ans=0.125 2023-06-18 01:14:21,726 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-18 01:15:26,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=71940.0, ans=0.125 2023-06-18 01:15:43,907 INFO [train.py:996] (2/4) Epoch 1, batch 12000, loss[loss=0.328, simple_loss=0.3612, pruned_loss=0.1474, over 21874.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3878, pruned_loss=0.1438, over 4270653.44 frames. ], batch size: 98, lr: 3.28e-02, grad_scale: 32.0 2023-06-18 01:15:43,908 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 01:16:39,564 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3214, simple_loss=0.4077, pruned_loss=0.1176, over 1796401.00 frames. 2023-06-18 01:16:39,565 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 01:16:48,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.79 vs. 
limit=15.0 2023-06-18 01:16:59,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=72060.0, ans=0.125 2023-06-18 01:17:26,124 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.117e+02 3.794e+02 4.594e+02 6.987e+02, threshold=7.589e+02, percent-clipped=0.0 2023-06-18 01:17:34,084 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.464e-01 2023-06-18 01:17:34,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.09 vs. limit=10.0 2023-06-18 01:17:57,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72180.0, ans=0.1 2023-06-18 01:18:16,522 INFO [train.py:996] (2/4) Epoch 1, batch 12050, loss[loss=0.3271, simple_loss=0.3616, pruned_loss=0.1463, over 21246.00 frames. ], tot_loss[loss=0.3399, simple_loss=0.3856, pruned_loss=0.1471, over 4275905.92 frames. ], batch size: 176, lr: 3.27e-02, grad_scale: 32.0 2023-06-18 01:18:18,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=15.0 2023-06-18 01:18:45,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=72360.0, ans=0.0 2023-06-18 01:18:48,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.79 vs. limit=22.5 2023-06-18 01:19:15,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72420.0, ans=0.1 2023-06-18 01:20:12,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=72540.0, ans=0.2 2023-06-18 01:20:39,881 INFO [train.py:996] (2/4) Epoch 1, batch 12100, loss[loss=0.3563, simple_loss=0.4327, pruned_loss=0.1399, over 19771.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3963, pruned_loss=0.1542, over 4272182.34 frames. ], batch size: 702, lr: 3.27e-02, grad_scale: 32.0 2023-06-18 01:20:41,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=72600.0, ans=0.0 2023-06-18 01:20:49,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=72600.0, ans=0.035 2023-06-18 01:21:32,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.780e+02 3.918e+02 4.969e+02 6.272e+02 1.033e+03, threshold=9.938e+02, percent-clipped=11.0 2023-06-18 01:21:33,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0 2023-06-18 01:22:41,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=72840.0, ans=0.125 2023-06-18 01:22:52,400 INFO [train.py:996] (2/4) Epoch 1, batch 12150, loss[loss=0.3378, simple_loss=0.4129, pruned_loss=0.1313, over 21818.00 frames. ], tot_loss[loss=0.355, simple_loss=0.4001, pruned_loss=0.155, over 4273847.52 frames. 
], batch size: 371, lr: 3.26e-02, grad_scale: 32.0 2023-06-18 01:24:07,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73020.0, ans=0.1 2023-06-18 01:25:11,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=73140.0, ans=0.125 2023-06-18 01:25:13,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=73140.0, ans=0.125 2023-06-18 01:25:22,169 INFO [train.py:996] (2/4) Epoch 1, batch 12200, loss[loss=0.3327, simple_loss=0.3645, pruned_loss=0.1504, over 21849.00 frames. ], tot_loss[loss=0.3505, simple_loss=0.3941, pruned_loss=0.1534, over 4268236.39 frames. ], batch size: 98, lr: 3.26e-02, grad_scale: 32.0 2023-06-18 01:25:25,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=73200.0, ans=0.125 2023-06-18 01:25:27,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=22.5 2023-06-18 01:25:57,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-18 01:26:09,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=73320.0, ans=0.2 2023-06-18 01:26:10,496 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.716e+02 3.638e+02 4.548e+02 5.792e+02 8.635e+02, threshold=9.096e+02, percent-clipped=0.0 2023-06-18 01:26:22,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73380.0, ans=0.1 2023-06-18 01:26:23,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-18 01:26:26,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=73380.0, ans=0.0 2023-06-18 01:27:17,666 INFO [train.py:996] (2/4) Epoch 1, batch 12250, loss[loss=0.2175, simple_loss=0.2886, pruned_loss=0.07314, over 21516.00 frames. ], tot_loss[loss=0.3396, simple_loss=0.3838, pruned_loss=0.1476, over 4263771.34 frames. ], batch size: 212, lr: 3.25e-02, grad_scale: 32.0 2023-06-18 01:27:23,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=73500.0, ans=10.0 2023-06-18 01:27:24,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=73500.0, ans=0.125 2023-06-18 01:27:41,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-18 01:28:03,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=73620.0, ans=0.0 2023-06-18 01:28:20,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73680.0, ans=0.1 2023-06-18 01:29:07,459 INFO [train.py:996] (2/4) Epoch 1, batch 12300, loss[loss=0.3376, simple_loss=0.4039, pruned_loss=0.1357, over 21639.00 frames. 
], tot_loss[loss=0.3235, simple_loss=0.3727, pruned_loss=0.1372, over 4271338.31 frames. ], batch size: 389, lr: 3.25e-02, grad_scale: 32.0 2023-06-18 01:30:05,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73920.0, ans=0.1 2023-06-18 01:30:09,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.199e+02 4.044e+02 5.045e+02 8.506e+02, threshold=8.089e+02, percent-clipped=0.0 2023-06-18 01:31:15,747 INFO [train.py:996] (2/4) Epoch 1, batch 12350, loss[loss=0.347, simple_loss=0.3938, pruned_loss=0.1502, over 21838.00 frames. ], tot_loss[loss=0.3309, simple_loss=0.3809, pruned_loss=0.1404, over 4276647.24 frames. ], batch size: 332, lr: 3.24e-02, grad_scale: 32.0 2023-06-18 01:31:51,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-18 01:32:01,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=74220.0, ans=0.125 2023-06-18 01:32:44,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=74340.0, ans=0.5 2023-06-18 01:32:51,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=74340.0, ans=0.125 2023-06-18 01:33:05,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=74340.0, ans=0.125 2023-06-18 01:33:09,311 INFO [train.py:996] (2/4) Epoch 1, batch 12400, loss[loss=0.3695, simple_loss=0.4105, pruned_loss=0.1643, over 21720.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.3819, pruned_loss=0.1433, over 4275268.76 frames. ], batch size: 389, lr: 3.24e-02, grad_scale: 32.0 2023-06-18 01:33:09,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=74400.0, ans=0.2 2023-06-18 01:34:07,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.455e+02 3.656e+02 4.405e+02 5.426e+02 8.475e+02, threshold=8.810e+02, percent-clipped=2.0 2023-06-18 01:34:09,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=74520.0, ans=0.125 2023-06-18 01:35:23,698 INFO [train.py:996] (2/4) Epoch 1, batch 12450, loss[loss=0.3918, simple_loss=0.4295, pruned_loss=0.177, over 21874.00 frames. ], tot_loss[loss=0.3424, simple_loss=0.3875, pruned_loss=0.1486, over 4284512.97 frames. ], batch size: 371, lr: 3.23e-02, grad_scale: 32.0 2023-06-18 01:36:14,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=74760.0, ans=0.0 2023-06-18 01:36:14,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.45 vs. 
limit=15.0 2023-06-18 01:36:43,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=74880.0, ans=0.0 2023-06-18 01:36:45,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=74880.0, ans=0.125 2023-06-18 01:37:24,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-18 01:37:44,974 INFO [train.py:996] (2/4) Epoch 1, batch 12500, loss[loss=0.3667, simple_loss=0.4364, pruned_loss=0.1485, over 21390.00 frames. ], tot_loss[loss=0.3567, simple_loss=0.4018, pruned_loss=0.1558, over 4288268.70 frames. ], batch size: 194, lr: 3.23e-02, grad_scale: 32.0 2023-06-18 01:37:49,375 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.85 vs. limit=22.5 2023-06-18 01:38:24,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=75060.0, ans=0.125 2023-06-18 01:38:34,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=75060.0, ans=0.0 2023-06-18 01:38:52,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.484e+02 4.470e+02 5.562e+02 9.789e+02, threshold=8.941e+02, percent-clipped=2.0 2023-06-18 01:39:15,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=75180.0, ans=0.2 2023-06-18 01:39:23,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=75180.0, ans=0.05 2023-06-18 01:40:06,040 INFO [train.py:996] (2/4) Epoch 1, batch 12550, loss[loss=0.4925, simple_loss=0.5015, pruned_loss=0.2418, over 21353.00 frames. ], tot_loss[loss=0.3642, simple_loss=0.4086, pruned_loss=0.1599, over 4284142.21 frames. ], batch size: 507, lr: 3.22e-02, grad_scale: 16.0 2023-06-18 01:40:12,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=22.5 2023-06-18 01:41:21,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=75420.0, ans=0.0 2023-06-18 01:41:29,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=75480.0, ans=0.2 2023-06-18 01:41:49,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=75540.0, ans=0.09899494936611666 2023-06-18 01:42:06,768 INFO [train.py:996] (2/4) Epoch 1, batch 12600, loss[loss=0.3529, simple_loss=0.41, pruned_loss=0.1479, over 21659.00 frames. ], tot_loss[loss=0.3567, simple_loss=0.4039, pruned_loss=0.1548, over 4283406.99 frames. 
], batch size: 414, lr: 3.22e-02, grad_scale: 16.0 2023-06-18 01:42:14,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=75600.0, ans=0.0 2023-06-18 01:42:33,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=75600.0, ans=0.0 2023-06-18 01:42:34,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=75660.0, ans=0.125 2023-06-18 01:43:03,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=75660.0, ans=0.05 2023-06-18 01:43:20,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=75720.0, ans=0.125 2023-06-18 01:43:24,483 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 3.361e+02 4.155e+02 4.807e+02 7.710e+02, threshold=8.311e+02, percent-clipped=0.0 2023-06-18 01:43:26,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=75720.0, ans=0.0 2023-06-18 01:43:49,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-18 01:44:08,039 INFO [train.py:996] (2/4) Epoch 1, batch 12650, loss[loss=0.3012, simple_loss=0.3463, pruned_loss=0.1281, over 21685.00 frames. ], tot_loss[loss=0.344, simple_loss=0.3941, pruned_loss=0.147, over 4290356.47 frames. ], batch size: 230, lr: 3.21e-02, grad_scale: 16.0 2023-06-18 01:44:38,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-18 01:45:26,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=76080.0, ans=0.0 2023-06-18 01:45:35,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=76140.0, ans=0.2 2023-06-18 01:45:45,869 INFO [train.py:996] (2/4) Epoch 1, batch 12700, loss[loss=0.3583, simple_loss=0.3923, pruned_loss=0.1621, over 21261.00 frames. ], tot_loss[loss=0.3484, simple_loss=0.3947, pruned_loss=0.1511, over 4293164.30 frames. ], batch size: 176, lr: 3.21e-02, grad_scale: 16.0 2023-06-18 01:46:27,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76320.0, ans=0.1 2023-06-18 01:46:39,953 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 4.163e+02 4.705e+02 5.622e+02 1.035e+03, threshold=9.411e+02, percent-clipped=4.0 2023-06-18 01:46:46,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76380.0, ans=0.1 2023-06-18 01:47:01,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76440.0, ans=0.1 2023-06-18 01:47:16,591 INFO [train.py:996] (2/4) Epoch 1, batch 12750, loss[loss=0.3368, simple_loss=0.3799, pruned_loss=0.1468, over 21894.00 frames. ], tot_loss[loss=0.3489, simple_loss=0.3959, pruned_loss=0.1509, over 4286113.46 frames. 
], batch size: 118, lr: 3.20e-02, grad_scale: 16.0 2023-06-18 01:48:40,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=76680.0, ans=0.125 2023-06-18 01:49:32,646 INFO [train.py:996] (2/4) Epoch 1, batch 12800, loss[loss=0.3369, simple_loss=0.3778, pruned_loss=0.148, over 21436.00 frames. ], tot_loss[loss=0.3489, simple_loss=0.3946, pruned_loss=0.1516, over 4288513.37 frames. ], batch size: 211, lr: 3.20e-02, grad_scale: 32.0 2023-06-18 01:49:36,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=76800.0, ans=0.125 2023-06-18 01:50:17,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=76860.0, ans=15.0 2023-06-18 01:50:21,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76860.0, ans=0.1 2023-06-18 01:50:24,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=76920.0, ans=0.2 2023-06-18 01:50:33,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.808e+02 4.446e+02 5.808e+02 1.607e+03, threshold=8.892e+02, percent-clipped=7.0 2023-06-18 01:50:49,356 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:51:59,023 INFO [train.py:996] (2/4) Epoch 1, batch 12850, loss[loss=0.4186, simple_loss=0.4723, pruned_loss=0.1825, over 19837.00 frames. ], tot_loss[loss=0.354, simple_loss=0.3987, pruned_loss=0.1546, over 4287980.84 frames. ], batch size: 703, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 01:52:54,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-18 01:52:59,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=77280.0, ans=0.125 2023-06-18 01:53:55,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77340.0, ans=0.1 2023-06-18 01:54:07,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.22 vs. limit=15.0 2023-06-18 01:54:12,626 INFO [train.py:996] (2/4) Epoch 1, batch 12900, loss[loss=0.3129, simple_loss=0.3811, pruned_loss=0.1223, over 21829.00 frames. ], tot_loss[loss=0.3435, simple_loss=0.392, pruned_loss=0.1475, over 4284318.92 frames. ], batch size: 333, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 01:54:50,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.899e+02 3.578e+02 4.051e+02 7.858e+02, threshold=7.156e+02, percent-clipped=0.0 2023-06-18 01:55:01,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=77580.0, ans=0.2 2023-06-18 01:55:50,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=77640.0, ans=0.2 2023-06-18 01:55:57,446 INFO [train.py:996] (2/4) Epoch 1, batch 12950, loss[loss=0.3674, simple_loss=0.4461, pruned_loss=0.1444, over 19794.00 frames. ], tot_loss[loss=0.3373, simple_loss=0.3877, pruned_loss=0.1435, over 4277162.34 frames. 
], batch size: 703, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 01:56:30,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-18 01:57:01,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.36 vs. limit=15.0 2023-06-18 01:57:02,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=77820.0, ans=0.0 2023-06-18 01:57:13,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=77820.0, ans=0.125 2023-06-18 01:57:16,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=77880.0, ans=0.0 2023-06-18 01:57:59,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=77940.0, ans=0.2 2023-06-18 01:58:01,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-06-18 01:58:03,850 INFO [train.py:996] (2/4) Epoch 1, batch 13000, loss[loss=0.3477, simple_loss=0.3952, pruned_loss=0.1501, over 21500.00 frames. ], tot_loss[loss=0.3386, simple_loss=0.3888, pruned_loss=0.1442, over 4274839.07 frames. ], batch size: 471, lr: 3.18e-02, grad_scale: 32.0 2023-06-18 01:58:04,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0 2023-06-18 01:58:27,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78060.0, ans=0.1 2023-06-18 01:58:54,015 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.081e+02 4.107e+02 5.952e+02 9.573e+02, threshold=8.214e+02, percent-clipped=12.0 2023-06-18 01:59:15,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78180.0, ans=0.1 2023-06-18 01:59:20,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=78180.0, ans=0.125 2023-06-18 02:00:08,207 INFO [train.py:996] (2/4) Epoch 1, batch 13050, loss[loss=0.3454, simple_loss=0.3863, pruned_loss=0.1522, over 21438.00 frames. ], tot_loss[loss=0.3329, simple_loss=0.3839, pruned_loss=0.141, over 4279682.63 frames. ], batch size: 131, lr: 3.18e-02, grad_scale: 32.0 2023-06-18 02:00:37,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-18 02:01:00,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=78480.0, ans=0.0 2023-06-18 02:01:39,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=78540.0, ans=0.0 2023-06-18 02:01:56,468 INFO [train.py:996] (2/4) Epoch 1, batch 13100, loss[loss=0.3083, simple_loss=0.3739, pruned_loss=0.1213, over 21807.00 frames. ], tot_loss[loss=0.3339, simple_loss=0.3856, pruned_loss=0.1411, over 4280497.42 frames. 
], batch size: 282, lr: 3.17e-02, grad_scale: 32.0 2023-06-18 02:02:23,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=78600.0, ans=0.125 2023-06-18 02:02:24,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=78600.0, ans=0.015 2023-06-18 02:02:28,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=15.0 2023-06-18 02:03:16,367 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.573e+02 4.115e+02 5.192e+02 9.461e+02, threshold=8.229e+02, percent-clipped=5.0 2023-06-18 02:04:05,260 INFO [train.py:996] (2/4) Epoch 1, batch 13150, loss[loss=0.3112, simple_loss=0.3521, pruned_loss=0.1351, over 21795.00 frames. ], tot_loss[loss=0.3417, simple_loss=0.3896, pruned_loss=0.1469, over 4282706.12 frames. ], batch size: 124, lr: 3.17e-02, grad_scale: 32.0 2023-06-18 02:04:23,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=78900.0, ans=0.125 2023-06-18 02:05:09,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79020.0, ans=0.1 2023-06-18 02:05:36,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=79080.0, ans=0.125 2023-06-18 02:05:44,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=79080.0, ans=0.125 2023-06-18 02:06:18,762 INFO [train.py:996] (2/4) Epoch 1, batch 13200, loss[loss=0.3591, simple_loss=0.4037, pruned_loss=0.1572, over 21799.00 frames. ], tot_loss[loss=0.3441, simple_loss=0.3906, pruned_loss=0.1488, over 4278719.76 frames. ], batch size: 247, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 02:06:34,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=79200.0, ans=0.125 2023-06-18 02:07:01,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79260.0, ans=0.1 2023-06-18 02:07:24,181 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:07:33,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.66 vs. limit=15.0 2023-06-18 02:07:43,689 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 3.259e+02 3.924e+02 5.244e+02 8.674e+02, threshold=7.848e+02, percent-clipped=1.0 2023-06-18 02:08:22,172 INFO [train.py:996] (2/4) Epoch 1, batch 13250, loss[loss=0.3485, simple_loss=0.3873, pruned_loss=0.1549, over 21822.00 frames. ], tot_loss[loss=0.348, simple_loss=0.3921, pruned_loss=0.152, over 4279987.06 frames. ], batch size: 107, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 02:08:53,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. 
limit=15.0 2023-06-18 02:09:35,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=79620.0, ans=0.0 2023-06-18 02:09:54,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=79680.0, ans=0.125 2023-06-18 02:10:28,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-18 02:10:54,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=79800.0, ans=0.125 2023-06-18 02:10:55,440 INFO [train.py:996] (2/4) Epoch 1, batch 13300, loss[loss=0.3089, simple_loss=0.4245, pruned_loss=0.09662, over 19807.00 frames. ], tot_loss[loss=0.3491, simple_loss=0.395, pruned_loss=0.1516, over 4276384.53 frames. ], batch size: 702, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 02:11:06,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=79800.0, ans=0.125 2023-06-18 02:11:23,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=79860.0, ans=0.015 2023-06-18 02:11:49,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.863e+02 4.396e+02 5.432e+02 9.138e+02, threshold=8.792e+02, percent-clipped=4.0 2023-06-18 02:12:38,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=79980.0, ans=0.0 2023-06-18 02:13:14,161 INFO [train.py:996] (2/4) Epoch 1, batch 13350, loss[loss=0.3281, simple_loss=0.4289, pruned_loss=0.1137, over 19662.00 frames. ], tot_loss[loss=0.3564, simple_loss=0.4018, pruned_loss=0.1555, over 4273349.29 frames. ], batch size: 702, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 02:13:24,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=80100.0, ans=0.0 2023-06-18 02:13:46,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=80160.0, ans=0.125 2023-06-18 02:14:21,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80220.0, ans=0.1 2023-06-18 02:15:00,812 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:15:43,138 INFO [train.py:996] (2/4) Epoch 1, batch 13400, loss[loss=0.3817, simple_loss=0.4156, pruned_loss=0.174, over 21857.00 frames. ], tot_loss[loss=0.3604, simple_loss=0.4039, pruned_loss=0.1585, over 4273823.79 frames. ], batch size: 371, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 02:16:15,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=80460.0, ans=0.04949747468305833 2023-06-18 02:16:20,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=80460.0, ans=0.125 2023-06-18 02:16:33,155 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.555e+02 3.828e+02 4.349e+02 5.476e+02 1.150e+03, threshold=8.698e+02, percent-clipped=4.0 2023-06-18 02:16:35,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. 
limit=15.0 2023-06-18 02:17:19,150 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-18 02:17:42,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=12.0 2023-06-18 02:17:55,769 INFO [train.py:996] (2/4) Epoch 1, batch 13450, loss[loss=0.2758, simple_loss=0.2965, pruned_loss=0.1275, over 16710.00 frames. ], tot_loss[loss=0.3635, simple_loss=0.404, pruned_loss=0.1615, over 4271779.29 frames. ], batch size: 60, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 02:18:35,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=80820.0, ans=0.2 2023-06-18 02:18:50,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=80880.0, ans=0.0 2023-06-18 02:18:55,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.93 vs. limit=10.0 2023-06-18 02:19:06,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=80940.0, ans=0.125 2023-06-18 02:19:33,604 INFO [train.py:996] (2/4) Epoch 1, batch 13500, loss[loss=0.3172, simple_loss=0.3671, pruned_loss=0.1336, over 21880.00 frames. ], tot_loss[loss=0.3512, simple_loss=0.3914, pruned_loss=0.1555, over 4265946.24 frames. ], batch size: 317, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 02:19:59,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=81060.0, ans=0.5 2023-06-18 02:20:30,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81060.0, ans=0.1 2023-06-18 02:20:46,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=81120.0, ans=0.2 2023-06-18 02:20:49,348 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 3.441e+02 4.151e+02 5.577e+02 1.126e+03, threshold=8.302e+02, percent-clipped=5.0 2023-06-18 02:21:46,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=81240.0, ans=0.125 2023-06-18 02:22:09,585 INFO [train.py:996] (2/4) Epoch 1, batch 13550, loss[loss=0.2944, simple_loss=0.3391, pruned_loss=0.1248, over 21727.00 frames. ], tot_loss[loss=0.3495, simple_loss=0.3937, pruned_loss=0.1527, over 4252488.08 frames. ], batch size: 112, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 02:23:03,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=81420.0, ans=22.5 2023-06-18 02:23:33,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81480.0, ans=0.1 2023-06-18 02:23:34,209 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. 
limit=15.0 2023-06-18 02:23:55,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=81480.0, ans=0.04949747468305833 2023-06-18 02:24:17,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81540.0, ans=0.1 2023-06-18 02:24:25,310 INFO [train.py:996] (2/4) Epoch 1, batch 13600, loss[loss=0.3633, simple_loss=0.4011, pruned_loss=0.1627, over 21899.00 frames. ], tot_loss[loss=0.3518, simple_loss=0.3966, pruned_loss=0.1535, over 4261986.38 frames. ], batch size: 351, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 02:24:34,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-18 02:24:37,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=81600.0, ans=0.125 2023-06-18 02:25:20,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=81720.0, ans=0.0 2023-06-18 02:25:21,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.54 vs. limit=8.0 2023-06-18 02:25:25,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.459e+02 4.202e+02 5.582e+02 9.308e+02, threshold=8.405e+02, percent-clipped=4.0 2023-06-18 02:26:20,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=81840.0, ans=0.125 2023-06-18 02:26:25,969 INFO [train.py:996] (2/4) Epoch 1, batch 13650, loss[loss=0.2802, simple_loss=0.3241, pruned_loss=0.1181, over 21540.00 frames. ], tot_loss[loss=0.3435, simple_loss=0.3894, pruned_loss=0.1488, over 4260783.16 frames. ], batch size: 263, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 02:26:36,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0 2023-06-18 02:27:35,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82020.0, ans=0.1 2023-06-18 02:27:47,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=82080.0, ans=0.125 2023-06-18 02:28:17,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=82140.0, ans=0.0 2023-06-18 02:28:26,805 INFO [train.py:996] (2/4) Epoch 1, batch 13700, loss[loss=0.3353, simple_loss=0.3759, pruned_loss=0.1473, over 21773.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.3823, pruned_loss=0.1476, over 4269082.33 frames. 
], batch size: 316, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 02:28:27,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=82200.0, ans=0.0 2023-06-18 02:29:38,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=82320.0, ans=0.125 2023-06-18 02:29:46,846 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.386e+02 3.308e+02 4.161e+02 5.184e+02 1.059e+03, threshold=8.322e+02, percent-clipped=1.0 2023-06-18 02:30:08,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=82380.0, ans=0.2 2023-06-18 02:30:11,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=82380.0, ans=0.0 2023-06-18 02:30:12,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=82380.0, ans=0.0 2023-06-18 02:30:39,906 INFO [train.py:996] (2/4) Epoch 1, batch 13750, loss[loss=0.3273, simple_loss=0.3934, pruned_loss=0.1306, over 21175.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.3743, pruned_loss=0.1423, over 4264012.70 frames. ], batch size: 548, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 02:30:46,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=82500.0, ans=0.125 2023-06-18 02:31:03,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-18 02:31:55,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=82620.0, ans=0.0 2023-06-18 02:32:31,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=82680.0, ans=0.125 2023-06-18 02:33:08,345 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-18 02:33:17,446 INFO [train.py:996] (2/4) Epoch 1, batch 13800, loss[loss=0.3765, simple_loss=0.455, pruned_loss=0.1489, over 21827.00 frames. ], tot_loss[loss=0.3333, simple_loss=0.3815, pruned_loss=0.1426, over 4264395.76 frames. ], batch size: 371, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 02:34:10,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=82860.0, ans=0.125 2023-06-18 02:34:28,552 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.529e+02 4.559e+02 5.712e+02 9.830e+02, threshold=9.119e+02, percent-clipped=1.0 2023-06-18 02:34:52,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=82980.0, ans=0.2 2023-06-18 02:35:13,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=83040.0, ans=0.0 2023-06-18 02:35:49,012 INFO [train.py:996] (2/4) Epoch 1, batch 13850, loss[loss=0.2818, simple_loss=0.3397, pruned_loss=0.1119, over 21867.00 frames. ], tot_loss[loss=0.3412, simple_loss=0.3909, pruned_loss=0.1457, over 4267183.26 frames. 
], batch size: 107, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 02:35:54,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=83100.0, ans=0.0 2023-06-18 02:36:11,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=83160.0, ans=0.125 2023-06-18 02:37:13,436 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:37:16,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=83340.0, ans=0.125 2023-06-18 02:37:30,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83340.0, ans=0.1 2023-06-18 02:37:52,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=83340.0, ans=0.125 2023-06-18 02:37:59,150 INFO [train.py:996] (2/4) Epoch 1, batch 13900, loss[loss=0.4108, simple_loss=0.4288, pruned_loss=0.1964, over 21742.00 frames. ], tot_loss[loss=0.3506, simple_loss=0.3968, pruned_loss=0.1522, over 4268108.62 frames. ], batch size: 441, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 02:38:06,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=83400.0, ans=0.125 2023-06-18 02:38:39,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=83520.0, ans=0.0 2023-06-18 02:38:46,648 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.741e+02 3.529e+02 4.116e+02 5.133e+02 1.052e+03, threshold=8.231e+02, percent-clipped=3.0 2023-06-18 02:39:03,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=83580.0, ans=0.125 2023-06-18 02:39:43,494 INFO [train.py:996] (2/4) Epoch 1, batch 13950, loss[loss=0.3659, simple_loss=0.3998, pruned_loss=0.1661, over 21851.00 frames. ], tot_loss[loss=0.3534, simple_loss=0.3971, pruned_loss=0.1549, over 4283841.23 frames. ], batch size: 332, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 02:40:27,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=83760.0, ans=0.0 2023-06-18 02:40:44,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=83820.0, ans=0.125 2023-06-18 02:41:03,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=83880.0, ans=0.5 2023-06-18 02:41:48,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=83940.0, ans=0.125 2023-06-18 02:41:58,059 INFO [train.py:996] (2/4) Epoch 1, batch 14000, loss[loss=0.2349, simple_loss=0.2962, pruned_loss=0.08675, over 21675.00 frames. ], tot_loss[loss=0.3443, simple_loss=0.3898, pruned_loss=0.1494, over 4279589.37 frames. 
], batch size: 263, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 02:42:41,327 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 3.528e+02 4.428e+02 5.328e+02 9.334e+02, threshold=8.857e+02, percent-clipped=2.0 2023-06-18 02:43:34,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-18 02:43:43,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=84240.0, ans=0.2 2023-06-18 02:43:49,528 INFO [train.py:996] (2/4) Epoch 1, batch 14050, loss[loss=0.312, simple_loss=0.3464, pruned_loss=0.1388, over 21167.00 frames. ], tot_loss[loss=0.3352, simple_loss=0.3839, pruned_loss=0.1432, over 4282636.25 frames. ], batch size: 548, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 02:43:57,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84300.0, ans=0.1 2023-06-18 02:44:49,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=84480.0, ans=0.0 2023-06-18 02:45:36,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.76 vs. limit=22.5 2023-06-18 02:45:44,406 INFO [train.py:996] (2/4) Epoch 1, batch 14100, loss[loss=0.4036, simple_loss=0.424, pruned_loss=0.1916, over 21327.00 frames. ], tot_loss[loss=0.3323, simple_loss=0.3785, pruned_loss=0.1431, over 4283021.36 frames. ], batch size: 549, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 02:45:46,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=84600.0, ans=0.125 2023-06-18 02:45:49,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84600.0, ans=0.1 2023-06-18 02:45:53,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=84600.0, ans=0.04949747468305833 2023-06-18 02:46:06,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=84660.0, ans=0.0 2023-06-18 02:46:47,137 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.207e+02 3.554e+02 4.205e+02 5.494e+02 8.066e+02, threshold=8.411e+02, percent-clipped=0.0 2023-06-18 02:46:53,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-18 02:47:36,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=84840.0, ans=0.0 2023-06-18 02:47:39,389 INFO [train.py:996] (2/4) Epoch 1, batch 14150, loss[loss=0.3353, simple_loss=0.3907, pruned_loss=0.1399, over 21199.00 frames. ], tot_loss[loss=0.3362, simple_loss=0.3828, pruned_loss=0.1448, over 4279402.20 frames. ], batch size: 143, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 02:47:39,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=84900.0, ans=0.125 2023-06-18 02:48:09,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=20.23 vs. 
limit=15.0 2023-06-18 02:48:38,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=85080.0, ans=0.5 2023-06-18 02:49:32,056 INFO [train.py:996] (2/4) Epoch 1, batch 14200, loss[loss=0.2742, simple_loss=0.3391, pruned_loss=0.1046, over 21487.00 frames. ], tot_loss[loss=0.3331, simple_loss=0.3799, pruned_loss=0.1431, over 4269455.60 frames. ], batch size: 194, lr: 3.08e-02, grad_scale: 16.0 2023-06-18 02:49:32,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=85200.0, ans=0.0 2023-06-18 02:49:32,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-18 02:49:50,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85260.0, ans=0.1 2023-06-18 02:50:05,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=85260.0, ans=0.125 2023-06-18 02:50:06,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=85260.0, ans=0.0 2023-06-18 02:50:26,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-18 02:50:28,370 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 3.158e+02 3.617e+02 4.610e+02 1.061e+03, threshold=7.235e+02, percent-clipped=3.0 2023-06-18 02:51:14,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=85440.0, ans=0.0 2023-06-18 02:51:14,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=85440.0, ans=0.125 2023-06-18 02:51:20,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=85440.0, ans=0.125 2023-06-18 02:51:23,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.47 vs. limit=22.5 2023-06-18 02:51:31,282 INFO [train.py:996] (2/4) Epoch 1, batch 14250, loss[loss=0.2959, simple_loss=0.357, pruned_loss=0.1174, over 21638.00 frames. ], tot_loss[loss=0.328, simple_loss=0.3742, pruned_loss=0.1409, over 4253939.66 frames. ], batch size: 391, lr: 3.07e-02, grad_scale: 16.0 2023-06-18 02:51:49,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85500.0, ans=0.1 2023-06-18 02:51:55,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2023-06-18 02:52:14,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=85620.0, ans=0.2 2023-06-18 02:52:22,569 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-18 02:53:19,419 INFO [train.py:996] (2/4) Epoch 1, batch 14300, loss[loss=0.3022, simple_loss=0.3242, pruned_loss=0.1402, over 20710.00 frames. 
], tot_loss[loss=0.3283, simple_loss=0.3758, pruned_loss=0.1405, over 4245178.20 frames. ], batch size: 607, lr: 3.07e-02, grad_scale: 16.0 2023-06-18 02:53:43,900 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:54:23,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.210e+02 4.117e+02 5.928e+02 1.578e+03, threshold=8.234e+02, percent-clipped=19.0 2023-06-18 02:54:45,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85980.0, ans=0.1 2023-06-18 02:54:50,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=15.0 2023-06-18 02:55:08,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=86040.0, ans=0.125 2023-06-18 02:55:09,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-18 02:55:14,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=86040.0, ans=0.2 2023-06-18 02:55:30,411 INFO [train.py:996] (2/4) Epoch 1, batch 14350, loss[loss=0.3087, simple_loss=0.3575, pruned_loss=0.13, over 21833.00 frames. ], tot_loss[loss=0.331, simple_loss=0.3795, pruned_loss=0.1412, over 4244818.75 frames. ], batch size: 282, lr: 3.06e-02, grad_scale: 16.0 2023-06-18 02:55:40,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=86100.0, ans=0.125 2023-06-18 02:56:41,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=86280.0, ans=0.2 2023-06-18 02:57:12,504 INFO [train.py:996] (2/4) Epoch 1, batch 14400, loss[loss=0.3241, simple_loss=0.3658, pruned_loss=0.1412, over 21315.00 frames. ], tot_loss[loss=0.3317, simple_loss=0.378, pruned_loss=0.1427, over 4251201.60 frames. ], batch size: 176, lr: 3.06e-02, grad_scale: 32.0 2023-06-18 02:57:53,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=86520.0, ans=0.0 2023-06-18 02:57:53,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=86520.0, ans=0.0 2023-06-18 02:57:56,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.630e+02 4.233e+02 4.966e+02 8.953e+02, threshold=8.465e+02, percent-clipped=1.0 2023-06-18 02:58:56,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=86640.0, ans=0.1 2023-06-18 02:59:04,506 INFO [train.py:996] (2/4) Epoch 1, batch 14450, loss[loss=0.367, simple_loss=0.3741, pruned_loss=0.18, over 21589.00 frames. ], tot_loss[loss=0.329, simple_loss=0.3721, pruned_loss=0.1429, over 4256658.16 frames. 
], batch size: 508, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 02:59:04,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=86700.0, ans=0.125 2023-06-18 02:59:20,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=86760.0, ans=0.0 2023-06-18 03:00:32,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=86940.0, ans=15.0 2023-06-18 03:00:34,011 INFO [train.py:996] (2/4) Epoch 1, batch 14500, loss[loss=0.3168, simple_loss=0.3712, pruned_loss=0.1312, over 21790.00 frames. ], tot_loss[loss=0.3264, simple_loss=0.3695, pruned_loss=0.1417, over 4270096.54 frames. ], batch size: 371, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 03:00:46,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=87000.0, ans=0.125 2023-06-18 03:01:09,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=87060.0, ans=0.125 2023-06-18 03:01:23,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 3.289e+02 4.039e+02 4.928e+02 1.223e+03, threshold=8.079e+02, percent-clipped=4.0 2023-06-18 03:02:40,217 INFO [train.py:996] (2/4) Epoch 1, batch 14550, loss[loss=0.4959, simple_loss=0.4881, pruned_loss=0.2519, over 21307.00 frames. ], tot_loss[loss=0.3359, simple_loss=0.3784, pruned_loss=0.1467, over 4267563.67 frames. ], batch size: 507, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 03:02:43,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-18 03:02:45,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=87300.0, ans=0.125 2023-06-18 03:02:46,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=87300.0, ans=0.2 2023-06-18 03:02:49,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-18 03:04:33,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=87540.0, ans=0.125 2023-06-18 03:04:37,687 INFO [train.py:996] (2/4) Epoch 1, batch 14600, loss[loss=0.3481, simple_loss=0.4126, pruned_loss=0.1418, over 21699.00 frames. ], tot_loss[loss=0.3469, simple_loss=0.3884, pruned_loss=0.1527, over 4270628.56 frames. ], batch size: 263, lr: 3.04e-02, grad_scale: 32.0 2023-06-18 03:05:37,680 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.259e+02 3.686e+02 4.597e+02 5.767e+02 8.268e+02, threshold=9.195e+02, percent-clipped=3.0 2023-06-18 03:05:52,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=87780.0, ans=0.125 2023-06-18 03:06:25,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-18 03:06:31,183 INFO [train.py:996] (2/4) Epoch 1, batch 14650, loss[loss=0.3032, simple_loss=0.375, pruned_loss=0.1157, over 21837.00 frames. 
], tot_loss[loss=0.3451, simple_loss=0.3884, pruned_loss=0.1509, over 4259795.45 frames. ], batch size: 371, lr: 3.04e-02, grad_scale: 32.0 2023-06-18 03:06:46,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=87960.0, ans=0.125 2023-06-18 03:06:48,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=87960.0, ans=0.0 2023-06-18 03:07:17,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2023-06-18 03:07:31,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.83 vs. limit=6.0 2023-06-18 03:07:38,314 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:07:57,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=88140.0, ans=0.125 2023-06-18 03:08:05,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=88140.0, ans=0.0 2023-06-18 03:08:23,522 INFO [train.py:996] (2/4) Epoch 1, batch 14700, loss[loss=0.3805, simple_loss=0.444, pruned_loss=0.1585, over 21671.00 frames. ], tot_loss[loss=0.3321, simple_loss=0.3803, pruned_loss=0.1419, over 4258696.76 frames. ], batch size: 441, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 03:09:23,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88320.0, ans=0.1 2023-06-18 03:09:25,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=88320.0, ans=0.0 2023-06-18 03:09:37,815 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 3.037e+02 3.906e+02 4.878e+02 1.107e+03, threshold=7.811e+02, percent-clipped=1.0 2023-06-18 03:09:41,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88320.0, ans=0.1 2023-06-18 03:10:14,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=88440.0, ans=0.125 2023-06-18 03:10:20,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=88440.0, ans=0.125 2023-06-18 03:10:24,996 INFO [train.py:996] (2/4) Epoch 1, batch 14750, loss[loss=0.3933, simple_loss=0.4316, pruned_loss=0.1775, over 21269.00 frames. ], tot_loss[loss=0.3416, simple_loss=0.3891, pruned_loss=0.1471, over 4269994.30 frames. ], batch size: 548, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 03:11:55,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.66 vs. 
limit=15.0 2023-06-18 03:12:30,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=88740.0, ans=0.125 2023-06-18 03:12:45,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=88740.0, ans=0.5 2023-06-18 03:12:47,660 INFO [train.py:996] (2/4) Epoch 1, batch 14800, loss[loss=0.3655, simple_loss=0.3985, pruned_loss=0.1663, over 20042.00 frames. ], tot_loss[loss=0.354, simple_loss=0.4009, pruned_loss=0.1536, over 4267183.55 frames. ], batch size: 702, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 03:13:22,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-06-18 03:13:40,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.94 vs. limit=15.0 2023-06-18 03:13:46,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.441e+02 3.653e+02 4.322e+02 5.427e+02 8.202e+02, threshold=8.644e+02, percent-clipped=2.0 2023-06-18 03:13:57,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-18 03:13:59,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-18 03:14:00,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=88980.0, ans=0.125 2023-06-18 03:14:37,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-18 03:14:47,073 INFO [train.py:996] (2/4) Epoch 1, batch 14850, loss[loss=0.3648, simple_loss=0.4081, pruned_loss=0.1607, over 21728.00 frames. ], tot_loss[loss=0.3494, simple_loss=0.3935, pruned_loss=0.1527, over 4270129.48 frames. ], batch size: 332, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 03:14:47,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89100.0, ans=0.1 2023-06-18 03:14:51,904 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=8.004e-02 2023-06-18 03:14:59,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=89100.0, ans=0.04949747468305833 2023-06-18 03:15:00,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=89100.0, ans=0.09899494936611666 2023-06-18 03:16:13,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-18 03:16:33,013 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.61 vs. 
limit=22.5 2023-06-18 03:16:43,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=89340.0, ans=0.125 2023-06-18 03:16:51,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89340.0, ans=0.1 2023-06-18 03:16:56,801 INFO [train.py:996] (2/4) Epoch 1, batch 14900, loss[loss=0.3514, simple_loss=0.393, pruned_loss=0.1549, over 21586.00 frames. ], tot_loss[loss=0.3536, simple_loss=0.3973, pruned_loss=0.155, over 4278224.74 frames. ], batch size: 230, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 03:17:52,388 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.509e+02 4.418e+02 5.707e+02 9.536e+02, threshold=8.836e+02, percent-clipped=5.0 2023-06-18 03:18:10,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=89580.0, ans=0.125 2023-06-18 03:18:26,692 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.73 vs. limit=15.0 2023-06-18 03:19:03,617 INFO [train.py:996] (2/4) Epoch 1, batch 14950, loss[loss=0.3194, simple_loss=0.3382, pruned_loss=0.1502, over 20685.00 frames. ], tot_loss[loss=0.3539, simple_loss=0.399, pruned_loss=0.1544, over 4271289.77 frames. ], batch size: 607, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 03:19:39,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=89760.0, ans=0.125 2023-06-18 03:20:44,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=89880.0, ans=0.125 2023-06-18 03:20:48,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.18 vs. limit=15.0 2023-06-18 03:20:54,062 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-18 03:21:07,188 INFO [train.py:996] (2/4) Epoch 1, batch 15000, loss[loss=0.3213, simple_loss=0.3656, pruned_loss=0.1384, over 21497.00 frames. ], tot_loss[loss=0.3574, simple_loss=0.4018, pruned_loss=0.1565, over 4276611.55 frames. ], batch size: 194, lr: 3.01e-02, grad_scale: 16.0 2023-06-18 03:21:07,188 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 03:21:55,990 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3047, simple_loss=0.3953, pruned_loss=0.107, over 1796401.00 frames. 
2023-06-18 03:21:55,992 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 03:21:56,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=90000.0, ans=0.0 2023-06-18 03:22:01,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=90000.0, ans=0.0 2023-06-18 03:22:05,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=90000.0, ans=0.0 2023-06-18 03:22:08,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=90000.0, ans=0.125 2023-06-18 03:22:48,012 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.680e+02 4.733e+02 5.350e+02 8.092e+02, threshold=9.466e+02, percent-clipped=0.0 2023-06-18 03:23:51,358 INFO [train.py:996] (2/4) Epoch 1, batch 15050, loss[loss=0.3967, simple_loss=0.461, pruned_loss=0.1662, over 21256.00 frames. ], tot_loss[loss=0.3589, simple_loss=0.4024, pruned_loss=0.1577, over 4279628.36 frames. ], batch size: 548, lr: 3.01e-02, grad_scale: 16.0 2023-06-18 03:23:53,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=90300.0, ans=0.0 2023-06-18 03:24:17,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=90300.0, ans=0.0 2023-06-18 03:25:41,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=90480.0, ans=0.0 2023-06-18 03:26:00,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=90540.0, ans=0.125 2023-06-18 03:26:03,914 INFO [train.py:996] (2/4) Epoch 1, batch 15100, loss[loss=0.3786, simple_loss=0.4139, pruned_loss=0.1716, over 21908.00 frames. ], tot_loss[loss=0.3603, simple_loss=0.4048, pruned_loss=0.158, over 4278509.84 frames. ], batch size: 316, lr: 3.00e-02, grad_scale: 16.0 2023-06-18 03:26:04,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=90600.0, ans=0.0 2023-06-18 03:26:54,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.766e+02 4.918e+02 6.265e+02 9.726e+02, threshold=9.837e+02, percent-clipped=1.0 2023-06-18 03:26:59,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=15.0 2023-06-18 03:27:06,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=90780.0, ans=0.125 2023-06-18 03:27:34,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.56 vs. limit=15.0 2023-06-18 03:27:56,496 INFO [train.py:996] (2/4) Epoch 1, batch 15150, loss[loss=0.3084, simple_loss=0.3438, pruned_loss=0.1366, over 21730.00 frames. ], tot_loss[loss=0.3591, simple_loss=0.4015, pruned_loss=0.1583, over 4279362.51 frames. ], batch size: 124, lr: 3.00e-02, grad_scale: 16.0 2023-06-18 03:28:27,628 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. 
limit=15.0 2023-06-18 03:28:40,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-18 03:29:10,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=91080.0, ans=0.0 2023-06-18 03:29:32,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=91080.0, ans=0.1 2023-06-18 03:30:03,092 INFO [train.py:996] (2/4) Epoch 1, batch 15200, loss[loss=0.2527, simple_loss=0.3026, pruned_loss=0.1014, over 21802.00 frames. ], tot_loss[loss=0.3483, simple_loss=0.3924, pruned_loss=0.1521, over 4279756.02 frames. ], batch size: 112, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 03:30:57,180 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 3.451e+02 4.223e+02 5.136e+02 8.420e+02, threshold=8.446e+02, percent-clipped=0.0 2023-06-18 03:31:01,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=91320.0, ans=0.0 2023-06-18 03:31:18,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=91380.0, ans=0.025 2023-06-18 03:31:52,039 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:31:59,049 INFO [train.py:996] (2/4) Epoch 1, batch 15250, loss[loss=0.3253, simple_loss=0.3611, pruned_loss=0.1447, over 21417.00 frames. ], tot_loss[loss=0.3426, simple_loss=0.3866, pruned_loss=0.1493, over 4275124.60 frames. ], batch size: 194, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 03:32:34,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=91620.0, ans=0.125 2023-06-18 03:34:02,103 INFO [train.py:996] (2/4) Epoch 1, batch 15300, loss[loss=0.2938, simple_loss=0.3092, pruned_loss=0.1392, over 20743.00 frames. ], tot_loss[loss=0.3473, simple_loss=0.389, pruned_loss=0.1528, over 4269428.66 frames. ], batch size: 609, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 03:35:10,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.694e+02 4.588e+02 5.460e+02 1.157e+03, threshold=9.176e+02, percent-clipped=1.0 2023-06-18 03:35:21,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=91980.0, ans=0.125 2023-06-18 03:35:40,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=91980.0, ans=0.0 2023-06-18 03:36:16,124 INFO [train.py:996] (2/4) Epoch 1, batch 15350, loss[loss=0.3798, simple_loss=0.411, pruned_loss=0.1742, over 21890.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3939, pruned_loss=0.1553, over 4269874.44 frames. ], batch size: 371, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 03:36:56,051 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. limit=10.0 2023-06-18 03:37:38,047 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:38:13,657 INFO [train.py:996] (2/4) Epoch 1, batch 15400, loss[loss=0.3075, simple_loss=0.3548, pruned_loss=0.1301, over 21702.00 frames. 
], tot_loss[loss=0.3482, simple_loss=0.3926, pruned_loss=0.1519, over 4275154.40 frames. ], batch size: 230, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 03:38:15,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92400.0, ans=0.1 2023-06-18 03:38:39,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92460.0, ans=0.1 2023-06-18 03:38:45,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=92460.0, ans=0.0 2023-06-18 03:39:06,568 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.578e+02 4.472e+02 5.706e+02 1.204e+03, threshold=8.945e+02, percent-clipped=6.0 2023-06-18 03:39:28,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=92580.0, ans=0.2 2023-06-18 03:39:43,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=92580.0, ans=0.0 2023-06-18 03:40:00,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.28 vs. limit=5.0 2023-06-18 03:40:04,061 INFO [train.py:996] (2/4) Epoch 1, batch 15450, loss[loss=0.3455, simple_loss=0.4055, pruned_loss=0.1427, over 21824.00 frames. ], tot_loss[loss=0.3441, simple_loss=0.3889, pruned_loss=0.1497, over 4273489.37 frames. ], batch size: 351, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 03:40:24,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=92760.0, ans=0.125 2023-06-18 03:40:28,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.61 vs. limit=15.0 2023-06-18 03:40:47,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-18 03:40:48,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=92820.0, ans=0.2 2023-06-18 03:40:58,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=92880.0, ans=0.2 2023-06-18 03:41:34,863 INFO [train.py:996] (2/4) Epoch 1, batch 15500, loss[loss=0.485, simple_loss=0.4867, pruned_loss=0.2417, over 21345.00 frames. ], tot_loss[loss=0.3441, simple_loss=0.3908, pruned_loss=0.1487, over 4248680.58 frames. ], batch size: 507, lr: 2.97e-02, grad_scale: 16.0 2023-06-18 03:42:14,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=93120.0, ans=0.95 2023-06-18 03:42:38,905 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.182e+02 4.008e+02 5.609e+02 1.059e+03, threshold=8.016e+02, percent-clipped=3.0 2023-06-18 03:43:41,836 INFO [train.py:996] (2/4) Epoch 1, batch 15550, loss[loss=0.2862, simple_loss=0.3335, pruned_loss=0.1195, over 21359.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3885, pruned_loss=0.1445, over 4258329.17 frames. 
], batch size: 131, lr: 2.97e-02, grad_scale: 16.0 2023-06-18 03:43:43,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=93300.0, ans=0.125 2023-06-18 03:44:41,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=93480.0, ans=0.125 2023-06-18 03:44:56,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=93540.0, ans=0.125 2023-06-18 03:45:21,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=93540.0, ans=10.0 2023-06-18 03:45:25,076 INFO [train.py:996] (2/4) Epoch 1, batch 15600, loss[loss=0.3977, simple_loss=0.4129, pruned_loss=0.1913, over 21367.00 frames. ], tot_loss[loss=0.332, simple_loss=0.3807, pruned_loss=0.1417, over 4259440.12 frames. ], batch size: 508, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 03:46:40,240 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 3.054e+02 4.153e+02 4.865e+02 7.806e+02, threshold=8.306e+02, percent-clipped=0.0 2023-06-18 03:46:46,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=93780.0, ans=0.125 2023-06-18 03:47:00,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=93840.0, ans=6.0 2023-06-18 03:47:07,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=93840.0, ans=0.2 2023-06-18 03:47:31,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=93840.0, ans=0.125 2023-06-18 03:47:35,560 INFO [train.py:996] (2/4) Epoch 1, batch 15650, loss[loss=0.3186, simple_loss=0.3601, pruned_loss=0.1386, over 21762.00 frames. ], tot_loss[loss=0.331, simple_loss=0.3798, pruned_loss=0.1411, over 4253669.14 frames. ], batch size: 112, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 03:48:31,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-18 03:49:33,342 INFO [train.py:996] (2/4) Epoch 1, batch 15700, loss[loss=0.307, simple_loss=0.3493, pruned_loss=0.1324, over 20777.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.3749, pruned_loss=0.1401, over 4249587.24 frames. ], batch size: 608, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 03:49:42,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=94200.0, ans=0.125 2023-06-18 03:50:46,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=94320.0, ans=0.0 2023-06-18 03:50:47,009 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.466e+02 4.238e+02 5.232e+02 9.594e+02, threshold=8.475e+02, percent-clipped=2.0 2023-06-18 03:50:59,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=94380.0, ans=0.125 2023-06-18 03:50:59,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.71 vs. 
limit=15.0 2023-06-18 03:51:03,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=94440.0, ans=0.125 2023-06-18 03:51:24,783 INFO [train.py:996] (2/4) Epoch 1, batch 15750, loss[loss=0.294, simple_loss=0.3443, pruned_loss=0.1219, over 21688.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.369, pruned_loss=0.1391, over 4254441.55 frames. ], batch size: 316, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 03:51:45,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94560.0, ans=0.1 2023-06-18 03:51:53,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=94560.0, ans=0.0 2023-06-18 03:52:44,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-18 03:53:13,090 INFO [train.py:996] (2/4) Epoch 1, batch 15800, loss[loss=0.3468, simple_loss=0.3789, pruned_loss=0.1573, over 21302.00 frames. ], tot_loss[loss=0.3206, simple_loss=0.3636, pruned_loss=0.1388, over 4257334.41 frames. ], batch size: 159, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 03:53:13,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=94800.0, ans=0.0 2023-06-18 03:53:50,111 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=15.0 2023-06-18 03:54:24,046 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 3.502e+02 4.028e+02 5.074e+02 9.979e+02, threshold=8.057e+02, percent-clipped=5.0 2023-06-18 03:55:22,918 INFO [train.py:996] (2/4) Epoch 1, batch 15850, loss[loss=0.4178, simple_loss=0.4394, pruned_loss=0.1981, over 21742.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3657, pruned_loss=0.1412, over 4259030.93 frames. ], batch size: 441, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 03:55:25,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.95 vs. limit=15.0 2023-06-18 03:56:39,741 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.006e-01 2023-06-18 03:56:57,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=95280.0, ans=0.125 2023-06-18 03:57:01,075 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-06-18 03:57:41,515 INFO [train.py:996] (2/4) Epoch 1, batch 15900, loss[loss=0.2704, simple_loss=0.3264, pruned_loss=0.1071, over 21873.00 frames. ], tot_loss[loss=0.3252, simple_loss=0.3657, pruned_loss=0.1423, over 4264452.23 frames. ], batch size: 118, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 03:58:26,570 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:58:39,808 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 3.625e+02 4.216e+02 5.103e+02 7.976e+02, threshold=8.431e+02, percent-clipped=0.0 2023-06-18 03:59:18,959 INFO [train.py:996] (2/4) Epoch 1, batch 15950, loss[loss=0.3216, simple_loss=0.3735, pruned_loss=0.1349, over 21578.00 frames. 
], tot_loss[loss=0.3227, simple_loss=0.3661, pruned_loss=0.1396, over 4250790.71 frames. ], batch size: 263, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 03:59:22,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=95700.0, ans=0.09899494936611666 2023-06-18 03:59:28,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=95700.0, ans=0.125 2023-06-18 04:00:07,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=95820.0, ans=0.2 2023-06-18 04:00:11,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95820.0, ans=0.1 2023-06-18 04:00:19,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=95880.0, ans=0.0 2023-06-18 04:00:33,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=95940.0, ans=0.125 2023-06-18 04:00:55,800 INFO [train.py:996] (2/4) Epoch 1, batch 16000, loss[loss=0.3152, simple_loss=0.3874, pruned_loss=0.1215, over 21665.00 frames. ], tot_loss[loss=0.3202, simple_loss=0.3675, pruned_loss=0.1365, over 4264423.09 frames. ], batch size: 389, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 04:01:13,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=96060.0, ans=0.2 2023-06-18 04:01:24,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96060.0, ans=0.1 2023-06-18 04:01:49,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.964e+02 3.608e+02 4.425e+02 8.344e+02, threshold=7.217e+02, percent-clipped=0.0 2023-06-18 04:01:54,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=96180.0, ans=0.0 2023-06-18 04:02:01,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=21.78 vs. limit=15.0 2023-06-18 04:02:22,048 INFO [train.py:996] (2/4) Epoch 1, batch 16050, loss[loss=0.3287, simple_loss=0.3963, pruned_loss=0.1305, over 21384.00 frames. ], tot_loss[loss=0.3219, simple_loss=0.3729, pruned_loss=0.1355, over 4260150.55 frames. ], batch size: 211, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 04:02:32,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2023-06-18 04:02:48,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-18 04:03:29,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=96420.0, ans=0.125 2023-06-18 04:03:36,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. 
limit=6.0 2023-06-18 04:03:47,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=96480.0, ans=0.1 2023-06-18 04:03:47,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=96480.0, ans=0.125 2023-06-18 04:04:02,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-18 04:04:15,093 INFO [train.py:996] (2/4) Epoch 1, batch 16100, loss[loss=0.3605, simple_loss=0.3986, pruned_loss=0.1612, over 21871.00 frames. ], tot_loss[loss=0.3278, simple_loss=0.3792, pruned_loss=0.1382, over 4262952.28 frames. ], batch size: 124, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 04:04:41,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96600.0, ans=0.1 2023-06-18 04:04:44,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=96600.0, ans=0.125 2023-06-18 04:05:38,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 3.663e+02 4.522e+02 5.171e+02 8.163e+02, threshold=9.044e+02, percent-clipped=3.0 2023-06-18 04:05:58,220 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:06:01,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=96840.0, ans=0.125 2023-06-18 04:06:07,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96840.0, ans=0.1 2023-06-18 04:06:11,405 INFO [train.py:996] (2/4) Epoch 1, batch 16150, loss[loss=0.3744, simple_loss=0.4635, pruned_loss=0.1426, over 20853.00 frames. ], tot_loss[loss=0.3317, simple_loss=0.3806, pruned_loss=0.1414, over 4273755.94 frames. ], batch size: 608, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 04:07:22,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-18 04:07:43,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=97080.0, ans=0.0 2023-06-18 04:08:07,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.24 vs. limit=6.0 2023-06-18 04:08:24,765 INFO [train.py:996] (2/4) Epoch 1, batch 16200, loss[loss=0.3837, simple_loss=0.4214, pruned_loss=0.173, over 21338.00 frames. ], tot_loss[loss=0.3359, simple_loss=0.3845, pruned_loss=0.1436, over 4279250.65 frames. ], batch size: 548, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 04:08:28,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=97200.0, ans=0.125 2023-06-18 04:09:41,323 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 3.302e+02 3.926e+02 4.944e+02 8.800e+02, threshold=7.851e+02, percent-clipped=0.0 2023-06-18 04:10:11,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=15.0 2023-06-18 04:10:12,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=97440.0, ans=0.125 2023-06-18 04:10:17,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=15.0 2023-06-18 04:10:19,003 INFO [train.py:996] (2/4) Epoch 1, batch 16250, loss[loss=0.2631, simple_loss=0.3136, pruned_loss=0.1063, over 21443.00 frames. ], tot_loss[loss=0.3331, simple_loss=0.3813, pruned_loss=0.1425, over 4277352.57 frames. ], batch size: 194, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 04:10:57,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=97560.0, ans=10.0 2023-06-18 04:11:16,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=97560.0, ans=0.125 2023-06-18 04:11:30,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=97620.0, ans=0.125 2023-06-18 04:11:53,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=97740.0, ans=0.125 2023-06-18 04:12:18,163 INFO [train.py:996] (2/4) Epoch 1, batch 16300, loss[loss=0.2656, simple_loss=0.3289, pruned_loss=0.1012, over 21257.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3726, pruned_loss=0.1368, over 4263867.77 frames. ], batch size: 549, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 04:13:11,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=97920.0, ans=0.125 2023-06-18 04:13:19,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.969e+02 3.585e+02 4.834e+02 8.506e+02, threshold=7.169e+02, percent-clipped=1.0 2023-06-18 04:13:59,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=98040.0, ans=0.0 2023-06-18 04:14:02,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-18 04:14:26,819 INFO [train.py:996] (2/4) Epoch 1, batch 16350, loss[loss=0.2788, simple_loss=0.3331, pruned_loss=0.1123, over 21605.00 frames. ], tot_loss[loss=0.3222, simple_loss=0.3716, pruned_loss=0.1364, over 4258077.76 frames. ], batch size: 263, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 04:14:51,008 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-06-18 04:15:50,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=98280.0, ans=0.0 2023-06-18 04:16:11,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=98340.0, ans=0.0 2023-06-18 04:16:37,186 INFO [train.py:996] (2/4) Epoch 1, batch 16400, loss[loss=0.3712, simple_loss=0.4226, pruned_loss=0.1599, over 19944.00 frames. ], tot_loss[loss=0.3286, simple_loss=0.3775, pruned_loss=0.1398, over 4260906.62 frames. 
], batch size: 703, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 04:16:53,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=98400.0, ans=0.125 2023-06-18 04:17:06,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=98460.0, ans=0.125 2023-06-18 04:17:44,958 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.149e+02 3.641e+02 4.935e+02 8.709e+02, threshold=7.281e+02, percent-clipped=2.0 2023-06-18 04:18:08,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=98580.0, ans=0.125 2023-06-18 04:18:09,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=98580.0, ans=0.125 2023-06-18 04:18:59,110 INFO [train.py:996] (2/4) Epoch 1, batch 16450, loss[loss=0.3058, simple_loss=0.3559, pruned_loss=0.1279, over 21870.00 frames. ], tot_loss[loss=0.3301, simple_loss=0.3781, pruned_loss=0.1411, over 4263510.88 frames. ], batch size: 282, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 04:18:59,690 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:19:35,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=98760.0, ans=0.0 2023-06-18 04:19:37,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=98760.0, ans=0.125 2023-06-18 04:19:52,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.39 vs. limit=22.5 2023-06-18 04:19:58,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-18 04:20:51,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=98940.0, ans=0.0 2023-06-18 04:20:54,864 INFO [train.py:996] (2/4) Epoch 1, batch 16500, loss[loss=0.2988, simple_loss=0.3505, pruned_loss=0.1236, over 21716.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.3779, pruned_loss=0.141, over 4268905.45 frames. ], batch size: 298, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 04:21:16,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=99000.0, ans=0.07 2023-06-18 04:22:25,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.246e+02 4.122e+02 4.972e+02 8.514e+02, threshold=8.244e+02, percent-clipped=2.0 2023-06-18 04:23:29,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=99240.0, ans=0.125 2023-06-18 04:23:43,059 INFO [train.py:996] (2/4) Epoch 1, batch 16550, loss[loss=0.3188, simple_loss=0.3605, pruned_loss=0.1386, over 21798.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3742, pruned_loss=0.1373, over 4272291.92 frames. 
], batch size: 124, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 04:25:36,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=99540.0, ans=0.125 2023-06-18 04:25:39,586 INFO [train.py:996] (2/4) Epoch 1, batch 16600, loss[loss=0.3846, simple_loss=0.4449, pruned_loss=0.1621, over 21274.00 frames. ], tot_loss[loss=0.3367, simple_loss=0.3866, pruned_loss=0.1434, over 4276730.19 frames. ], batch size: 548, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 04:25:49,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=99600.0, ans=0.04949747468305833 2023-06-18 04:26:24,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-18 04:26:45,520 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.572e+02 4.785e+02 5.858e+02 1.029e+03, threshold=9.570e+02, percent-clipped=5.0 2023-06-18 04:27:02,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0 2023-06-18 04:27:34,748 INFO [train.py:996] (2/4) Epoch 1, batch 16650, loss[loss=0.3435, simple_loss=0.3982, pruned_loss=0.1444, over 21799.00 frames. ], tot_loss[loss=0.3456, simple_loss=0.3974, pruned_loss=0.1469, over 4281198.79 frames. ], batch size: 247, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 04:28:53,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=100020.0, ans=0.07 2023-06-18 04:28:53,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-18 04:29:00,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100020.0, ans=0.1 2023-06-18 04:29:06,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=100020.0, ans=0.125 2023-06-18 04:29:06,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=100020.0, ans=0.125 2023-06-18 04:29:26,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100080.0, ans=0.1 2023-06-18 04:29:51,770 INFO [train.py:996] (2/4) Epoch 1, batch 16700, loss[loss=0.2608, simple_loss=0.3068, pruned_loss=0.1073, over 21248.00 frames. ], tot_loss[loss=0.3459, simple_loss=0.3976, pruned_loss=0.1471, over 4274173.77 frames. ], batch size: 176, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 04:29:55,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=100200.0, ans=0.2 2023-06-18 04:29:57,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.15 vs. 
limit=15.0 2023-06-18 04:30:38,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=100260.0, ans=0.0 2023-06-18 04:31:12,793 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.672e+02 3.595e+02 4.250e+02 5.162e+02 1.058e+03, threshold=8.499e+02, percent-clipped=1.0 2023-06-18 04:31:40,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0 2023-06-18 04:32:38,940 INFO [train.py:996] (2/4) Epoch 1, batch 16750, loss[loss=0.4373, simple_loss=0.4761, pruned_loss=0.1992, over 21492.00 frames. ], tot_loss[loss=0.3513, simple_loss=0.4007, pruned_loss=0.151, over 4274427.62 frames. ], batch size: 471, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 04:33:01,592 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-06-18 04:33:20,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=100560.0, ans=10.0 2023-06-18 04:34:17,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-18 04:34:28,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=100680.0, ans=0.0 2023-06-18 04:34:46,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-18 04:35:01,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100740.0, ans=0.1 2023-06-18 04:35:07,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=100740.0, ans=0.2 2023-06-18 04:35:18,788 INFO [train.py:996] (2/4) Epoch 1, batch 16800, loss[loss=0.3261, simple_loss=0.361, pruned_loss=0.1456, over 21333.00 frames. ], tot_loss[loss=0.3529, simple_loss=0.4046, pruned_loss=0.1506, over 4269884.99 frames. ], batch size: 159, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 04:35:45,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=100860.0, ans=0.125 2023-06-18 04:35:53,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=100920.0, ans=0.04949747468305833 2023-06-18 04:36:05,521 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.666e+02 3.547e+02 4.070e+02 4.993e+02 8.656e+02, threshold=8.140e+02, percent-clipped=1.0 2023-06-18 04:36:22,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=100980.0, ans=0.0 2023-06-18 04:36:28,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=100980.0, ans=0.125 2023-06-18 04:36:38,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=101040.0, ans=0.125 2023-06-18 04:36:46,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. 
limit=12.0 2023-06-18 04:37:02,096 INFO [train.py:996] (2/4) Epoch 1, batch 16850, loss[loss=0.3199, simple_loss=0.3585, pruned_loss=0.1407, over 21648.00 frames. ], tot_loss[loss=0.3515, simple_loss=0.4012, pruned_loss=0.1509, over 4275610.19 frames. ], batch size: 230, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 04:39:12,671 INFO [train.py:996] (2/4) Epoch 1, batch 16900, loss[loss=0.3846, simple_loss=0.3995, pruned_loss=0.1849, over 21485.00 frames. ], tot_loss[loss=0.345, simple_loss=0.3934, pruned_loss=0.1483, over 4286086.59 frames. ], batch size: 508, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 04:39:27,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=101460.0, ans=0.2 2023-06-18 04:39:34,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=101460.0, ans=0.0 2023-06-18 04:40:16,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.290e+02 3.976e+02 5.128e+02 6.971e+02, threshold=7.952e+02, percent-clipped=0.0 2023-06-18 04:40:21,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-18 04:41:08,117 INFO [train.py:996] (2/4) Epoch 1, batch 16950, loss[loss=0.3181, simple_loss=0.3594, pruned_loss=0.1384, over 21860.00 frames. ], tot_loss[loss=0.3374, simple_loss=0.384, pruned_loss=0.1453, over 4275116.28 frames. ], batch size: 371, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 04:42:14,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-18 04:43:28,696 INFO [train.py:996] (2/4) Epoch 1, batch 17000, loss[loss=0.3318, simple_loss=0.3795, pruned_loss=0.1421, over 21840.00 frames. ], tot_loss[loss=0.3355, simple_loss=0.3803, pruned_loss=0.1453, over 4285604.11 frames. ], batch size: 107, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 04:44:38,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=102120.0, ans=0.025 2023-06-18 04:44:39,325 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 3.298e+02 3.924e+02 4.876e+02 1.271e+03, threshold=7.848e+02, percent-clipped=1.0 2023-06-18 04:45:43,361 INFO [train.py:996] (2/4) Epoch 1, batch 17050, loss[loss=0.3521, simple_loss=0.3995, pruned_loss=0.1524, over 21402.00 frames. ], tot_loss[loss=0.3438, simple_loss=0.3888, pruned_loss=0.1494, over 4294529.49 frames. ], batch size: 131, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 04:45:50,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=102300.0, ans=0.05 2023-06-18 04:46:36,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=102420.0, ans=0.0 2023-06-18 04:47:40,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102540.0, ans=0.1 2023-06-18 04:47:43,112 INFO [train.py:996] (2/4) Epoch 1, batch 17100, loss[loss=0.3221, simple_loss=0.3652, pruned_loss=0.1395, over 21662.00 frames. ], tot_loss[loss=0.3428, simple_loss=0.3878, pruned_loss=0.1488, over 4292378.89 frames. 
], batch size: 230, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 04:47:50,969 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2023-06-18 04:49:04,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102720.0, ans=0.1 2023-06-18 04:49:06,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.244e+02 3.266e+02 3.852e+02 4.929e+02 1.111e+03, threshold=7.703e+02, percent-clipped=4.0 2023-06-18 04:49:22,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102780.0, ans=0.1 2023-06-18 04:49:47,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-18 04:49:53,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=102840.0, ans=0.125 2023-06-18 04:49:55,671 INFO [train.py:996] (2/4) Epoch 1, batch 17150, loss[loss=0.3215, simple_loss=0.3569, pruned_loss=0.143, over 21853.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3819, pruned_loss=0.1466, over 4298718.57 frames. ], batch size: 351, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 04:50:00,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=102900.0, ans=0.0 2023-06-18 04:51:49,915 INFO [train.py:996] (2/4) Epoch 1, batch 17200, loss[loss=0.4093, simple_loss=0.4335, pruned_loss=0.1925, over 21416.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3817, pruned_loss=0.1468, over 4300154.48 frames. ], batch size: 471, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 04:51:55,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=103200.0, ans=0.0 2023-06-18 04:52:21,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=103260.0, ans=0.0 2023-06-18 04:53:06,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 3.334e+02 4.026e+02 5.110e+02 1.056e+03, threshold=8.051e+02, percent-clipped=6.0 2023-06-18 04:53:29,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=12.0 2023-06-18 04:54:01,778 INFO [train.py:996] (2/4) Epoch 1, batch 17250, loss[loss=0.3549, simple_loss=0.4103, pruned_loss=0.1498, over 21473.00 frames. ], tot_loss[loss=0.3437, simple_loss=0.3876, pruned_loss=0.1499, over 4291893.05 frames. ], batch size: 211, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 04:54:58,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.21 vs. limit=5.0 2023-06-18 04:55:46,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=103740.0, ans=0.125 2023-06-18 04:55:58,637 INFO [train.py:996] (2/4) Epoch 1, batch 17300, loss[loss=0.3993, simple_loss=0.4339, pruned_loss=0.1823, over 21696.00 frames. ], tot_loss[loss=0.3527, simple_loss=0.3969, pruned_loss=0.1543, over 4290003.99 frames. 
], batch size: 351, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 04:57:28,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.957e+02 4.761e+02 5.817e+02 1.132e+03, threshold=9.521e+02, percent-clipped=5.0 2023-06-18 04:58:37,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-18 04:58:39,401 INFO [train.py:996] (2/4) Epoch 1, batch 17350, loss[loss=0.3166, simple_loss=0.376, pruned_loss=0.1286, over 21789.00 frames. ], tot_loss[loss=0.3535, simple_loss=0.3986, pruned_loss=0.1542, over 4281785.44 frames. ], batch size: 282, lr: 2.83e-02, grad_scale: 32.0 2023-06-18 04:58:45,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104100.0, ans=0.1 2023-06-18 04:58:55,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=104100.0, ans=0.125 2023-06-18 04:58:55,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=104100.0, ans=0.125 2023-06-18 04:59:12,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=104160.0, ans=0.125 2023-06-18 04:59:15,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=104160.0, ans=0.125 2023-06-18 04:59:23,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=104220.0, ans=0.2 2023-06-18 05:00:34,643 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-18 05:00:36,712 INFO [train.py:996] (2/4) Epoch 1, batch 17400, loss[loss=0.2644, simple_loss=0.3064, pruned_loss=0.1112, over 21279.00 frames. ], tot_loss[loss=0.3451, simple_loss=0.3932, pruned_loss=0.1485, over 4280010.10 frames. ], batch size: 159, lr: 2.83e-02, grad_scale: 32.0 2023-06-18 05:00:59,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-18 05:02:13,465 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.697e+02 4.904e+02 6.158e+02 8.783e+02, threshold=9.807e+02, percent-clipped=0.0 2023-06-18 05:02:13,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=104520.0, ans=0.2 2023-06-18 05:02:48,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104640.0, ans=0.1 2023-06-18 05:03:10,193 INFO [train.py:996] (2/4) Epoch 1, batch 17450, loss[loss=0.2697, simple_loss=0.3537, pruned_loss=0.09289, over 21566.00 frames. ], tot_loss[loss=0.336, simple_loss=0.386, pruned_loss=0.143, over 4273399.99 frames. ], batch size: 389, lr: 2.83e-02, grad_scale: 32.0 2023-06-18 05:03:34,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104700.0, ans=0.1 2023-06-18 05:03:34,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. 
limit=15.0 2023-06-18 05:04:01,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=104760.0, ans=0.2 2023-06-18 05:04:21,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=104820.0, ans=0.0 2023-06-18 05:04:31,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104880.0, ans=0.1 2023-06-18 05:05:28,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=105000.0, ans=0.125 2023-06-18 05:05:29,162 INFO [train.py:996] (2/4) Epoch 1, batch 17500, loss[loss=0.3274, simple_loss=0.3623, pruned_loss=0.1462, over 21535.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.38, pruned_loss=0.1399, over 4277868.87 frames. ], batch size: 548, lr: 2.82e-02, grad_scale: 64.0 2023-06-18 05:05:52,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=105060.0, ans=0.2 2023-06-18 05:06:23,239 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.676e+02 3.184e+02 3.972e+02 6.733e+02, threshold=6.368e+02, percent-clipped=0.0 2023-06-18 05:06:25,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=105180.0, ans=0.0 2023-06-18 05:06:47,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=105180.0, ans=0.125 2023-06-18 05:06:59,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=105240.0, ans=0.0 2023-06-18 05:07:06,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=105240.0, ans=0.125 2023-06-18 05:07:10,589 INFO [train.py:996] (2/4) Epoch 1, batch 17550, loss[loss=0.2765, simple_loss=0.3549, pruned_loss=0.099, over 21622.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.3796, pruned_loss=0.1379, over 4264045.66 frames. ], batch size: 230, lr: 2.82e-02, grad_scale: 32.0 2023-06-18 05:08:13,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=105420.0, ans=0.0 2023-06-18 05:08:39,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=105480.0, ans=0.125 2023-06-18 05:09:08,737 INFO [train.py:996] (2/4) Epoch 1, batch 17600, loss[loss=0.34, simple_loss=0.4039, pruned_loss=0.1381, over 21801.00 frames. ], tot_loss[loss=0.3316, simple_loss=0.3832, pruned_loss=0.14, over 4265126.48 frames. ], batch size: 124, lr: 2.82e-02, grad_scale: 32.0 2023-06-18 05:09:19,835 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.58 vs. 
limit=22.5 2023-06-18 05:10:18,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105720.0, ans=0.1 2023-06-18 05:10:23,867 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 3.173e+02 4.298e+02 5.550e+02 1.174e+03, threshold=8.596e+02, percent-clipped=15.0 2023-06-18 05:10:39,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-18 05:11:05,585 INFO [train.py:996] (2/4) Epoch 1, batch 17650, loss[loss=0.2536, simple_loss=0.311, pruned_loss=0.09812, over 21760.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.38, pruned_loss=0.1395, over 4264829.38 frames. ], batch size: 282, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 05:11:09,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=105900.0, ans=0.0 2023-06-18 05:11:13,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=105900.0, ans=0.125 2023-06-18 05:11:27,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=105960.0, ans=0.04949747468305833 2023-06-18 05:11:38,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=105960.0, ans=0.0 2023-06-18 05:11:52,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=106020.0, ans=0.125 2023-06-18 05:11:55,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=106020.0, ans=0.125 2023-06-18 05:11:57,343 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:12:08,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. limit=6.0 2023-06-18 05:12:09,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106080.0, ans=0.1 2023-06-18 05:12:15,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=106080.0, ans=0.125 2023-06-18 05:12:42,037 INFO [train.py:996] (2/4) Epoch 1, batch 17700, loss[loss=0.2174, simple_loss=0.2488, pruned_loss=0.09297, over 16608.00 frames. ], tot_loss[loss=0.3208, simple_loss=0.3722, pruned_loss=0.1347, over 4252697.22 frames. ], batch size: 61, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 05:12:43,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=106200.0, ans=0.125 2023-06-18 05:13:02,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=106200.0, ans=0.04949747468305833 2023-06-18 05:13:21,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. 
limit=15.0 2023-06-18 05:14:06,835 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 3.144e+02 3.611e+02 4.728e+02 8.023e+02, threshold=7.222e+02, percent-clipped=0.0 2023-06-18 05:14:44,163 INFO [train.py:996] (2/4) Epoch 1, batch 17750, loss[loss=0.365, simple_loss=0.4133, pruned_loss=0.1583, over 21480.00 frames. ], tot_loss[loss=0.3339, simple_loss=0.3836, pruned_loss=0.1421, over 4258645.20 frames. ], batch size: 112, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 05:15:37,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=106560.0, ans=0.2 2023-06-18 05:15:51,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=106620.0, ans=0.0 2023-06-18 05:16:00,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=106620.0, ans=0.125 2023-06-18 05:16:35,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=106740.0, ans=0.09899494936611666 2023-06-18 05:16:41,540 INFO [train.py:996] (2/4) Epoch 1, batch 17800, loss[loss=0.2757, simple_loss=0.3403, pruned_loss=0.1056, over 21637.00 frames. ], tot_loss[loss=0.3322, simple_loss=0.3825, pruned_loss=0.1409, over 4262167.88 frames. ], batch size: 263, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 05:17:04,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=106800.0, ans=0.0 2023-06-18 05:17:13,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=106860.0, ans=0.0 2023-06-18 05:17:48,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-18 05:18:10,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.33 vs. limit=22.5 2023-06-18 05:18:11,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.015e+02 3.874e+02 4.676e+02 8.507e+02, threshold=7.748e+02, percent-clipped=1.0 2023-06-18 05:18:23,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=15.0 2023-06-18 05:18:53,252 INFO [train.py:996] (2/4) Epoch 1, batch 17850, loss[loss=0.3916, simple_loss=0.4291, pruned_loss=0.177, over 21698.00 frames. ], tot_loss[loss=0.3335, simple_loss=0.3842, pruned_loss=0.1414, over 4261592.47 frames. ], batch size: 351, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 05:19:40,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.64 vs. 
limit=15.0 2023-06-18 05:20:09,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=107220.0, ans=0.125 2023-06-18 05:20:31,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=107280.0, ans=0.125 2023-06-18 05:20:53,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=107340.0, ans=0.025 2023-06-18 05:21:24,299 INFO [train.py:996] (2/4) Epoch 1, batch 17900, loss[loss=0.3189, simple_loss=0.3849, pruned_loss=0.1264, over 21223.00 frames. ], tot_loss[loss=0.3396, simple_loss=0.3899, pruned_loss=0.1447, over 4267456.73 frames. ], batch size: 176, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 05:21:33,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107400.0, ans=0.1 2023-06-18 05:21:55,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=107460.0, ans=10.0 2023-06-18 05:21:57,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0 2023-06-18 05:22:11,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=107520.0, ans=0.2 2023-06-18 05:22:43,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 3.224e+02 3.725e+02 5.130e+02 9.496e+02, threshold=7.451e+02, percent-clipped=4.0 2023-06-18 05:23:12,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=107580.0, ans=0.2 2023-06-18 05:23:21,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107640.0, ans=0.1 2023-06-18 05:23:50,221 INFO [train.py:996] (2/4) Epoch 1, batch 17950, loss[loss=0.2778, simple_loss=0.3496, pruned_loss=0.103, over 21762.00 frames. ], tot_loss[loss=0.3344, simple_loss=0.3888, pruned_loss=0.14, over 4270945.83 frames. ], batch size: 332, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 05:24:26,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=107760.0, ans=0.125 2023-06-18 05:25:07,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=107880.0, ans=0.125 2023-06-18 05:25:30,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=107940.0, ans=0.125 2023-06-18 05:25:58,834 INFO [train.py:996] (2/4) Epoch 1, batch 18000, loss[loss=0.3339, simple_loss=0.3643, pruned_loss=0.1518, over 21752.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3801, pruned_loss=0.1379, over 4268906.48 frames. ], batch size: 371, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 05:25:58,835 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 05:26:54,359 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3106, simple_loss=0.4066, pruned_loss=0.1073, over 1796401.00 frames. 
2023-06-18 05:26:54,364 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 05:27:05,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.24 vs. limit=15.0 2023-06-18 05:27:10,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=108060.0, ans=0.2 2023-06-18 05:27:12,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=108060.0, ans=0.2 2023-06-18 05:27:47,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.259e+02 3.858e+02 4.507e+02 8.062e+02, threshold=7.716e+02, percent-clipped=1.0 2023-06-18 05:27:52,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-18 05:28:30,788 INFO [train.py:996] (2/4) Epoch 1, batch 18050, loss[loss=0.3132, simple_loss=0.3542, pruned_loss=0.1361, over 21654.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3735, pruned_loss=0.1364, over 4257540.53 frames. ], batch size: 298, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 05:29:43,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108420.0, ans=0.1 2023-06-18 05:29:48,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=108480.0, ans=0.125 2023-06-18 05:30:27,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=108540.0, ans=0.0 2023-06-18 05:30:52,496 INFO [train.py:996] (2/4) Epoch 1, batch 18100, loss[loss=0.302, simple_loss=0.3765, pruned_loss=0.1137, over 21433.00 frames. ], tot_loss[loss=0.3316, simple_loss=0.3809, pruned_loss=0.1411, over 4252194.95 frames. ], batch size: 131, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 05:30:52,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=108600.0, ans=0.5 2023-06-18 05:32:17,146 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.257e+02 4.345e+02 5.048e+02 8.084e+02, threshold=8.690e+02, percent-clipped=1.0 2023-06-18 05:32:27,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=108780.0, ans=0.09899494936611666 2023-06-18 05:32:53,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=108840.0, ans=0.125 2023-06-18 05:32:55,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=108840.0, ans=0.0 2023-06-18 05:33:28,589 INFO [train.py:996] (2/4) Epoch 1, batch 18150, loss[loss=0.2789, simple_loss=0.3238, pruned_loss=0.117, over 21381.00 frames. ], tot_loss[loss=0.3292, simple_loss=0.3805, pruned_loss=0.139, over 4254487.74 frames. ], batch size: 131, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 05:33:46,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. 
limit=12.0 2023-06-18 05:34:05,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=108960.0, ans=0.125 2023-06-18 05:34:19,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-18 05:34:35,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-18 05:34:44,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=109080.0, ans=0.0 2023-06-18 05:34:46,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=109140.0, ans=0.5 2023-06-18 05:35:07,210 INFO [train.py:996] (2/4) Epoch 1, batch 18200, loss[loss=0.2739, simple_loss=0.3263, pruned_loss=0.1107, over 21685.00 frames. ], tot_loss[loss=0.3245, simple_loss=0.3731, pruned_loss=0.138, over 4254266.71 frames. ], batch size: 282, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 05:36:03,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=109320.0, ans=0.0 2023-06-18 05:36:11,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109320.0, ans=0.1 2023-06-18 05:36:12,736 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 3.177e+02 3.790e+02 4.775e+02 7.519e+02, threshold=7.579e+02, percent-clipped=0.0 2023-06-18 05:36:26,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=109380.0, ans=0.2 2023-06-18 05:37:00,794 INFO [train.py:996] (2/4) Epoch 1, batch 18250, loss[loss=0.2746, simple_loss=0.3363, pruned_loss=0.1065, over 21799.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3611, pruned_loss=0.1313, over 4255567.35 frames. ], batch size: 124, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 05:38:45,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=109680.0, ans=0.0 2023-06-18 05:39:24,267 INFO [train.py:996] (2/4) Epoch 1, batch 18300, loss[loss=0.4556, simple_loss=0.4673, pruned_loss=0.2219, over 21654.00 frames. ], tot_loss[loss=0.3134, simple_loss=0.3612, pruned_loss=0.1328, over 4263752.86 frames. ], batch size: 507, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 05:39:38,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=109800.0, ans=0.0 2023-06-18 05:41:05,960 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.323e+02 3.972e+02 4.906e+02 9.934e+02, threshold=7.944e+02, percent-clipped=3.0 2023-06-18 05:41:28,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=110040.0, ans=0.0 2023-06-18 05:41:43,039 INFO [train.py:996] (2/4) Epoch 1, batch 18350, loss[loss=0.2361, simple_loss=0.2974, pruned_loss=0.0874, over 17472.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3675, pruned_loss=0.1327, over 4260783.53 frames. ], batch size: 67, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 05:42:17,606 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.44 vs. 
limit=15.0 2023-06-18 05:43:11,118 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:44:12,473 INFO [train.py:996] (2/4) Epoch 1, batch 18400, loss[loss=0.3292, simple_loss=0.3928, pruned_loss=0.1328, over 21713.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.3645, pruned_loss=0.1321, over 4250320.48 frames. ], batch size: 415, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 05:44:16,072 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.036e-02 2023-06-18 05:44:34,254 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.035e-01 2023-06-18 05:45:14,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=110520.0, ans=0.125 2023-06-18 05:45:16,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-18 05:45:16,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=110520.0, ans=0.125 2023-06-18 05:45:19,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=110520.0, ans=0.035 2023-06-18 05:45:22,722 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.119e+02 3.679e+02 4.893e+02 6.747e+02, threshold=7.358e+02, percent-clipped=0.0 2023-06-18 05:45:44,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=110640.0, ans=0.125 2023-06-18 05:46:30,238 INFO [train.py:996] (2/4) Epoch 1, batch 18450, loss[loss=0.2901, simple_loss=0.3408, pruned_loss=0.1196, over 20800.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3605, pruned_loss=0.1267, over 4243427.32 frames. ], batch size: 609, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 05:46:35,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=110700.0, ans=0.125 2023-06-18 05:47:25,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=110820.0, ans=0.125 2023-06-18 05:48:26,254 INFO [train.py:996] (2/4) Epoch 1, batch 18500, loss[loss=0.2723, simple_loss=0.316, pruned_loss=0.1143, over 21639.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.355, pruned_loss=0.1239, over 4240003.37 frames. ], batch size: 263, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 05:48:41,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0 2023-06-18 05:49:01,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=111120.0, ans=0.125 2023-06-18 05:49:20,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.596e+02 4.284e+02 6.309e+02 9.887e+02, threshold=8.569e+02, percent-clipped=11.0 2023-06-18 05:49:41,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=111240.0, ans=0.2 2023-06-18 05:50:02,713 INFO [train.py:996] (2/4) Epoch 1, batch 18550, loss[loss=0.3064, simple_loss=0.3488, pruned_loss=0.1319, over 21619.00 frames. 
], tot_loss[loss=0.2998, simple_loss=0.3532, pruned_loss=0.1232, over 4228127.75 frames. ], batch size: 332, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 05:50:10,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=111300.0, ans=0.125 2023-06-18 05:50:12,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=12.0 2023-06-18 05:50:13,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=111300.0, ans=0.2 2023-06-18 05:51:17,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-18 05:51:56,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=111540.0, ans=0.125 2023-06-18 05:52:01,519 INFO [train.py:996] (2/4) Epoch 1, batch 18600, loss[loss=0.2746, simple_loss=0.3392, pruned_loss=0.1049, over 21797.00 frames. ], tot_loss[loss=0.303, simple_loss=0.354, pruned_loss=0.126, over 4235754.14 frames. ], batch size: 282, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 05:52:03,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=111600.0, ans=0.2 2023-06-18 05:52:36,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=111660.0, ans=0.2 2023-06-18 05:53:20,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.842e+02 3.411e+02 4.162e+02 7.856e+02, threshold=6.821e+02, percent-clipped=0.0 2023-06-18 05:53:36,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=111780.0, ans=0.0 2023-06-18 05:54:02,505 INFO [train.py:996] (2/4) Epoch 1, batch 18650, loss[loss=0.2788, simple_loss=0.3313, pruned_loss=0.1132, over 21501.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3523, pruned_loss=0.1259, over 4234703.51 frames. ], batch size: 212, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 05:54:35,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=111960.0, ans=0.0 2023-06-18 05:55:22,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=112140.0, ans=10.0 2023-06-18 05:55:22,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=112140.0, ans=0.125 2023-06-18 05:55:40,746 INFO [train.py:996] (2/4) Epoch 1, batch 18700, loss[loss=0.3754, simple_loss=0.3927, pruned_loss=0.179, over 21407.00 frames. ], tot_loss[loss=0.3039, simple_loss=0.3517, pruned_loss=0.1281, over 4237799.96 frames. 
], batch size: 473, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 05:55:56,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=112260.0, ans=0.125 2023-06-18 05:56:45,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=112320.0, ans=0.0 2023-06-18 05:57:01,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 3.044e+02 3.629e+02 4.654e+02 7.971e+02, threshold=7.259e+02, percent-clipped=4.0 2023-06-18 05:57:11,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-18 05:57:50,373 INFO [train.py:996] (2/4) Epoch 1, batch 18750, loss[loss=0.3129, simple_loss=0.3386, pruned_loss=0.1437, over 20266.00 frames. ], tot_loss[loss=0.3079, simple_loss=0.3539, pruned_loss=0.131, over 4249824.07 frames. ], batch size: 703, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 05:58:13,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=112500.0, ans=10.0 2023-06-18 05:59:38,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-18 06:00:27,314 INFO [train.py:996] (2/4) Epoch 1, batch 18800, loss[loss=0.2256, simple_loss=0.2937, pruned_loss=0.07871, over 21459.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3596, pruned_loss=0.1325, over 4256061.67 frames. ], batch size: 194, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 06:00:52,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=112860.0, ans=0.0 2023-06-18 06:01:09,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=112860.0, ans=0.125 2023-06-18 06:02:08,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.633e+02 3.299e+02 4.034e+02 4.833e+02 8.926e+02, threshold=8.067e+02, percent-clipped=4.0 2023-06-18 06:02:11,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=112980.0, ans=0.2 2023-06-18 06:02:15,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=112980.0, ans=0.125 2023-06-18 06:02:33,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=113040.0, ans=0.125 2023-06-18 06:02:49,581 INFO [train.py:996] (2/4) Epoch 1, batch 18850, loss[loss=0.2506, simple_loss=0.3403, pruned_loss=0.08049, over 21769.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3515, pruned_loss=0.1226, over 4257133.86 frames. ], batch size: 391, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 06:03:24,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=113160.0, ans=0.0 2023-06-18 06:04:13,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. 
limit=12.0 2023-06-18 06:04:45,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=113340.0, ans=0.2 2023-06-18 06:04:51,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=113340.0, ans=0.125 2023-06-18 06:04:53,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-18 06:04:56,932 INFO [train.py:996] (2/4) Epoch 1, batch 18900, loss[loss=0.3352, simple_loss=0.3707, pruned_loss=0.1498, over 21894.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3497, pruned_loss=0.1242, over 4253578.14 frames. ], batch size: 351, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 06:06:38,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 3.081e+02 3.640e+02 4.781e+02 9.031e+02, threshold=7.280e+02, percent-clipped=2.0 2023-06-18 06:06:47,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.78 vs. limit=15.0 2023-06-18 06:07:43,553 INFO [train.py:996] (2/4) Epoch 1, batch 18950, loss[loss=0.3224, simple_loss=0.3674, pruned_loss=0.1387, over 21659.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3508, pruned_loss=0.127, over 4262769.03 frames. ], batch size: 263, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 06:07:53,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113700.0, ans=0.1 2023-06-18 06:07:54,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113700.0, ans=0.1 2023-06-18 06:08:52,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=113820.0, ans=0.125 2023-06-18 06:09:12,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=113820.0, ans=0.0 2023-06-18 06:10:03,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=113880.0, ans=0.04949747468305833 2023-06-18 06:10:26,257 INFO [train.py:996] (2/4) Epoch 1, batch 19000, loss[loss=0.3805, simple_loss=0.4246, pruned_loss=0.1682, over 21504.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3612, pruned_loss=0.1296, over 4271560.16 frames. ], batch size: 194, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 06:10:34,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=114000.0, ans=0.125 2023-06-18 06:11:11,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=114120.0, ans=0.0 2023-06-18 06:11:28,485 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.443e+02 4.158e+02 4.977e+02 1.551e+03, threshold=8.315e+02, percent-clipped=7.0 2023-06-18 06:11:37,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114180.0, ans=0.1 2023-06-18 06:12:32,963 INFO [train.py:996] (2/4) Epoch 1, batch 19050, loss[loss=0.4015, simple_loss=0.419, pruned_loss=0.192, over 21715.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3689, pruned_loss=0.1355, over 4272775.67 frames. 
], batch size: 475, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 06:12:53,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=114300.0, ans=0.125 2023-06-18 06:13:41,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=114420.0, ans=0.125 2023-06-18 06:15:15,461 INFO [train.py:996] (2/4) Epoch 1, batch 19100, loss[loss=0.3215, simple_loss=0.3557, pruned_loss=0.1436, over 21178.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.3674, pruned_loss=0.137, over 4268261.36 frames. ], batch size: 608, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 06:16:32,256 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.540e+02 3.357e+02 4.030e+02 5.006e+02 7.474e+02, threshold=8.060e+02, percent-clipped=0.0 2023-06-18 06:17:02,835 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.43 vs. limit=15.0 2023-06-18 06:17:03,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114780.0, ans=0.1 2023-06-18 06:17:06,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=114780.0, ans=0.125 2023-06-18 06:18:01,643 INFO [train.py:996] (2/4) Epoch 1, batch 19150, loss[loss=0.3562, simple_loss=0.4273, pruned_loss=0.1426, over 21866.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3704, pruned_loss=0.1384, over 4273126.21 frames. ], batch size: 317, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 06:18:45,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=114960.0, ans=0.125 2023-06-18 06:18:51,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=115020.0, ans=0.05 2023-06-18 06:19:53,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=115080.0, ans=0.125 2023-06-18 06:20:38,010 INFO [train.py:996] (2/4) Epoch 1, batch 19200, loss[loss=0.3443, simple_loss=0.4214, pruned_loss=0.1336, over 21754.00 frames. ], tot_loss[loss=0.3312, simple_loss=0.3826, pruned_loss=0.1398, over 4278559.76 frames. ], batch size: 332, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 06:20:57,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=115260.0, ans=0.0 2023-06-18 06:21:49,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 3.092e+02 3.837e+02 4.671e+02 8.670e+02, threshold=7.675e+02, percent-clipped=1.0 2023-06-18 06:21:50,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=115380.0, ans=0.125 2023-06-18 06:22:39,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=115440.0, ans=0.04949747468305833 2023-06-18 06:22:54,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=115440.0, ans=0.0 2023-06-18 06:23:03,925 INFO [train.py:996] (2/4) Epoch 1, batch 19250, loss[loss=0.3133, simple_loss=0.4019, pruned_loss=0.1124, over 19766.00 frames. ], tot_loss[loss=0.3202, simple_loss=0.3785, pruned_loss=0.131, over 4268706.10 frames. 
], batch size: 702, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 06:23:43,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115500.0, ans=0.1 2023-06-18 06:23:50,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115560.0, ans=0.1 2023-06-18 06:25:41,827 INFO [train.py:996] (2/4) Epoch 1, batch 19300, loss[loss=0.3835, simple_loss=0.4563, pruned_loss=0.1554, over 19737.00 frames. ], tot_loss[loss=0.3172, simple_loss=0.3748, pruned_loss=0.1298, over 4276626.67 frames. ], batch size: 703, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 06:25:50,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=115800.0, ans=0.125 2023-06-18 06:26:10,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=115800.0, ans=0.2 2023-06-18 06:26:24,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=115860.0, ans=0.0 2023-06-18 06:26:46,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=115860.0, ans=0.0 2023-06-18 06:27:15,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.985e+02 3.697e+02 4.596e+02 6.937e+02, threshold=7.395e+02, percent-clipped=0.0 2023-06-18 06:28:13,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=116040.0, ans=0.1 2023-06-18 06:28:28,633 INFO [train.py:996] (2/4) Epoch 1, batch 19350, loss[loss=0.2517, simple_loss=0.3232, pruned_loss=0.09014, over 21758.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3674, pruned_loss=0.1246, over 4269723.05 frames. ], batch size: 282, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 06:29:51,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=116220.0, ans=0.0 2023-06-18 06:29:55,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116280.0, ans=0.1 2023-06-18 06:29:55,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=116280.0, ans=0.0 2023-06-18 06:30:13,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=22.5 2023-06-18 06:30:25,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=116340.0, ans=0.0 2023-06-18 06:30:25,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=116340.0, ans=0.125 2023-06-18 06:30:59,108 INFO [train.py:996] (2/4) Epoch 1, batch 19400, loss[loss=0.4054, simple_loss=0.4156, pruned_loss=0.1976, over 21718.00 frames. ], tot_loss[loss=0.3062, simple_loss=0.3651, pruned_loss=0.1237, over 4275735.27 frames. 
], batch size: 508, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 06:31:37,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=116460.0, ans=0.04949747468305833 2023-06-18 06:32:21,373 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.419e+02 3.848e+02 4.708e+02 7.710e+02, threshold=7.695e+02, percent-clipped=3.0 2023-06-18 06:32:35,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=116580.0, ans=0.07 2023-06-18 06:32:49,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=116580.0, ans=22.5 2023-06-18 06:33:02,787 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-18 06:33:32,546 INFO [train.py:996] (2/4) Epoch 1, batch 19450, loss[loss=0.2959, simple_loss=0.3291, pruned_loss=0.1313, over 21256.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3628, pruned_loss=0.127, over 4281298.61 frames. ], batch size: 144, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 06:33:40,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.78 vs. limit=22.5 2023-06-18 06:33:41,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=116700.0, ans=0.0 2023-06-18 06:33:57,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=116760.0, ans=0.125 2023-06-18 06:34:05,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=116760.0, ans=0.125 2023-06-18 06:34:43,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=116820.0, ans=0.2 2023-06-18 06:35:49,163 INFO [train.py:996] (2/4) Epoch 1, batch 19500, loss[loss=0.3473, simple_loss=0.3933, pruned_loss=0.1506, over 21562.00 frames. ], tot_loss[loss=0.3088, simple_loss=0.3593, pruned_loss=0.1291, over 4279959.81 frames. ], batch size: 389, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 06:37:30,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.181e+02 3.762e+02 4.814e+02 7.838e+02, threshold=7.523e+02, percent-clipped=1.0 2023-06-18 06:38:00,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=117240.0, ans=0.0 2023-06-18 06:38:40,443 INFO [train.py:996] (2/4) Epoch 1, batch 19550, loss[loss=0.2502, simple_loss=0.2993, pruned_loss=0.1005, over 21388.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3522, pruned_loss=0.1253, over 4262371.92 frames. 
], batch size: 194, lr: 2.69e-02, grad_scale: 64.0 2023-06-18 06:39:00,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=117360.0, ans=0.125 2023-06-18 06:39:24,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117360.0, ans=0.1 2023-06-18 06:39:24,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=117360.0, ans=0.125 2023-06-18 06:39:33,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=117360.0, ans=0.125 2023-06-18 06:39:35,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=117420.0, ans=0.2 2023-06-18 06:39:48,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=117420.0, ans=10.0 2023-06-18 06:41:02,438 INFO [train.py:996] (2/4) Epoch 1, batch 19600, loss[loss=0.3468, simple_loss=0.3829, pruned_loss=0.1553, over 21860.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3558, pruned_loss=0.1276, over 4267808.03 frames. ], batch size: 351, lr: 2.69e-02, grad_scale: 64.0 2023-06-18 06:41:07,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=117600.0, ans=0.0 2023-06-18 06:41:21,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=12.0 2023-06-18 06:42:25,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=117720.0, ans=0.125 2023-06-18 06:42:30,572 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.158e+02 3.823e+02 5.330e+02 9.735e+02, threshold=7.645e+02, percent-clipped=7.0 2023-06-18 06:43:04,498 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 06:43:45,967 INFO [train.py:996] (2/4) Epoch 1, batch 19650, loss[loss=0.3416, simple_loss=0.3797, pruned_loss=0.1517, over 20849.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3638, pruned_loss=0.1348, over 4270614.63 frames. ], batch size: 607, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 06:45:01,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=118020.0, ans=0.125 2023-06-18 06:46:01,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=118080.0, ans=0.125 2023-06-18 06:46:50,133 INFO [train.py:996] (2/4) Epoch 1, batch 19700, loss[loss=0.2719, simple_loss=0.3396, pruned_loss=0.1021, over 21596.00 frames. ], tot_loss[loss=0.3195, simple_loss=0.3673, pruned_loss=0.1358, over 4260382.21 frames. 
], batch size: 263, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 06:47:14,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=118200.0, ans=0.125 2023-06-18 06:47:16,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=118260.0, ans=0.2 2023-06-18 06:47:23,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.79 vs. limit=22.5 2023-06-18 06:47:24,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-18 06:48:30,007 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.207e+02 3.429e+02 4.231e+02 5.435e+02 1.062e+03, threshold=8.463e+02, percent-clipped=10.0 2023-06-18 06:48:30,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=118380.0, ans=0.125 2023-06-18 06:48:34,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=118380.0, ans=0.2 2023-06-18 06:48:45,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=118380.0, ans=0.2 2023-06-18 06:49:32,815 INFO [train.py:996] (2/4) Epoch 1, batch 19750, loss[loss=0.3128, simple_loss=0.3803, pruned_loss=0.1227, over 21393.00 frames. ], tot_loss[loss=0.326, simple_loss=0.3771, pruned_loss=0.1374, over 4256144.41 frames. ], batch size: 176, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 06:50:22,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=118560.0, ans=0.07 2023-06-18 06:51:00,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=118620.0, ans=0.125 2023-06-18 06:51:00,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=118620.0, ans=0.2 2023-06-18 06:51:26,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=118680.0, ans=0.125 2023-06-18 06:51:33,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=118680.0, ans=0.0 2023-06-18 06:51:47,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=118740.0, ans=0.0 2023-06-18 06:51:58,205 INFO [train.py:996] (2/4) Epoch 1, batch 19800, loss[loss=0.223, simple_loss=0.2623, pruned_loss=0.09186, over 21890.00 frames. ], tot_loss[loss=0.3269, simple_loss=0.3765, pruned_loss=0.1387, over 4272292.78 frames. 
], batch size: 98, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 06:53:44,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=118920.0, ans=0.125 2023-06-18 06:53:46,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.372e+02 3.953e+02 4.969e+02 1.016e+03, threshold=7.905e+02, percent-clipped=3.0 2023-06-18 06:53:47,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=118980.0, ans=0.125 2023-06-18 06:53:47,937 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-18 06:54:39,912 INFO [train.py:996] (2/4) Epoch 1, batch 19850, loss[loss=0.2698, simple_loss=0.3408, pruned_loss=0.09942, over 21728.00 frames. ], tot_loss[loss=0.3137, simple_loss=0.3662, pruned_loss=0.1306, over 4256660.93 frames. ], batch size: 351, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 06:54:41,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-18 06:54:49,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=119100.0, ans=0.1 2023-06-18 06:55:39,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=119220.0, ans=0.125 2023-06-18 06:56:15,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=119280.0, ans=0.025 2023-06-18 06:56:17,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=119280.0, ans=0.025 2023-06-18 06:56:49,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=119340.0, ans=0.125 2023-06-18 06:57:16,304 INFO [train.py:996] (2/4) Epoch 1, batch 19900, loss[loss=0.2657, simple_loss=0.3341, pruned_loss=0.09864, over 21170.00 frames. ], tot_loss[loss=0.3117, simple_loss=0.3672, pruned_loss=0.1281, over 4254265.07 frames. ], batch size: 159, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 06:58:10,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=119460.0, ans=0.0 2023-06-18 06:58:40,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 3.179e+02 3.927e+02 4.733e+02 6.841e+02, threshold=7.854e+02, percent-clipped=0.0 2023-06-18 06:58:41,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=119580.0, ans=0.125 2023-06-18 06:59:35,268 INFO [train.py:996] (2/4) Epoch 1, batch 19950, loss[loss=0.3035, simple_loss=0.3409, pruned_loss=0.1331, over 21859.00 frames. ], tot_loss[loss=0.3085, simple_loss=0.3612, pruned_loss=0.1279, over 4262255.28 frames. ], batch size: 98, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 07:00:29,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=15.0 2023-06-18 07:00:42,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.27 vs. limit=6.0 2023-06-18 07:01:29,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=119880.0, ans=0.125 2023-06-18 07:01:35,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=119880.0, ans=0.1 2023-06-18 07:02:16,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=119940.0, ans=0.125 2023-06-18 07:02:22,519 INFO [train.py:996] (2/4) Epoch 1, batch 20000, loss[loss=0.2915, simple_loss=0.3556, pruned_loss=0.1136, over 21669.00 frames. ], tot_loss[loss=0.3094, simple_loss=0.3623, pruned_loss=0.1283, over 4262697.60 frames. ], batch size: 263, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 07:03:04,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.43 vs. limit=15.0 2023-06-18 07:03:43,966 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.407e+02 3.393e+02 3.852e+02 4.728e+02 8.512e+02, threshold=7.705e+02, percent-clipped=1.0 2023-06-18 07:03:57,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=120180.0, ans=0.0 2023-06-18 07:04:59,174 INFO [train.py:996] (2/4) Epoch 1, batch 20050, loss[loss=0.3648, simple_loss=0.3897, pruned_loss=0.17, over 20008.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3658, pruned_loss=0.1328, over 4268635.47 frames. ], batch size: 702, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 07:05:00,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=120300.0, ans=0.2 2023-06-18 07:05:42,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2023-06-18 07:06:01,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=120420.0, ans=0.0 2023-06-18 07:06:03,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=120420.0, ans=0.125 2023-06-18 07:06:28,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=120420.0, ans=0.125 2023-06-18 07:07:05,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=120540.0, ans=0.125 2023-06-18 07:07:08,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.08 vs. limit=22.5 2023-06-18 07:07:32,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=120540.0, ans=0.125 2023-06-18 07:07:43,289 INFO [train.py:996] (2/4) Epoch 1, batch 20100, loss[loss=0.4344, simple_loss=0.4673, pruned_loss=0.2008, over 21598.00 frames. ], tot_loss[loss=0.3212, simple_loss=0.3693, pruned_loss=0.1366, over 4273401.21 frames. 
], batch size: 471, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 07:07:58,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=120600.0, ans=0.2 2023-06-18 07:08:47,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=120660.0, ans=0.0 2023-06-18 07:08:50,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=120720.0, ans=0.5 2023-06-18 07:09:37,473 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.314e+02 3.983e+02 4.752e+02 1.053e+03, threshold=7.965e+02, percent-clipped=3.0 2023-06-18 07:09:43,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=120780.0, ans=0.0 2023-06-18 07:10:06,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=120840.0, ans=0.125 2023-06-18 07:10:18,627 INFO [train.py:996] (2/4) Epoch 1, batch 20150, loss[loss=0.3678, simple_loss=0.4081, pruned_loss=0.1637, over 21322.00 frames. ], tot_loss[loss=0.3313, simple_loss=0.3807, pruned_loss=0.141, over 4274487.67 frames. ], batch size: 159, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 07:10:20,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=120900.0, ans=0.1 2023-06-18 07:12:46,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-18 07:12:49,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=121080.0, ans=0.125 2023-06-18 07:13:15,480 INFO [train.py:996] (2/4) Epoch 1, batch 20200, loss[loss=0.3416, simple_loss=0.4193, pruned_loss=0.132, over 21827.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3864, pruned_loss=0.1445, over 4269408.61 frames. ], batch size: 316, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 07:13:18,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=15.0 2023-06-18 07:15:08,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 3.389e+02 4.045e+02 4.963e+02 8.967e+02, threshold=8.091e+02, percent-clipped=1.0 2023-06-18 07:16:06,301 INFO [train.py:996] (2/4) Epoch 1, batch 20250, loss[loss=0.2944, simple_loss=0.3547, pruned_loss=0.1171, over 21419.00 frames. ], tot_loss[loss=0.3341, simple_loss=0.3854, pruned_loss=0.1414, over 4274188.64 frames. ], batch size: 211, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 07:17:14,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=121620.0, ans=0.0 2023-06-18 07:18:22,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=121740.0, ans=0.0 2023-06-18 07:18:32,717 INFO [train.py:996] (2/4) Epoch 1, batch 20300, loss[loss=0.3225, simple_loss=0.3875, pruned_loss=0.1287, over 21600.00 frames. ], tot_loss[loss=0.3267, simple_loss=0.3809, pruned_loss=0.1362, over 4279206.21 frames. 
], batch size: 389, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 07:18:37,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=121800.0, ans=0.035 2023-06-18 07:19:33,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=121920.0, ans=0.0 2023-06-18 07:19:46,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=121920.0, ans=0.2 2023-06-18 07:19:58,577 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.892e+02 3.299e+02 4.163e+02 6.538e+02, threshold=6.599e+02, percent-clipped=0.0 2023-06-18 07:20:52,731 INFO [train.py:996] (2/4) Epoch 1, batch 20350, loss[loss=0.3384, simple_loss=0.3812, pruned_loss=0.1478, over 21293.00 frames. ], tot_loss[loss=0.3276, simple_loss=0.3818, pruned_loss=0.1367, over 4285216.49 frames. ], batch size: 143, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 07:21:29,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=122160.0, ans=0.05 2023-06-18 07:23:24,534 INFO [train.py:996] (2/4) Epoch 1, batch 20400, loss[loss=0.3007, simple_loss=0.3631, pruned_loss=0.1191, over 21377.00 frames. ], tot_loss[loss=0.3338, simple_loss=0.3861, pruned_loss=0.1407, over 4270018.82 frames. ], batch size: 131, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 07:23:26,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=122400.0, ans=0.0 2023-06-18 07:23:37,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=122400.0, ans=0.125 2023-06-18 07:23:59,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=122400.0, ans=0.2 2023-06-18 07:24:02,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=122460.0, ans=0.0 2023-06-18 07:24:14,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122460.0, ans=0.125 2023-06-18 07:24:21,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=122520.0, ans=0.2 2023-06-18 07:24:36,250 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.452e+02 4.202e+02 5.218e+02 1.418e+03, threshold=8.403e+02, percent-clipped=8.0 2023-06-18 07:25:38,621 INFO [train.py:996] (2/4) Epoch 1, batch 20450, loss[loss=0.3375, simple_loss=0.379, pruned_loss=0.148, over 21857.00 frames. ], tot_loss[loss=0.3382, simple_loss=0.3881, pruned_loss=0.1441, over 4261958.46 frames. 
], batch size: 107, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 07:26:51,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122820.0, ans=0.1 2023-06-18 07:27:04,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=122820.0, ans=0.015 2023-06-18 07:27:04,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122820.0, ans=0.1 2023-06-18 07:27:06,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=122820.0, ans=0.0 2023-06-18 07:27:13,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.57 vs. limit=22.5 2023-06-18 07:28:18,145 INFO [train.py:996] (2/4) Epoch 1, batch 20500, loss[loss=0.2988, simple_loss=0.3485, pruned_loss=0.1246, over 21791.00 frames. ], tot_loss[loss=0.3362, simple_loss=0.3835, pruned_loss=0.1445, over 4263428.82 frames. ], batch size: 332, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 07:28:26,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-18 07:29:54,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 3.481e+02 4.279e+02 5.718e+02 8.765e+02, threshold=8.557e+02, percent-clipped=1.0 2023-06-18 07:29:55,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=123180.0, ans=0.125 2023-06-18 07:30:21,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=15.0 2023-06-18 07:30:36,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123240.0, ans=0.1 2023-06-18 07:30:47,880 INFO [train.py:996] (2/4) Epoch 1, batch 20550, loss[loss=0.276, simple_loss=0.3097, pruned_loss=0.1211, over 21580.00 frames. ], tot_loss[loss=0.3326, simple_loss=0.3781, pruned_loss=0.1436, over 4258514.29 frames. ], batch size: 196, lr: 2.63e-02, grad_scale: 16.0 2023-06-18 07:31:43,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123420.0, ans=0.1 2023-06-18 07:32:19,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=123420.0, ans=0.025 2023-06-18 07:32:51,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.94 vs. limit=6.0 2023-06-18 07:33:33,245 INFO [train.py:996] (2/4) Epoch 1, batch 20600, loss[loss=0.3669, simple_loss=0.4157, pruned_loss=0.159, over 21740.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3781, pruned_loss=0.1395, over 4257647.13 frames. 
], batch size: 441, lr: 2.63e-02, grad_scale: 16.0 2023-06-18 07:34:02,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=123600.0, ans=0.0 2023-06-18 07:34:26,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123660.0, ans=0.1 2023-06-18 07:35:13,350 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 2.900e+02 3.607e+02 4.743e+02 9.151e+02, threshold=7.215e+02, percent-clipped=2.0 2023-06-18 07:36:08,483 INFO [train.py:996] (2/4) Epoch 1, batch 20650, loss[loss=0.2929, simple_loss=0.3407, pruned_loss=0.1226, over 21889.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3747, pruned_loss=0.1405, over 4259283.21 frames. ], batch size: 107, lr: 2.63e-02, grad_scale: 16.0 2023-06-18 07:36:38,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=123960.0, ans=10.0 2023-06-18 07:37:03,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=123960.0, ans=0.0 2023-06-18 07:37:11,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.96 vs. limit=6.0 2023-06-18 07:37:46,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=124080.0, ans=0.0 2023-06-18 07:38:46,608 INFO [train.py:996] (2/4) Epoch 1, batch 20700, loss[loss=0.241, simple_loss=0.3046, pruned_loss=0.08874, over 21432.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3657, pruned_loss=0.1346, over 4243560.64 frames. ], batch size: 194, lr: 2.63e-02, grad_scale: 16.0 2023-06-18 07:38:50,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=124200.0, ans=0.0 2023-06-18 07:39:23,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=124260.0, ans=0.07 2023-06-18 07:40:16,088 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.474e+02 3.916e+02 5.055e+02 8.009e+02, threshold=7.832e+02, percent-clipped=2.0 2023-06-18 07:40:20,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=124380.0, ans=0.2 2023-06-18 07:41:27,272 INFO [train.py:996] (2/4) Epoch 1, batch 20750, loss[loss=0.2966, simple_loss=0.3727, pruned_loss=0.1103, over 21345.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3665, pruned_loss=0.1326, over 4239584.72 frames. ], batch size: 194, lr: 2.62e-02, grad_scale: 16.0 2023-06-18 07:41:35,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=124500.0, ans=0.125 2023-06-18 07:41:39,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.29 vs. 
limit=15.0 2023-06-18 07:41:55,138 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:42:40,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=124620.0, ans=0.5 2023-06-18 07:44:09,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-18 07:44:11,471 INFO [train.py:996] (2/4) Epoch 1, batch 20800, loss[loss=0.318, simple_loss=0.3666, pruned_loss=0.1347, over 21623.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.3697, pruned_loss=0.1341, over 4249254.69 frames. ], batch size: 332, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 07:45:37,095 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.418e+02 4.058e+02 5.141e+02 9.579e+02, threshold=8.117e+02, percent-clipped=5.0 2023-06-18 07:46:35,473 INFO [train.py:996] (2/4) Epoch 1, batch 20850, loss[loss=0.2868, simple_loss=0.3332, pruned_loss=0.1203, over 21485.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3606, pruned_loss=0.1306, over 4257611.00 frames. ], batch size: 194, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 07:47:40,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=125220.0, ans=0.0 2023-06-18 07:48:03,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=125280.0, ans=0.125 2023-06-18 07:49:08,586 INFO [train.py:996] (2/4) Epoch 1, batch 20900, loss[loss=0.2891, simple_loss=0.3371, pruned_loss=0.1206, over 21503.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.361, pruned_loss=0.1327, over 4252568.71 frames. ], batch size: 212, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 07:49:48,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=125520.0, ans=0.125 2023-06-18 07:50:24,785 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.865e+02 3.704e+02 4.826e+02 8.208e+02, threshold=7.408e+02, percent-clipped=1.0 2023-06-18 07:50:42,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=125640.0, ans=0.2 2023-06-18 07:51:11,911 INFO [train.py:996] (2/4) Epoch 1, batch 20950, loss[loss=0.2439, simple_loss=0.2995, pruned_loss=0.09409, over 21189.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3545, pruned_loss=0.1259, over 4238815.00 frames. ], batch size: 143, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 07:52:28,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=125880.0, ans=0.2 2023-06-18 07:52:44,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=125880.0, ans=0.1 2023-06-18 07:53:19,313 INFO [train.py:996] (2/4) Epoch 1, batch 21000, loss[loss=0.3261, simple_loss=0.3562, pruned_loss=0.148, over 21563.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3547, pruned_loss=0.1276, over 4239027.28 frames. ], batch size: 548, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 07:53:19,314 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 07:54:13,248 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3148, simple_loss=0.4061, pruned_loss=0.1118, over 1796401.00 frames. 
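Annotation: the two entries just above ("Computing validation loss" followed by "Epoch 1, validation: loss=0.3148, ..., over 1796401.00 frames.") come from the periodic evaluation pass over the dev set, here at batch 21000. As a rough standalone sketch of what such a pass does (this is not the recipe's train.py code; compute_loss and its return values are assumptions made only for illustration), the reported figures are loss sums normalized by the total number of frames seen:

    import torch

    def compute_validation_loss(model, dev_loader, compute_loss, device):
        # Illustrative only: `compute_loss` is a stand-in for the recipe's
        # loss function and is assumed to return (loss_sum, num_frames)
        # for one batch after moving it to `device`.
        model.eval()
        tot_loss = 0.0
        tot_frames = 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss_sum, num_frames = compute_loss(model, batch, device)
                tot_loss += float(loss_sum)
                tot_frames += float(num_frames)
        model.train()
        # Logged as e.g. "validation: loss=0.3148, ..., over 1796401.00 frames."
        return tot_loss / tot_frames, tot_frames

The running tot_loss[...] figures in the training entries appear to use the same per-frame normalization, which is why every loss value in this log is reported together with a frame count.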
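Annotation: the recurring optim.py entries ("Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...") summarize adaptive gradient-norm clipping. The five quartile values are statistics of recent per-step gradient norms, and the logged threshold matches clipping_scale times the middle quartile (for example 2.0 * 3.629e+02 ≈ 7.259e+02 and 2.0 * 4.034e+02 ≈ 8.067e+02 earlier in this section), with percent-clipped presumably the share of recent steps whose norm exceeded that threshold. The sketch below reproduces only that bookkeeping over a simple history buffer; it is not the optimizer used in this run, and all names in it are made up for illustration:

    import collections
    import torch

    def make_grad_norm_clipper(model, clipping_scale=2.0, history=128):
        # Keep a rolling window of recent gradient norms and clip each new
        # gradient to clipping_scale * median(recent norms).
        norms = collections.deque(maxlen=history)
        clipped = collections.deque(maxlen=history)

        def clip_step():
            params = [p for p in model.parameters() if p.grad is not None]
            norm = torch.norm(
                torch.stack([p.grad.detach().norm(2) for p in params]), 2
            ).item()
            norms.append(norm)
            t = torch.tensor(sorted(norms))
            quartiles = [t.quantile(q).item() for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
            threshold = clipping_scale * quartiles[2]
            is_clipped = norm > threshold
            clipped.append(is_clipped)
            if is_clipped:
                for p in params:
                    p.grad.mul_(threshold / norm)
            percent_clipped = 100.0 * sum(clipped) / len(clipped)
            return quartiles, threshold, percent_clipped

        return clip_step

The grad_scale field reported with each batch (32.0 through most of this section, briefly 64.0 around batches 19550-19600 before dropping back to 32.0) is the fp16 loss-scaling factor; doubling it after a stretch of overflow-free steps and halving it when an overflow is detected is the usual dynamic-loss-scaling behaviour, and the values in the log are consistent with that.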
2023-06-18 07:54:13,253 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 07:54:13,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=126000.0, ans=0.07 2023-06-18 07:55:00,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=126120.0, ans=0.0 2023-06-18 07:55:15,210 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.966e+02 3.592e+02 4.576e+02 7.047e+02, threshold=7.185e+02, percent-clipped=0.0 2023-06-18 07:56:08,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=126240.0, ans=0.09899494936611666 2023-06-18 07:56:20,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=126300.0, ans=0.09899494936611666 2023-06-18 07:56:21,892 INFO [train.py:996] (2/4) Epoch 1, batch 21050, loss[loss=0.3221, simple_loss=0.357, pruned_loss=0.1436, over 21415.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.3541, pruned_loss=0.1287, over 4243404.89 frames. ], batch size: 389, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 07:57:31,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=126480.0, ans=0.025 2023-06-18 07:58:25,920 INFO [train.py:996] (2/4) Epoch 1, batch 21100, loss[loss=0.2728, simple_loss=0.3194, pruned_loss=0.1131, over 21585.00 frames. ], tot_loss[loss=0.3018, simple_loss=0.3493, pruned_loss=0.1271, over 4231563.37 frames. ], batch size: 298, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 07:59:03,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126660.0, ans=0.1 2023-06-18 07:59:17,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=126660.0, ans=0.0 2023-06-18 07:59:50,577 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 3.220e+02 3.817e+02 4.701e+02 7.563e+02, threshold=7.635e+02, percent-clipped=1.0 2023-06-18 08:00:25,958 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-18 08:00:50,249 INFO [train.py:996] (2/4) Epoch 1, batch 21150, loss[loss=0.2782, simple_loss=0.3174, pruned_loss=0.1195, over 21638.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3454, pruned_loss=0.1275, over 4223198.44 frames. ], batch size: 282, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 08:01:40,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=126960.0, ans=0.125 2023-06-18 08:03:11,673 INFO [train.py:996] (2/4) Epoch 1, batch 21200, loss[loss=0.2845, simple_loss=0.3309, pruned_loss=0.119, over 21747.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3413, pruned_loss=0.1254, over 4234115.51 frames. 
], batch size: 371, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 08:03:13,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127200.0, ans=0.1 2023-06-18 08:04:01,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=127260.0, ans=0.0 2023-06-18 08:04:27,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=127320.0, ans=0.0 2023-06-18 08:04:35,296 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 3.266e+02 3.911e+02 4.430e+02 6.717e+02, threshold=7.823e+02, percent-clipped=0.0 2023-06-18 08:05:33,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=127440.0, ans=0.125 2023-06-18 08:05:51,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-18 08:05:57,566 INFO [train.py:996] (2/4) Epoch 1, batch 21250, loss[loss=0.2865, simple_loss=0.3374, pruned_loss=0.1178, over 21383.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3408, pruned_loss=0.126, over 4237859.78 frames. ], batch size: 160, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 08:06:33,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-18 08:08:12,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=127740.0, ans=0.125 2023-06-18 08:08:27,254 INFO [train.py:996] (2/4) Epoch 1, batch 21300, loss[loss=0.3668, simple_loss=0.4054, pruned_loss=0.1641, over 21908.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3469, pruned_loss=0.1292, over 4252627.83 frames. ], batch size: 415, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 08:08:32,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=127800.0, ans=0.125 2023-06-18 08:08:33,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=127800.0, ans=0.0 2023-06-18 08:08:56,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=127800.0, ans=0.125 2023-06-18 08:09:01,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=22.5 2023-06-18 08:09:54,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=127980.0, ans=0.125 2023-06-18 08:09:55,003 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.217e+02 3.809e+02 4.662e+02 7.490e+02, threshold=7.618e+02, percent-clipped=0.0 2023-06-18 08:10:27,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-18 08:10:28,620 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=12.0 2023-06-18 08:10:47,964 INFO [train.py:996] (2/4) Epoch 1, batch 21350, loss[loss=0.2635, simple_loss=0.338, pruned_loss=0.09451, over 21661.00 frames. 
], tot_loss[loss=0.3063, simple_loss=0.3518, pruned_loss=0.1304, over 4262337.26 frames. ], batch size: 247, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 08:11:33,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=128160.0, ans=0.0 2023-06-18 08:11:42,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-18 08:12:06,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=128220.0, ans=0.0 2023-06-18 08:12:50,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=128280.0, ans=0.2 2023-06-18 08:13:16,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=128340.0, ans=0.125 2023-06-18 08:13:32,904 INFO [train.py:996] (2/4) Epoch 1, batch 21400, loss[loss=0.3437, simple_loss=0.3921, pruned_loss=0.1476, over 21754.00 frames. ], tot_loss[loss=0.3092, simple_loss=0.357, pruned_loss=0.1307, over 4260442.66 frames. ], batch size: 332, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 08:13:42,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=128400.0, ans=0.0 2023-06-18 08:15:26,239 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.956e+02 3.569e+02 4.352e+02 6.315e+02, threshold=7.139e+02, percent-clipped=0.0 2023-06-18 08:15:49,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=128640.0, ans=0.125 2023-06-18 08:16:33,233 INFO [train.py:996] (2/4) Epoch 1, batch 21450, loss[loss=0.4055, simple_loss=0.412, pruned_loss=0.1995, over 21792.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3636, pruned_loss=0.1347, over 4267729.50 frames. ], batch size: 507, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 08:16:44,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=128700.0, ans=0.1 2023-06-18 08:16:46,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=128700.0, ans=0.2 2023-06-18 08:16:53,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=128760.0, ans=0.125 2023-06-18 08:17:03,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=128760.0, ans=0.2 2023-06-18 08:17:40,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-18 08:17:56,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=128880.0, ans=0.0 2023-06-18 08:18:59,336 INFO [train.py:996] (2/4) Epoch 1, batch 21500, loss[loss=0.2788, simple_loss=0.3173, pruned_loss=0.1201, over 21576.00 frames. ], tot_loss[loss=0.3172, simple_loss=0.362, pruned_loss=0.1362, over 4264658.93 frames. 
], batch size: 247, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 08:18:59,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=129000.0, ans=0.125 2023-06-18 08:19:29,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=129060.0, ans=0.0 2023-06-18 08:19:39,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=129060.0, ans=15.0 2023-06-18 08:20:30,704 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.387e+02 3.519e+02 4.524e+02 5.451e+02 1.077e+03, threshold=9.047e+02, percent-clipped=9.0 2023-06-18 08:20:43,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=129180.0, ans=0.0 2023-06-18 08:20:49,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=129240.0, ans=0.125 2023-06-18 08:21:28,501 INFO [train.py:996] (2/4) Epoch 1, batch 21550, loss[loss=0.2199, simple_loss=0.276, pruned_loss=0.08183, over 21215.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3513, pruned_loss=0.1307, over 4260882.16 frames. ], batch size: 176, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 08:21:30,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=129300.0, ans=10.0 2023-06-18 08:21:48,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5 2023-06-18 08:22:15,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=129360.0, ans=0.0 2023-06-18 08:23:15,674 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.727e-02 2023-06-18 08:23:17,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=129480.0, ans=0.04949747468305833 2023-06-18 08:23:34,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129540.0, ans=0.1 2023-06-18 08:24:15,445 INFO [train.py:996] (2/4) Epoch 1, batch 21600, loss[loss=0.2699, simple_loss=0.3392, pruned_loss=0.1003, over 21580.00 frames. ], tot_loss[loss=0.3005, simple_loss=0.3458, pruned_loss=0.1277, over 4257327.39 frames. ], batch size: 230, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 08:24:40,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.47 vs. limit=15.0 2023-06-18 08:24:46,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=129660.0, ans=0.125 2023-06-18 08:25:43,523 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.179e+02 3.851e+02 4.633e+02 7.831e+02, threshold=7.701e+02, percent-clipped=0.0 2023-06-18 08:26:40,565 INFO [train.py:996] (2/4) Epoch 1, batch 21650, loss[loss=0.2789, simple_loss=0.3437, pruned_loss=0.1071, over 21786.00 frames. ], tot_loss[loss=0.3023, simple_loss=0.3527, pruned_loss=0.126, over 4256124.21 frames. 
], batch size: 112, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 08:26:54,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=129960.0, ans=0.0 2023-06-18 08:26:58,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=129960.0, ans=0.1 2023-06-18 08:27:05,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=129960.0, ans=0.125 2023-06-18 08:27:30,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=130020.0, ans=0.125 2023-06-18 08:28:16,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=130080.0, ans=0.125 2023-06-18 08:28:26,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=15.0 2023-06-18 08:28:49,754 INFO [train.py:996] (2/4) Epoch 1, batch 21700, loss[loss=0.2837, simple_loss=0.3334, pruned_loss=0.117, over 21645.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3527, pruned_loss=0.1228, over 4258028.00 frames. ], batch size: 298, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 08:29:07,251 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:29:40,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-18 08:30:12,482 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.876e+02 3.492e+02 4.292e+02 6.287e+02, threshold=6.984e+02, percent-clipped=0.0 2023-06-18 08:31:03,499 INFO [train.py:996] (2/4) Epoch 1, batch 21750, loss[loss=0.2991, simple_loss=0.3382, pruned_loss=0.1299, over 21968.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3478, pruned_loss=0.1213, over 4252331.19 frames. ], batch size: 119, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 08:31:05,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=130500.0, ans=0.125 2023-06-18 08:32:52,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=130680.0, ans=0.125 2023-06-18 08:32:57,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=130680.0, ans=0.05 2023-06-18 08:33:24,960 INFO [train.py:996] (2/4) Epoch 1, batch 21800, loss[loss=0.2716, simple_loss=0.2987, pruned_loss=0.1223, over 20653.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3453, pruned_loss=0.123, over 4253601.43 frames. ], batch size: 608, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 08:34:21,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.24 vs. limit=15.0 2023-06-18 08:34:23,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=22.5 2023-06-18 08:34:26,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.64 vs. 
limit=12.0 2023-06-18 08:35:17,747 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 3.102e+02 3.611e+02 4.260e+02 6.998e+02, threshold=7.223e+02, percent-clipped=1.0 2023-06-18 08:35:24,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-18 08:35:28,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=22.5 2023-06-18 08:36:06,439 INFO [train.py:996] (2/4) Epoch 1, batch 21850, loss[loss=0.3177, simple_loss=0.365, pruned_loss=0.1352, over 20746.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3503, pruned_loss=0.1243, over 4261689.10 frames. ], batch size: 609, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 08:36:19,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=131100.0, ans=0.125 2023-06-18 08:38:50,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=131340.0, ans=0.125 2023-06-18 08:38:52,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=131340.0, ans=0.0 2023-06-18 08:38:54,595 INFO [train.py:996] (2/4) Epoch 1, batch 21900, loss[loss=0.3065, simple_loss=0.3403, pruned_loss=0.1363, over 21630.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3526, pruned_loss=0.1263, over 4266648.40 frames. ], batch size: 392, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 08:40:08,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=131520.0, ans=0.125 2023-06-18 08:40:14,106 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.557e+02 4.079e+02 5.086e+02 8.901e+02, threshold=8.158e+02, percent-clipped=3.0 2023-06-18 08:40:14,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=131580.0, ans=0.125 2023-06-18 08:40:21,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=131580.0, ans=0.125 2023-06-18 08:40:43,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=131640.0, ans=0.125 2023-06-18 08:40:45,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=131640.0, ans=0.125 2023-06-18 08:40:58,975 INFO [train.py:996] (2/4) Epoch 1, batch 21950, loss[loss=0.2355, simple_loss=0.2887, pruned_loss=0.09119, over 21757.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3461, pruned_loss=0.1241, over 4275342.76 frames. ], batch size: 118, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 08:41:02,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.30 vs. limit=15.0 2023-06-18 08:41:15,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=131700.0, ans=0.1 2023-06-18 08:43:03,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. 
limit=12.0 2023-06-18 08:43:18,608 INFO [train.py:996] (2/4) Epoch 1, batch 22000, loss[loss=0.2137, simple_loss=0.2814, pruned_loss=0.07297, over 21736.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3414, pruned_loss=0.1217, over 4272885.75 frames. ], batch size: 282, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 08:43:27,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=132000.0, ans=0.0 2023-06-18 08:44:22,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=132060.0, ans=0.0 2023-06-18 08:45:03,451 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 3.001e+02 3.656e+02 5.158e+02 1.119e+03, threshold=7.313e+02, percent-clipped=4.0 2023-06-18 08:45:31,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=132180.0, ans=0.125 2023-06-18 08:46:09,167 INFO [train.py:996] (2/4) Epoch 1, batch 22050, loss[loss=0.2683, simple_loss=0.3275, pruned_loss=0.1046, over 21369.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3471, pruned_loss=0.1236, over 4272144.54 frames. ], batch size: 194, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 08:46:35,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=132300.0, ans=0.025 2023-06-18 08:46:36,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-18 08:47:32,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=132420.0, ans=0.125 2023-06-18 08:47:34,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=132420.0, ans=0.2 2023-06-18 08:47:49,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=132480.0, ans=0.0 2023-06-18 08:48:42,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132540.0, ans=0.1 2023-06-18 08:48:48,912 INFO [train.py:996] (2/4) Epoch 1, batch 22100, loss[loss=0.3406, simple_loss=0.3871, pruned_loss=0.147, over 21252.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3609, pruned_loss=0.1319, over 4271108.67 frames. ], batch size: 176, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 08:49:49,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=132660.0, ans=0.0 2023-06-18 08:50:17,071 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.342e+02 4.040e+02 5.215e+02 7.946e+02, threshold=8.079e+02, percent-clipped=2.0 2023-06-18 08:50:27,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=132780.0, ans=0.04949747468305833 2023-06-18 08:50:35,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=132780.0, ans=0.2 2023-06-18 08:51:25,508 INFO [train.py:996] (2/4) Epoch 1, batch 22150, loss[loss=0.3468, simple_loss=0.3968, pruned_loss=0.1484, over 20653.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3656, pruned_loss=0.134, over 4266698.05 frames. 
], batch size: 607, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 08:51:31,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=132900.0, ans=0.125 2023-06-18 08:51:41,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=132900.0, ans=0.0 2023-06-18 08:52:35,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=133020.0, ans=0.0 2023-06-18 08:53:27,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=133080.0, ans=0.125 2023-06-18 08:53:44,054 INFO [train.py:996] (2/4) Epoch 1, batch 22200, loss[loss=0.3596, simple_loss=0.4003, pruned_loss=0.1595, over 21780.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.368, pruned_loss=0.1356, over 4275926.51 frames. ], batch size: 112, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 08:54:26,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=133260.0, ans=0.0 2023-06-18 08:54:58,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=133260.0, ans=0.0 2023-06-18 08:55:43,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.063e+02 3.788e+02 4.693e+02 1.107e+03, threshold=7.575e+02, percent-clipped=1.0 2023-06-18 08:56:47,425 INFO [train.py:996] (2/4) Epoch 1, batch 22250, loss[loss=0.3352, simple_loss=0.3819, pruned_loss=0.1443, over 21606.00 frames. ], tot_loss[loss=0.3251, simple_loss=0.3749, pruned_loss=0.1377, over 4280284.04 frames. ], batch size: 263, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 08:56:52,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=133500.0, ans=0.125 2023-06-18 08:57:35,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=133620.0, ans=0.125 2023-06-18 08:57:37,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-18 08:58:28,330 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=15.0 2023-06-18 08:59:08,868 INFO [train.py:996] (2/4) Epoch 1, batch 22300, loss[loss=0.3315, simple_loss=0.3744, pruned_loss=0.1443, over 21300.00 frames. ], tot_loss[loss=0.3268, simple_loss=0.3755, pruned_loss=0.1391, over 4281010.80 frames. ], batch size: 143, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 09:00:27,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=133920.0, ans=0.0 2023-06-18 09:00:29,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=133920.0, ans=0.0 2023-06-18 09:00:53,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.587e+02 3.379e+02 4.077e+02 4.905e+02 1.025e+03, threshold=8.153e+02, percent-clipped=2.0 2023-06-18 09:01:28,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=134040.0, ans=0.0 2023-06-18 09:01:32,671 INFO [train.py:996] (2/4) Epoch 1, batch 22350, loss[loss=0.2815, simple_loss=0.3354, pruned_loss=0.1138, over 21495.00 frames. 
], tot_loss[loss=0.3269, simple_loss=0.3741, pruned_loss=0.1398, over 4285355.83 frames. ], batch size: 194, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 09:02:14,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=134160.0, ans=12.0 2023-06-18 09:02:49,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=134220.0, ans=0.125 2023-06-18 09:03:02,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=134220.0, ans=0.2 2023-06-18 09:03:02,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=134220.0, ans=0.125 2023-06-18 09:03:21,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=134280.0, ans=0.2 2023-06-18 09:04:12,522 INFO [train.py:996] (2/4) Epoch 1, batch 22400, loss[loss=0.3337, simple_loss=0.3706, pruned_loss=0.1484, over 21457.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.371, pruned_loss=0.1363, over 4282004.74 frames. ], batch size: 389, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 09:04:19,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=134400.0, ans=0.2 2023-06-18 09:04:20,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134400.0, ans=0.1 2023-06-18 09:04:24,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-06-18 09:04:44,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.17 vs. limit=12.0 2023-06-18 09:04:53,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-18 09:04:59,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=134460.0, ans=0.125 2023-06-18 09:04:59,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2023-06-18 09:05:30,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-18 09:05:44,195 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.212e+02 4.054e+02 4.635e+02 7.635e+02, threshold=8.108e+02, percent-clipped=0.0 2023-06-18 09:06:24,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=134640.0, ans=0.125 2023-06-18 09:06:46,098 INFO [train.py:996] (2/4) Epoch 1, batch 22450, loss[loss=0.2929, simple_loss=0.3372, pruned_loss=0.1243, over 21810.00 frames. ], tot_loss[loss=0.3166, simple_loss=0.3636, pruned_loss=0.1348, over 4278913.34 frames. ], batch size: 372, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 09:08:00,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.27 vs. 
limit=5.0 2023-06-18 09:08:29,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=134880.0, ans=0.0 2023-06-18 09:08:52,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134880.0, ans=0.1 2023-06-18 09:09:37,350 INFO [train.py:996] (2/4) Epoch 1, batch 22500, loss[loss=0.4022, simple_loss=0.4186, pruned_loss=0.1929, over 21357.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.358, pruned_loss=0.1334, over 4279987.19 frames. ], batch size: 507, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 09:10:58,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135120.0, ans=0.1 2023-06-18 09:11:15,145 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.294e+02 4.262e+02 5.582e+02 1.060e+03, threshold=8.524e+02, percent-clipped=5.0 2023-06-18 09:12:15,199 INFO [train.py:996] (2/4) Epoch 1, batch 22550, loss[loss=0.323, simple_loss=0.3621, pruned_loss=0.142, over 21556.00 frames. ], tot_loss[loss=0.3129, simple_loss=0.3605, pruned_loss=0.1326, over 4276815.15 frames. ], batch size: 548, lr: 2.53e-02, grad_scale: 64.0 2023-06-18 09:12:46,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.85 vs. limit=15.0 2023-06-18 09:12:49,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=135300.0, ans=10.0 2023-06-18 09:13:01,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=135360.0, ans=0.125 2023-06-18 09:13:27,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=135420.0, ans=0.125 2023-06-18 09:14:46,432 INFO [train.py:996] (2/4) Epoch 1, batch 22600, loss[loss=0.3055, simple_loss=0.3572, pruned_loss=0.1269, over 21857.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3645, pruned_loss=0.1335, over 4281793.44 frames. ], batch size: 298, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 09:15:40,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=135660.0, ans=0.125 2023-06-18 09:16:04,382 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.237e+02 4.063e+02 5.145e+02 1.049e+03, threshold=8.126e+02, percent-clipped=2.0 2023-06-18 09:16:04,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=135780.0, ans=0.0 2023-06-18 09:17:09,574 INFO [train.py:996] (2/4) Epoch 1, batch 22650, loss[loss=0.2658, simple_loss=0.3248, pruned_loss=0.1034, over 19885.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3602, pruned_loss=0.1329, over 4273711.56 frames. 
], batch size: 703, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:17:14,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=135900.0, ans=0.2 2023-06-18 09:17:47,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=135960.0, ans=0.125 2023-06-18 09:18:08,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=136020.0, ans=0.0 2023-06-18 09:18:55,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136080.0, ans=0.1 2023-06-18 09:19:32,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=136200.0, ans=0.0 2023-06-18 09:19:33,374 INFO [train.py:996] (2/4) Epoch 1, batch 22700, loss[loss=0.3456, simple_loss=0.3589, pruned_loss=0.1662, over 21215.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3545, pruned_loss=0.1311, over 4255168.06 frames. ], batch size: 471, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:21:14,816 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.452e+02 4.145e+02 4.705e+02 7.533e+02, threshold=8.290e+02, percent-clipped=0.0 2023-06-18 09:21:35,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=136440.0, ans=0.0 2023-06-18 09:21:36,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=136440.0, ans=0.125 2023-06-18 09:21:56,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=136440.0, ans=0.1 2023-06-18 09:22:06,623 INFO [train.py:996] (2/4) Epoch 1, batch 22750, loss[loss=0.3395, simple_loss=0.3855, pruned_loss=0.1467, over 21849.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3564, pruned_loss=0.1341, over 4261499.32 frames. ], batch size: 247, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:23:38,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136620.0, ans=0.1 2023-06-18 09:23:46,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=136680.0, ans=0.05 2023-06-18 09:23:53,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136680.0, ans=0.1 2023-06-18 09:23:55,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-18 09:24:12,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=136680.0, ans=0.125 2023-06-18 09:24:26,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=136740.0, ans=0.125 2023-06-18 09:24:31,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=136740.0, ans=0.1 2023-06-18 09:24:40,989 INFO [train.py:996] (2/4) Epoch 1, batch 22800, loss[loss=0.311, simple_loss=0.3557, pruned_loss=0.1332, over 21754.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3612, pruned_loss=0.1372, over 4273090.89 frames. 
], batch size: 298, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:26:06,750 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 3.270e+02 3.833e+02 4.673e+02 9.508e+02, threshold=7.666e+02, percent-clipped=3.0 2023-06-18 09:27:12,212 INFO [train.py:996] (2/4) Epoch 1, batch 22850, loss[loss=0.2875, simple_loss=0.3342, pruned_loss=0.1204, over 21858.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3566, pruned_loss=0.1358, over 4264835.02 frames. ], batch size: 118, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:28:34,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=137220.0, ans=0.125 2023-06-18 09:28:37,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-18 09:29:39,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=137340.0, ans=0.2 2023-06-18 09:29:46,659 INFO [train.py:996] (2/4) Epoch 1, batch 22900, loss[loss=0.2988, simple_loss=0.3936, pruned_loss=0.102, over 21781.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3564, pruned_loss=0.1338, over 4271743.62 frames. ], batch size: 332, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:30:02,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=137400.0, ans=0.09899494936611666 2023-06-18 09:30:05,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137460.0, ans=0.1 2023-06-18 09:31:06,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137520.0, ans=0.1 2023-06-18 09:31:37,292 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.304e+02 3.880e+02 4.906e+02 7.646e+02, threshold=7.759e+02, percent-clipped=0.0 2023-06-18 09:32:26,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=137700.0, ans=0.125 2023-06-18 09:32:27,088 INFO [train.py:996] (2/4) Epoch 1, batch 22950, loss[loss=0.3032, simple_loss=0.4036, pruned_loss=0.1014, over 21758.00 frames. ], tot_loss[loss=0.3172, simple_loss=0.3709, pruned_loss=0.1317, over 4279128.07 frames. ], batch size: 332, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:33:34,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=137760.0, ans=0.2 2023-06-18 09:34:03,053 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-18 09:35:12,973 INFO [train.py:996] (2/4) Epoch 1, batch 23000, loss[loss=0.3, simple_loss=0.3442, pruned_loss=0.1279, over 21801.00 frames. ], tot_loss[loss=0.3149, simple_loss=0.3726, pruned_loss=0.1286, over 4280513.25 frames. ], batch size: 247, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:35:14,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=138000.0, ans=0.125 2023-06-18 09:36:32,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. 
limit=15.0 2023-06-18 09:36:49,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=138180.0, ans=0.125 2023-06-18 09:36:53,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.985e+02 3.505e+02 4.304e+02 7.318e+02, threshold=7.010e+02, percent-clipped=0.0 2023-06-18 09:37:09,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=138180.0, ans=0.125 2023-06-18 09:38:12,249 INFO [train.py:996] (2/4) Epoch 1, batch 23050, loss[loss=0.3385, simple_loss=0.3882, pruned_loss=0.1444, over 21388.00 frames. ], tot_loss[loss=0.3192, simple_loss=0.3745, pruned_loss=0.132, over 4285229.57 frames. ], batch size: 159, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:38:23,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=138300.0, ans=0.125 2023-06-18 09:39:06,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-18 09:40:18,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=138540.0, ans=0.2 2023-06-18 09:40:27,146 INFO [train.py:996] (2/4) Epoch 1, batch 23100, loss[loss=0.2963, simple_loss=0.3304, pruned_loss=0.1311, over 21585.00 frames. ], tot_loss[loss=0.3201, simple_loss=0.3711, pruned_loss=0.1345, over 4273570.01 frames. ], batch size: 415, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:41:19,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=138660.0, ans=0.2 2023-06-18 09:41:52,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=138720.0, ans=0.125 2023-06-18 09:42:11,465 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.111e+02 3.602e+02 4.415e+02 8.420e+02, threshold=7.204e+02, percent-clipped=6.0 2023-06-18 09:42:12,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=138780.0, ans=0.125 2023-06-18 09:42:24,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=138780.0, ans=0.125 2023-06-18 09:43:09,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=138840.0, ans=0.125 2023-06-18 09:43:11,444 INFO [train.py:996] (2/4) Epoch 1, batch 23150, loss[loss=0.3346, simple_loss=0.3876, pruned_loss=0.1409, over 21877.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.3635, pruned_loss=0.1327, over 4280672.60 frames. 
], batch size: 107, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:43:14,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=138900.0, ans=0.0 2023-06-18 09:44:35,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=139080.0, ans=0.125 2023-06-18 09:45:03,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=139140.0, ans=0.125 2023-06-18 09:45:38,670 INFO [train.py:996] (2/4) Epoch 1, batch 23200, loss[loss=0.3518, simple_loss=0.3827, pruned_loss=0.1605, over 21791.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3624, pruned_loss=0.1342, over 4286629.66 frames. ], batch size: 441, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:45:45,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=139200.0, ans=0.125 2023-06-18 09:46:09,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=139200.0, ans=22.5 2023-06-18 09:46:56,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=139320.0, ans=0.125 2023-06-18 09:47:28,184 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.236e+02 3.747e+02 4.592e+02 8.495e+02, threshold=7.495e+02, percent-clipped=1.0 2023-06-18 09:47:30,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=139380.0, ans=0.0 2023-06-18 09:48:05,952 INFO [train.py:996] (2/4) Epoch 1, batch 23250, loss[loss=0.3933, simple_loss=0.4103, pruned_loss=0.1881, over 21665.00 frames. ], tot_loss[loss=0.3166, simple_loss=0.3623, pruned_loss=0.1354, over 4298230.13 frames. ], batch size: 507, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:48:20,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-06-18 09:49:08,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=139620.0, ans=0.125 2023-06-18 09:50:02,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=139680.0, ans=12.0 2023-06-18 09:50:39,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=139800.0, ans=0.125 2023-06-18 09:50:40,070 INFO [train.py:996] (2/4) Epoch 1, batch 23300, loss[loss=0.2993, simple_loss=0.3745, pruned_loss=0.1121, over 21183.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.371, pruned_loss=0.1377, over 4298658.11 frames. ], batch size: 159, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:52:00,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=139860.0, ans=0.0 2023-06-18 09:52:33,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.238e+02 4.090e+02 5.618e+02 9.282e+02, threshold=8.181e+02, percent-clipped=7.0 2023-06-18 09:52:48,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. 
limit=15.0 2023-06-18 09:53:22,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=140040.0, ans=0.0 2023-06-18 09:53:25,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140040.0, ans=0.1 2023-06-18 09:53:27,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=140100.0, ans=0.1 2023-06-18 09:53:27,947 INFO [train.py:996] (2/4) Epoch 1, batch 23350, loss[loss=0.2669, simple_loss=0.3333, pruned_loss=0.1003, over 21740.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3742, pruned_loss=0.1363, over 4280015.17 frames. ], batch size: 371, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:53:56,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.24 vs. limit=10.0 2023-06-18 09:54:42,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0 2023-06-18 09:55:33,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=140280.0, ans=0.0 2023-06-18 09:55:41,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=22.5 2023-06-18 09:55:53,815 INFO [train.py:996] (2/4) Epoch 1, batch 23400, loss[loss=0.3207, simple_loss=0.3597, pruned_loss=0.1409, over 21259.00 frames. ], tot_loss[loss=0.31, simple_loss=0.3629, pruned_loss=0.1285, over 4277511.52 frames. ], batch size: 608, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:56:04,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=140400.0, ans=0.125 2023-06-18 09:56:52,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.06 vs. 
limit=6.0 2023-06-18 09:57:05,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140460.0, ans=0.1 2023-06-18 09:57:08,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=140520.0, ans=0.125 2023-06-18 09:57:31,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=140520.0, ans=0.0 2023-06-18 09:57:45,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.809e+02 3.311e+02 4.029e+02 8.339e+02, threshold=6.622e+02, percent-clipped=1.0 2023-06-18 09:58:07,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=140580.0, ans=0.125 2023-06-18 09:58:26,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=140640.0, ans=0.125 2023-06-18 09:58:44,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=140640.0, ans=0.125 2023-06-18 09:58:48,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=140700.0, ans=0.125 2023-06-18 09:58:52,878 INFO [train.py:996] (2/4) Epoch 1, batch 23450, loss[loss=0.3598, simple_loss=0.4117, pruned_loss=0.1539, over 21845.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3646, pruned_loss=0.132, over 4273976.29 frames. ], batch size: 118, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 09:59:08,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=140700.0, ans=0.1 2023-06-18 09:59:37,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=140760.0, ans=0.2 2023-06-18 10:00:00,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=140820.0, ans=0.0 2023-06-18 10:00:06,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=140820.0, ans=0.125 2023-06-18 10:00:50,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=140940.0, ans=0.125 2023-06-18 10:01:13,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=140940.0, ans=0.0 2023-06-18 10:01:28,659 INFO [train.py:996] (2/4) Epoch 1, batch 23500, loss[loss=0.3307, simple_loss=0.3687, pruned_loss=0.1464, over 21668.00 frames. ], tot_loss[loss=0.3186, simple_loss=0.3668, pruned_loss=0.1352, over 4285572.18 frames. 
], batch size: 263, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 10:01:55,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=141060.0, ans=0.125 2023-06-18 10:02:01,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=141060.0, ans=0.0 2023-06-18 10:02:17,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=141060.0, ans=0.125 2023-06-18 10:02:34,730 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:03:03,291 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.176e+02 3.933e+02 4.936e+02 8.356e+02, threshold=7.866e+02, percent-clipped=7.0 2023-06-18 10:03:30,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=141240.0, ans=0.025 2023-06-18 10:03:32,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=141240.0, ans=0.2 2023-06-18 10:03:34,755 INFO [train.py:996] (2/4) Epoch 1, batch 23550, loss[loss=0.2912, simple_loss=0.3365, pruned_loss=0.123, over 21745.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3606, pruned_loss=0.1341, over 4287921.93 frames. ], batch size: 112, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 10:04:07,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=141300.0, ans=0.05 2023-06-18 10:06:22,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=141600.0, ans=0.07 2023-06-18 10:06:23,366 INFO [train.py:996] (2/4) Epoch 1, batch 23600, loss[loss=0.3583, simple_loss=0.3985, pruned_loss=0.1591, over 21710.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3615, pruned_loss=0.1339, over 4273068.57 frames. ], batch size: 351, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 10:07:52,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=141720.0, ans=0.125 2023-06-18 10:08:05,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=141780.0, ans=0.0 2023-06-18 10:08:06,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=141780.0, ans=0.0 2023-06-18 10:08:07,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=141780.0, ans=0.125 2023-06-18 10:08:08,866 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 3.095e+02 3.693e+02 4.407e+02 6.966e+02, threshold=7.385e+02, percent-clipped=0.0 2023-06-18 10:08:09,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=15.0 2023-06-18 10:08:54,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.50 vs. limit=22.5 2023-06-18 10:09:12,735 INFO [train.py:996] (2/4) Epoch 1, batch 23650, loss[loss=0.1812, simple_loss=0.2399, pruned_loss=0.06127, over 17042.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3614, pruned_loss=0.1311, over 4274979.74 frames. 
], batch size: 61, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:11:35,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=142140.0, ans=0.2 2023-06-18 10:11:51,767 INFO [train.py:996] (2/4) Epoch 1, batch 23700, loss[loss=0.3219, simple_loss=0.3711, pruned_loss=0.1364, over 21295.00 frames. ], tot_loss[loss=0.3112, simple_loss=0.3633, pruned_loss=0.1295, over 4274085.22 frames. ], batch size: 143, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:12:44,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=142260.0, ans=0.025 2023-06-18 10:13:02,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-18 10:13:03,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=142320.0, ans=0.125 2023-06-18 10:13:44,068 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.277e+02 3.827e+02 4.804e+02 8.451e+02, threshold=7.655e+02, percent-clipped=1.0 2023-06-18 10:13:44,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=142380.0, ans=0.1 2023-06-18 10:14:20,774 INFO [train.py:996] (2/4) Epoch 1, batch 23750, loss[loss=0.2587, simple_loss=0.3508, pruned_loss=0.08328, over 21891.00 frames. ], tot_loss[loss=0.3152, simple_loss=0.3669, pruned_loss=0.1317, over 4277242.63 frames. ], batch size: 372, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:15:16,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=142560.0, ans=0.0 2023-06-18 10:15:27,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=142560.0, ans=0.2 2023-06-18 10:17:15,879 INFO [train.py:996] (2/4) Epoch 1, batch 23800, loss[loss=0.3477, simple_loss=0.4119, pruned_loss=0.1418, over 21745.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3652, pruned_loss=0.1283, over 4275084.33 frames. ], batch size: 332, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:18:50,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=142920.0, ans=0.2 2023-06-18 10:19:06,965 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 3.371e+02 4.073e+02 5.438e+02 8.873e+02, threshold=8.146e+02, percent-clipped=4.0 2023-06-18 10:19:07,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=142980.0, ans=0.125 2023-06-18 10:19:12,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=142980.0, ans=0.125 2023-06-18 10:19:23,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=143040.0, ans=0.04949747468305833 2023-06-18 10:20:03,707 INFO [train.py:996] (2/4) Epoch 1, batch 23850, loss[loss=0.3682, simple_loss=0.4147, pruned_loss=0.1609, over 21198.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3758, pruned_loss=0.1326, over 4275850.33 frames. 
], batch size: 143, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:20:27,140 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.13 vs. limit=22.5 2023-06-18 10:21:26,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=143220.0, ans=0.125 2023-06-18 10:22:00,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=12.0 2023-06-18 10:22:16,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-18 10:22:28,687 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-18 10:22:36,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-18 10:22:38,204 INFO [train.py:996] (2/4) Epoch 1, batch 23900, loss[loss=0.2877, simple_loss=0.3672, pruned_loss=0.1041, over 21344.00 frames. ], tot_loss[loss=0.3282, simple_loss=0.3842, pruned_loss=0.136, over 4277808.32 frames. ], batch size: 131, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:24:07,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=143520.0, ans=0.0 2023-06-18 10:24:14,311 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.646e+02 4.391e+02 5.259e+02 8.608e+02, threshold=8.781e+02, percent-clipped=3.0 2023-06-18 10:24:20,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=143580.0, ans=0.125 2023-06-18 10:25:07,018 INFO [train.py:996] (2/4) Epoch 1, batch 23950, loss[loss=0.2751, simple_loss=0.3245, pruned_loss=0.1129, over 21641.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3775, pruned_loss=0.136, over 4274269.58 frames. ], batch size: 282, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:26:09,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=143820.0, ans=0.125 2023-06-18 10:26:14,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=143820.0, ans=0.1 2023-06-18 10:26:18,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=143880.0, ans=0.0 2023-06-18 10:26:52,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=10.0 2023-06-18 10:27:39,009 INFO [train.py:996] (2/4) Epoch 1, batch 24000, loss[loss=0.3472, simple_loss=0.3876, pruned_loss=0.1534, over 21708.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.3798, pruned_loss=0.14, over 4278336.82 frames. 
], batch size: 298, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:27:39,009 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 10:28:27,386 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.6661, 5.8317, 5.5453, 5.2097], device='cuda:2') 2023-06-18 10:28:35,859 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3093, simple_loss=0.4026, pruned_loss=0.108, over 1796401.00 frames. 2023-06-18 10:28:35,860 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 10:29:00,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144060.0, ans=0.1 2023-06-18 10:29:11,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=144120.0, ans=0.05 2023-06-18 10:29:38,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=144120.0, ans=0.2 2023-06-18 10:29:41,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=144120.0, ans=0.125 2023-06-18 10:29:54,933 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.332e+02 4.254e+02 5.266e+02 8.160e+02, threshold=8.508e+02, percent-clipped=0.0 2023-06-18 10:30:07,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=144180.0, ans=0.05 2023-06-18 10:30:57,456 INFO [train.py:996] (2/4) Epoch 1, batch 24050, loss[loss=0.2445, simple_loss=0.3114, pruned_loss=0.08879, over 21422.00 frames. ], tot_loss[loss=0.3304, simple_loss=0.3805, pruned_loss=0.1402, over 4272401.91 frames. ], batch size: 176, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:33:33,010 INFO [train.py:996] (2/4) Epoch 1, batch 24100, loss[loss=0.319, simple_loss=0.3803, pruned_loss=0.1288, over 21786.00 frames. ], tot_loss[loss=0.3261, simple_loss=0.3797, pruned_loss=0.1363, over 4271046.10 frames. ], batch size: 247, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:34:18,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=144660.0, ans=0.025 2023-06-18 10:35:11,620 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.942e+02 3.455e+02 4.270e+02 6.520e+02, threshold=6.911e+02, percent-clipped=0.0 2023-06-18 10:35:12,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-18 10:35:41,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.61 vs. limit=10.0 2023-06-18 10:35:52,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-18 10:36:07,970 INFO [train.py:996] (2/4) Epoch 1, batch 24150, loss[loss=0.316, simple_loss=0.3575, pruned_loss=0.1372, over 21505.00 frames. ], tot_loss[loss=0.329, simple_loss=0.3798, pruned_loss=0.1391, over 4276897.99 frames. 
], batch size: 194, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:36:08,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=144900.0, ans=0.0 2023-06-18 10:36:08,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=144900.0, ans=0.0 2023-06-18 10:36:09,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=144900.0, ans=0.125 2023-06-18 10:36:37,486 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:37:00,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=22.5 2023-06-18 10:37:23,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=145020.0, ans=0.0 2023-06-18 10:37:26,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-18 10:37:45,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=145080.0, ans=0.125 2023-06-18 10:37:57,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=145080.0, ans=0.125 2023-06-18 10:38:51,891 INFO [train.py:996] (2/4) Epoch 1, batch 24200, loss[loss=0.3017, simple_loss=0.3652, pruned_loss=0.1191, over 21678.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.3801, pruned_loss=0.1394, over 4282905.93 frames. ], batch size: 247, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:38:59,795 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.92 vs. limit=22.5 2023-06-18 10:40:52,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.211e+02 3.713e+02 4.358e+02 6.625e+02, threshold=7.425e+02, percent-clipped=0.0 2023-06-18 10:40:58,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=145380.0, ans=0.125 2023-06-18 10:41:08,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=145440.0, ans=0.125 2023-06-18 10:41:46,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.32 vs. limit=22.5 2023-06-18 10:41:47,347 INFO [train.py:996] (2/4) Epoch 1, batch 24250, loss[loss=0.2855, simple_loss=0.3715, pruned_loss=0.09978, over 21693.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.3755, pruned_loss=0.1302, over 4272586.60 frames. ], batch size: 414, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:42:26,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. 
limit=15.0 2023-06-18 10:42:38,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=145560.0, ans=0.125 2023-06-18 10:42:45,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=145560.0, ans=0.0 2023-06-18 10:43:13,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. limit=10.0 2023-06-18 10:44:08,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.61 vs. limit=22.5 2023-06-18 10:44:28,669 INFO [train.py:996] (2/4) Epoch 1, batch 24300, loss[loss=0.1837, simple_loss=0.2497, pruned_loss=0.05886, over 21128.00 frames. ], tot_loss[loss=0.3045, simple_loss=0.3654, pruned_loss=0.1218, over 4273156.49 frames. ], batch size: 143, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:44:53,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=145860.0, ans=0.125 2023-06-18 10:44:53,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-18 10:45:25,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=145860.0, ans=0.2 2023-06-18 10:45:56,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-18 10:46:15,197 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 2.565e+02 3.464e+02 4.560e+02 9.381e+02, threshold=6.928e+02, percent-clipped=3.0 2023-06-18 10:46:33,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.51 vs. limit=6.0 2023-06-18 10:46:52,481 INFO [train.py:996] (2/4) Epoch 1, batch 24350, loss[loss=0.3364, simple_loss=0.3774, pruned_loss=0.1477, over 21349.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3614, pruned_loss=0.1223, over 4281139.82 frames. ], batch size: 176, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:47:31,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=146160.0, ans=0.0 2023-06-18 10:48:39,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=146280.0, ans=0.05 2023-06-18 10:49:26,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=146340.0, ans=0.0 2023-06-18 10:49:48,864 INFO [train.py:996] (2/4) Epoch 1, batch 24400, loss[loss=0.3705, simple_loss=0.462, pruned_loss=0.1395, over 19757.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.3688, pruned_loss=0.1284, over 4283895.48 frames. ], batch size: 702, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:49:54,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=12.0 2023-06-18 10:50:40,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=146460.0, ans=0.125 2023-06-18 10:51:09,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-18 10:51:09,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.327e+02 4.021e+02 4.896e+02 7.862e+02, threshold=8.041e+02, percent-clipped=3.0 2023-06-18 10:51:10,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=146580.0, ans=0.125 2023-06-18 10:51:33,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=146580.0, ans=0.0 2023-06-18 10:51:55,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146640.0, ans=0.1 2023-06-18 10:52:18,230 INFO [train.py:996] (2/4) Epoch 1, batch 24450, loss[loss=0.284, simple_loss=0.349, pruned_loss=0.1095, over 21635.00 frames. ], tot_loss[loss=0.3187, simple_loss=0.3738, pruned_loss=0.1318, over 4281462.24 frames. ], batch size: 247, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:52:44,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=146700.0, ans=0.5 2023-06-18 10:52:58,565 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.58 vs. limit=22.5 2023-06-18 10:53:19,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146820.0, ans=0.1 2023-06-18 10:54:39,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=146940.0, ans=0.04949747468305833 2023-06-18 10:54:48,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=147000.0, ans=0.125 2023-06-18 10:54:49,354 INFO [train.py:996] (2/4) Epoch 1, batch 24500, loss[loss=0.3126, simple_loss=0.3649, pruned_loss=0.1302, over 21921.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3729, pruned_loss=0.1316, over 4284789.65 frames. 
], batch size: 351, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 10:54:49,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147000.0, ans=0.1 2023-06-18 10:54:49,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=147000.0, ans=0.125 2023-06-18 10:55:03,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=147000.0, ans=0.125 2023-06-18 10:55:03,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147000.0, ans=0.1 2023-06-18 10:55:29,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=147060.0, ans=0.125 2023-06-18 10:55:49,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=147060.0, ans=0.125 2023-06-18 10:56:41,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=147180.0, ans=0.125 2023-06-18 10:56:44,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=147180.0, ans=0.125 2023-06-18 10:56:45,341 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.166e+02 3.756e+02 4.568e+02 6.399e+02, threshold=7.511e+02, percent-clipped=0.0 2023-06-18 10:56:46,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.45 vs. limit=15.0 2023-06-18 10:57:32,112 INFO [train.py:996] (2/4) Epoch 1, batch 24550, loss[loss=0.3305, simple_loss=0.3874, pruned_loss=0.1368, over 21485.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3768, pruned_loss=0.1357, over 4286336.24 frames. ], batch size: 211, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 10:58:28,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=147360.0, ans=0.125 2023-06-18 10:58:36,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=147420.0, ans=0.0 2023-06-18 10:59:04,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.56 vs. limit=15.0 2023-06-18 10:59:09,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=147480.0, ans=0.0 2023-06-18 11:00:07,927 INFO [train.py:996] (2/4) Epoch 1, batch 24600, loss[loss=0.3279, simple_loss=0.378, pruned_loss=0.1389, over 21198.00 frames. ], tot_loss[loss=0.3208, simple_loss=0.3708, pruned_loss=0.1354, over 4275599.73 frames. 
], batch size: 143, lr: 2.43e-02, grad_scale: 64.0 2023-06-18 11:01:26,704 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.629e+02 3.281e+02 4.105e+02 4.989e+02 7.810e+02, threshold=8.210e+02, percent-clipped=2.0 2023-06-18 11:01:30,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=147780.0, ans=0.0 2023-06-18 11:02:05,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=147840.0, ans=0.125 2023-06-18 11:02:16,248 INFO [train.py:996] (2/4) Epoch 1, batch 24650, loss[loss=0.2539, simple_loss=0.2975, pruned_loss=0.1052, over 21488.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.3606, pruned_loss=0.1324, over 4262141.93 frames. ], batch size: 213, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 11:03:10,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=147960.0, ans=0.95 2023-06-18 11:03:29,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=148020.0, ans=0.125 2023-06-18 11:03:39,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.36 vs. limit=15.0 2023-06-18 11:03:40,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=148020.0, ans=0.0 2023-06-18 11:04:22,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148140.0, ans=0.1 2023-06-18 11:04:35,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=148140.0, ans=0.125 2023-06-18 11:04:37,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148140.0, ans=0.1 2023-06-18 11:04:48,084 INFO [train.py:996] (2/4) Epoch 1, batch 24700, loss[loss=0.2793, simple_loss=0.3327, pruned_loss=0.113, over 21828.00 frames. ], tot_loss[loss=0.3094, simple_loss=0.3591, pruned_loss=0.1299, over 4263666.51 frames. ], batch size: 118, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 11:05:25,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=148260.0, ans=0.2 2023-06-18 11:05:38,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=148320.0, ans=0.0 2023-06-18 11:06:10,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.936e+02 3.514e+02 4.339e+02 5.892e+02, threshold=7.028e+02, percent-clipped=0.0 2023-06-18 11:06:38,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=148440.0, ans=0.125 2023-06-18 11:07:20,637 INFO [train.py:996] (2/4) Epoch 1, batch 24750, loss[loss=0.2516, simple_loss=0.2979, pruned_loss=0.1026, over 21218.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3512, pruned_loss=0.1253, over 4257895.58 frames. 
], batch size: 549, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:08:55,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=148680.0, ans=0.125 2023-06-18 11:08:56,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=148680.0, ans=0.125 2023-06-18 11:09:09,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148680.0, ans=0.1 2023-06-18 11:09:09,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=148680.0, ans=0.125 2023-06-18 11:09:13,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=148740.0, ans=0.125 2023-06-18 11:09:13,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148740.0, ans=0.1 2023-06-18 11:09:19,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=148740.0, ans=0.0 2023-06-18 11:09:43,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2023-06-18 11:09:44,198 INFO [train.py:996] (2/4) Epoch 1, batch 24800, loss[loss=0.3265, simple_loss=0.3649, pruned_loss=0.144, over 21450.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3473, pruned_loss=0.1245, over 4262011.45 frames. ], batch size: 211, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:10:08,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=148860.0, ans=0.125 2023-06-18 11:10:17,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=148860.0, ans=0.125 2023-06-18 11:10:20,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=148860.0, ans=0.125 2023-06-18 11:10:22,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.61 vs. limit=5.0 2023-06-18 11:11:04,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=148980.0, ans=0.125 2023-06-18 11:11:21,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.310e+02 3.842e+02 5.008e+02 1.003e+03, threshold=7.684e+02, percent-clipped=5.0 2023-06-18 11:11:26,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=148980.0, ans=0.125 2023-06-18 11:11:30,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=149040.0, ans=0.0 2023-06-18 11:11:56,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=149040.0, ans=0.125 2023-06-18 11:12:17,381 INFO [train.py:996] (2/4) Epoch 1, batch 24850, loss[loss=0.3243, simple_loss=0.3801, pruned_loss=0.1343, over 21738.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3489, pruned_loss=0.1275, over 4267698.98 frames. 
], batch size: 414, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:12:30,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=149100.0, ans=0.125 2023-06-18 11:12:33,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=149100.0, ans=0.0 2023-06-18 11:13:38,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=149280.0, ans=0.125 2023-06-18 11:14:26,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=149340.0, ans=0.125 2023-06-18 11:14:38,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=149400.0, ans=0.2 2023-06-18 11:14:39,666 INFO [train.py:996] (2/4) Epoch 1, batch 24900, loss[loss=0.3607, simple_loss=0.4093, pruned_loss=0.1561, over 21410.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3524, pruned_loss=0.1286, over 4266592.06 frames. ], batch size: 143, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:14:47,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=149400.0, ans=0.0 2023-06-18 11:15:15,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=149460.0, ans=0.125 2023-06-18 11:15:45,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=149520.0, ans=0.0 2023-06-18 11:15:50,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=149520.0, ans=0.125 2023-06-18 11:16:38,258 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 3.164e+02 3.727e+02 4.575e+02 6.932e+02, threshold=7.454e+02, percent-clipped=0.0 2023-06-18 11:16:39,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-18 11:16:41,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.17 vs. limit=10.0 2023-06-18 11:17:28,502 INFO [train.py:996] (2/4) Epoch 1, batch 24950, loss[loss=0.3682, simple_loss=0.4096, pruned_loss=0.1634, over 21300.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.3615, pruned_loss=0.1335, over 4264211.97 frames. 
], batch size: 548, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 11:17:56,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=149700.0, ans=0.04949747468305833 2023-06-18 11:18:20,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=149760.0, ans=0.0 2023-06-18 11:18:27,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=149820.0, ans=0.125 2023-06-18 11:19:16,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149880.0, ans=0.1 2023-06-18 11:20:06,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=149940.0, ans=0.0 2023-06-18 11:20:09,050 INFO [train.py:996] (2/4) Epoch 1, batch 25000, loss[loss=0.2979, simple_loss=0.3493, pruned_loss=0.1233, over 21836.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3708, pruned_loss=0.1377, over 4255541.06 frames. ], batch size: 107, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 11:21:30,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.58 vs. limit=15.0 2023-06-18 11:21:31,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=150120.0, ans=0.125 2023-06-18 11:21:50,977 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 3.123e+02 3.491e+02 4.208e+02 6.099e+02, threshold=6.982e+02, percent-clipped=0.0 2023-06-18 11:22:36,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150240.0, ans=0.1 2023-06-18 11:22:50,199 INFO [train.py:996] (2/4) Epoch 1, batch 25050, loss[loss=0.2886, simple_loss=0.3274, pruned_loss=0.1249, over 21637.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3625, pruned_loss=0.1352, over 4257602.76 frames. ], batch size: 298, lr: 2.41e-02, grad_scale: 16.0 2023-06-18 11:22:55,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.22 vs. limit=6.0 2023-06-18 11:23:22,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150300.0, ans=0.1 2023-06-18 11:23:29,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=150360.0, ans=0.5 2023-06-18 11:24:50,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=150480.0, ans=0.125 2023-06-18 11:24:55,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=150540.0, ans=0.0 2023-06-18 11:25:05,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=150540.0, ans=0.125 2023-06-18 11:25:21,803 INFO [train.py:996] (2/4) Epoch 1, batch 25100, loss[loss=0.2659, simple_loss=0.3129, pruned_loss=0.1094, over 21600.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3552, pruned_loss=0.1326, over 4265640.46 frames. 
], batch size: 298, lr: 2.41e-02, grad_scale: 16.0 2023-06-18 11:25:29,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150600.0, ans=0.1 2023-06-18 11:25:35,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=150600.0, ans=0.125 2023-06-18 11:26:31,196 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:26:31,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=9.15 vs. limit=6.0 2023-06-18 11:27:00,085 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.018e+02 3.792e+02 4.723e+02 7.366e+02, threshold=7.583e+02, percent-clipped=2.0 2023-06-18 11:27:17,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150840.0, ans=0.1 2023-06-18 11:27:54,109 INFO [train.py:996] (2/4) Epoch 1, batch 25150, loss[loss=0.2938, simple_loss=0.3513, pruned_loss=0.1181, over 21507.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3563, pruned_loss=0.1288, over 4263467.44 frames. ], batch size: 211, lr: 2.41e-02, grad_scale: 16.0 2023-06-18 11:28:01,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=150900.0, ans=0.0 2023-06-18 11:28:07,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=150960.0, ans=0.0 2023-06-18 11:28:13,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.65 vs. limit=6.0 2023-06-18 11:29:08,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=151080.0, ans=0.125 2023-06-18 11:29:21,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=151080.0, ans=0.125 2023-06-18 11:29:34,476 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-18 11:30:17,530 INFO [train.py:996] (2/4) Epoch 1, batch 25200, loss[loss=0.3374, simple_loss=0.4054, pruned_loss=0.1347, over 21669.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3554, pruned_loss=0.1255, over 4259725.72 frames. ], batch size: 441, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:31:11,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=151320.0, ans=0.2 2023-06-18 11:31:55,945 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.820e+02 3.433e+02 4.407e+02 9.386e+02, threshold=6.866e+02, percent-clipped=4.0 2023-06-18 11:32:40,294 INFO [train.py:996] (2/4) Epoch 1, batch 25250, loss[loss=0.3275, simple_loss=0.3575, pruned_loss=0.1487, over 21583.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3531, pruned_loss=0.1236, over 4249759.58 frames. 
], batch size: 415, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:34:02,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=151620.0, ans=0.2 2023-06-18 11:34:14,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151680.0, ans=0.1 2023-06-18 11:34:17,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-18 11:34:35,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=151740.0, ans=0.2 2023-06-18 11:34:46,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=151740.0, ans=0.125 2023-06-18 11:35:17,337 INFO [train.py:996] (2/4) Epoch 1, batch 25300, loss[loss=0.2851, simple_loss=0.3118, pruned_loss=0.1292, over 20162.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.349, pruned_loss=0.1233, over 4241642.39 frames. ], batch size: 703, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:36:00,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=151860.0, ans=0.0 2023-06-18 11:36:12,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=151920.0, ans=0.0 2023-06-18 11:36:24,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=151920.0, ans=0.0 2023-06-18 11:36:44,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0 2023-06-18 11:37:00,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=151980.0, ans=0.0 2023-06-18 11:37:01,672 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.038e+02 3.574e+02 4.341e+02 5.884e+02, threshold=7.148e+02, percent-clipped=0.0 2023-06-18 11:37:36,657 INFO [train.py:996] (2/4) Epoch 1, batch 25350, loss[loss=0.3227, simple_loss=0.353, pruned_loss=0.1462, over 20106.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3513, pruned_loss=0.1235, over 4228848.58 frames. 
], batch size: 703, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:38:17,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=152160.0, ans=0.125 2023-06-18 11:38:40,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=152220.0, ans=0.125 2023-06-18 11:39:01,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152220.0, ans=0.1 2023-06-18 11:39:18,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=152280.0, ans=0.125 2023-06-18 11:39:28,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=152280.0, ans=0.125 2023-06-18 11:39:36,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=152340.0, ans=0.0 2023-06-18 11:39:50,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=152340.0, ans=0.0 2023-06-18 11:40:10,522 INFO [train.py:996] (2/4) Epoch 1, batch 25400, loss[loss=0.2825, simple_loss=0.3696, pruned_loss=0.09769, over 20768.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3483, pruned_loss=0.1223, over 4230159.89 frames. ], batch size: 607, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:40:23,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=15.0 2023-06-18 11:40:27,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=152400.0, ans=0.0 2023-06-18 11:40:28,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=152400.0, ans=0.025 2023-06-18 11:41:38,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-18 11:41:56,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=152580.0, ans=0.125 2023-06-18 11:41:58,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 3.135e+02 3.772e+02 5.152e+02 8.447e+02, threshold=7.545e+02, percent-clipped=8.0 2023-06-18 11:42:11,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=152640.0, ans=0.015 2023-06-18 11:42:13,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=152640.0, ans=10.0 2023-06-18 11:42:48,240 INFO [train.py:996] (2/4) Epoch 1, batch 25450, loss[loss=0.3916, simple_loss=0.4077, pruned_loss=0.1878, over 21747.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3497, pruned_loss=0.1243, over 4229919.98 frames. 
], batch size: 508, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:43:11,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=152760.0, ans=0.2 2023-06-18 11:44:12,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=152820.0, ans=0.125 2023-06-18 11:45:17,146 INFO [train.py:996] (2/4) Epoch 1, batch 25500, loss[loss=0.2855, simple_loss=0.3521, pruned_loss=0.1095, over 21419.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3503, pruned_loss=0.1208, over 4240033.37 frames. ], batch size: 211, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:46:09,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=153060.0, ans=0.125 2023-06-18 11:46:44,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153120.0, ans=0.1 2023-06-18 11:47:14,258 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.935e+02 3.590e+02 4.450e+02 7.910e+02, threshold=7.180e+02, percent-clipped=1.0 2023-06-18 11:47:34,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=153180.0, ans=0.0 2023-06-18 11:48:09,538 INFO [train.py:996] (2/4) Epoch 1, batch 25550, loss[loss=0.2822, simple_loss=0.3626, pruned_loss=0.1009, over 21806.00 frames. ], tot_loss[loss=0.3003, simple_loss=0.3575, pruned_loss=0.1215, over 4244270.66 frames. ], batch size: 282, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:50:46,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=153540.0, ans=0.2 2023-06-18 11:51:05,017 INFO [train.py:996] (2/4) Epoch 1, batch 25600, loss[loss=0.3559, simple_loss=0.3993, pruned_loss=0.1562, over 21717.00 frames. ], tot_loss[loss=0.3055, simple_loss=0.3631, pruned_loss=0.124, over 4256856.25 frames. ], batch size: 351, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:51:07,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.32 vs. limit=22.5 2023-06-18 11:52:08,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=153720.0, ans=0.2 2023-06-18 11:52:56,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=153780.0, ans=0.0 2023-06-18 11:52:57,322 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 3.009e+02 3.848e+02 5.961e+02 1.110e+03, threshold=7.697e+02, percent-clipped=15.0 2023-06-18 11:52:58,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.15 vs. limit=15.0 2023-06-18 11:53:03,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=153780.0, ans=0.125 2023-06-18 11:53:35,678 INFO [train.py:996] (2/4) Epoch 1, batch 25650, loss[loss=0.3103, simple_loss=0.3631, pruned_loss=0.1287, over 20863.00 frames. ], tot_loss[loss=0.3092, simple_loss=0.3639, pruned_loss=0.1272, over 4253917.86 frames. ], batch size: 608, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 11:53:36,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.51 vs. 
limit=15.0 2023-06-18 11:53:47,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.41 vs. limit=15.0 2023-06-18 11:53:51,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.32 vs. limit=15.0 2023-06-18 11:53:52,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153900.0, ans=0.1 2023-06-18 11:55:29,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=154080.0, ans=0.125 2023-06-18 11:56:09,169 INFO [train.py:996] (2/4) Epoch 1, batch 25700, loss[loss=0.3862, simple_loss=0.4577, pruned_loss=0.1573, over 19842.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3614, pruned_loss=0.1284, over 4250753.06 frames. ], batch size: 702, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 11:57:18,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=154320.0, ans=0.1 2023-06-18 11:58:10,778 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 3.158e+02 3.682e+02 4.301e+02 9.092e+02, threshold=7.363e+02, percent-clipped=1.0 2023-06-18 11:58:17,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=154380.0, ans=0.0 2023-06-18 11:59:00,242 INFO [train.py:996] (2/4) Epoch 1, batch 25750, loss[loss=0.4246, simple_loss=0.4993, pruned_loss=0.1749, over 19915.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3677, pruned_loss=0.1325, over 4260917.51 frames. ], batch size: 702, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 11:59:25,393 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:01:06,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.10 vs. limit=22.5 2023-06-18 12:01:07,923 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.92 vs. limit=22.5 2023-06-18 12:01:50,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-18 12:01:57,093 INFO [train.py:996] (2/4) Epoch 1, batch 25800, loss[loss=0.4397, simple_loss=0.4561, pruned_loss=0.2116, over 21361.00 frames. ], tot_loss[loss=0.3292, simple_loss=0.3809, pruned_loss=0.1387, over 4263168.30 frames. 
], batch size: 507, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 12:02:43,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=154860.0, ans=0.125 2023-06-18 12:03:42,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=154920.0, ans=0.125 2023-06-18 12:03:46,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=154920.0, ans=10.0 2023-06-18 12:03:46,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=154920.0, ans=0.125 2023-06-18 12:03:56,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.535e+02 4.132e+02 5.213e+02 8.329e+02, threshold=8.265e+02, percent-clipped=2.0 2023-06-18 12:04:08,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=155040.0, ans=0.125 2023-06-18 12:05:02,198 INFO [train.py:996] (2/4) Epoch 1, batch 25850, loss[loss=0.2826, simple_loss=0.3284, pruned_loss=0.1184, over 21359.00 frames. ], tot_loss[loss=0.3288, simple_loss=0.3824, pruned_loss=0.1375, over 4261820.49 frames. ], batch size: 176, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 12:05:14,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=155100.0, ans=0.1 2023-06-18 12:05:18,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.12 vs. limit=6.0 2023-06-18 12:06:04,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=155220.0, ans=0.125 2023-06-18 12:07:10,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=155340.0, ans=0.2 2023-06-18 12:07:30,263 INFO [train.py:996] (2/4) Epoch 1, batch 25900, loss[loss=0.3144, simple_loss=0.3865, pruned_loss=0.1212, over 21575.00 frames. ], tot_loss[loss=0.3312, simple_loss=0.3845, pruned_loss=0.139, over 4269701.91 frames. ], batch size: 230, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 12:07:58,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=155400.0, ans=0.2 2023-06-18 12:07:58,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.20 vs. 
limit=22.5 2023-06-18 12:07:59,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=155400.0, ans=0.125 2023-06-18 12:08:09,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=155460.0, ans=0.0 2023-06-18 12:08:18,365 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=2.623e-03 2023-06-18 12:09:14,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=155520.0, ans=0.125 2023-06-18 12:09:37,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=155580.0, ans=0.0 2023-06-18 12:09:39,802 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.491e+02 3.281e+02 3.976e+02 5.347e+02 9.829e+02, threshold=7.952e+02, percent-clipped=3.0 2023-06-18 12:09:53,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=155640.0, ans=0.125 2023-06-18 12:10:18,125 INFO [train.py:996] (2/4) Epoch 1, batch 25950, loss[loss=0.3186, simple_loss=0.3771, pruned_loss=0.1301, over 21706.00 frames. ], tot_loss[loss=0.3367, simple_loss=0.3899, pruned_loss=0.1418, over 4279065.00 frames. ], batch size: 298, lr: 2.37e-02, grad_scale: 16.0 2023-06-18 12:10:57,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=155760.0, ans=0.0 2023-06-18 12:11:31,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=155820.0, ans=0.2 2023-06-18 12:12:26,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=22.5 2023-06-18 12:12:29,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=155940.0, ans=0.125 2023-06-18 12:12:33,071 INFO [train.py:996] (2/4) Epoch 1, batch 26000, loss[loss=0.3238, simple_loss=0.388, pruned_loss=0.1298, over 21621.00 frames. ], tot_loss[loss=0.3358, simple_loss=0.3909, pruned_loss=0.1403, over 4286333.62 frames. ], batch size: 263, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 12:12:55,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.04 vs. limit=12.0 2023-06-18 12:13:01,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=156000.0, ans=0.125 2023-06-18 12:13:29,191 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.79 vs. 
limit=6.0 2023-06-18 12:14:01,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=156120.0, ans=0.015 2023-06-18 12:14:44,906 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.144e+02 3.654e+02 4.482e+02 8.156e+02, threshold=7.307e+02, percent-clipped=1.0 2023-06-18 12:15:15,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=156240.0, ans=0.125 2023-06-18 12:15:22,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=156300.0, ans=0.0 2023-06-18 12:15:24,697 INFO [train.py:996] (2/4) Epoch 1, batch 26050, loss[loss=0.2945, simple_loss=0.3442, pruned_loss=0.1224, over 21379.00 frames. ], tot_loss[loss=0.3369, simple_loss=0.3905, pruned_loss=0.1417, over 4288570.30 frames. ], batch size: 176, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 12:16:52,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=156420.0, ans=0.125 2023-06-18 12:17:32,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=156540.0, ans=0.0 2023-06-18 12:17:35,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-18 12:17:38,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=156540.0, ans=0.025 2023-06-18 12:18:05,270 INFO [train.py:996] (2/4) Epoch 1, batch 26100, loss[loss=0.3522, simple_loss=0.3948, pruned_loss=0.1548, over 21899.00 frames. ], tot_loss[loss=0.3322, simple_loss=0.3834, pruned_loss=0.1404, over 4297013.62 frames. ], batch size: 107, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:18:12,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=156600.0, ans=0.125 2023-06-18 12:19:03,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156660.0, ans=0.1 2023-06-18 12:19:36,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=156780.0, ans=0.125 2023-06-18 12:19:41,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.360e+02 3.963e+02 4.876e+02 8.517e+02, threshold=7.926e+02, percent-clipped=3.0 2023-06-18 12:20:31,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=156840.0, ans=0.2 2023-06-18 12:20:46,679 INFO [train.py:996] (2/4) Epoch 1, batch 26150, loss[loss=0.3527, simple_loss=0.3916, pruned_loss=0.1569, over 21336.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.3797, pruned_loss=0.14, over 4302554.21 frames. 
], batch size: 548, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:21:53,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=156960.0, ans=0.0 2023-06-18 12:22:04,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=157020.0, ans=0.0 2023-06-18 12:23:24,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=157140.0, ans=0.0 2023-06-18 12:23:34,852 INFO [train.py:996] (2/4) Epoch 1, batch 26200, loss[loss=0.3214, simple_loss=0.3991, pruned_loss=0.1219, over 21817.00 frames. ], tot_loss[loss=0.3265, simple_loss=0.3793, pruned_loss=0.1369, over 4295325.95 frames. ], batch size: 282, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:23:35,397 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:23:35,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=157200.0, ans=0.2 2023-06-18 12:24:47,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-18 12:25:10,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=157380.0, ans=0.0 2023-06-18 12:25:21,218 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.203e+02 3.929e+02 5.143e+02 8.013e+02, threshold=7.858e+02, percent-clipped=1.0 2023-06-18 12:26:24,269 INFO [train.py:996] (2/4) Epoch 1, batch 26250, loss[loss=0.3148, simple_loss=0.3621, pruned_loss=0.1337, over 21491.00 frames. ], tot_loss[loss=0.3303, simple_loss=0.3868, pruned_loss=0.1369, over 4289332.46 frames. ], batch size: 211, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:26:26,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=157500.0, ans=0.125 2023-06-18 12:27:23,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-18 12:27:23,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=157620.0, ans=0.035 2023-06-18 12:27:26,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=157620.0, ans=0.0 2023-06-18 12:28:17,071 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:28:21,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=157740.0, ans=0.125 2023-06-18 12:28:24,551 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:28:44,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=157740.0, ans=0.0 2023-06-18 12:28:52,335 INFO [train.py:996] (2/4) Epoch 1, batch 26300, loss[loss=0.362, simple_loss=0.3905, pruned_loss=0.1668, over 21624.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.3816, pruned_loss=0.1369, over 4300738.93 frames. 
], batch size: 471, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:29:15,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=157800.0, ans=0.0 2023-06-18 12:29:48,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=157860.0, ans=0.125 2023-06-18 12:30:12,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.83 vs. limit=22.5 2023-06-18 12:30:46,901 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.240e+02 3.042e+02 3.851e+02 4.638e+02 8.463e+02, threshold=7.702e+02, percent-clipped=2.0 2023-06-18 12:31:30,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=158040.0, ans=0.07 2023-06-18 12:31:34,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=158040.0, ans=0.125 2023-06-18 12:31:47,948 INFO [train.py:996] (2/4) Epoch 1, batch 26350, loss[loss=0.3435, simple_loss=0.3944, pruned_loss=0.1463, over 21456.00 frames. ], tot_loss[loss=0.327, simple_loss=0.3797, pruned_loss=0.1372, over 4302341.68 frames. ], batch size: 131, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:31:51,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=158100.0, ans=0.125 2023-06-18 12:31:51,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-18 12:32:51,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=158220.0, ans=0.125 2023-06-18 12:34:13,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=158340.0, ans=0.125 2023-06-18 12:34:19,835 INFO [train.py:996] (2/4) Epoch 1, batch 26400, loss[loss=0.2915, simple_loss=0.3339, pruned_loss=0.1246, over 21677.00 frames. ], tot_loss[loss=0.3242, simple_loss=0.3739, pruned_loss=0.1373, over 4292500.64 frames. ], batch size: 333, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:34:36,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=158400.0, ans=0.125 2023-06-18 12:34:38,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.45 vs. 
limit=15.0 2023-06-18 12:34:49,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=158460.0, ans=0.0 2023-06-18 12:35:13,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=158460.0, ans=0.2 2023-06-18 12:35:15,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=158520.0, ans=0.125 2023-06-18 12:36:06,750 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.087e+02 3.915e+02 4.820e+02 6.818e+02, threshold=7.829e+02, percent-clipped=0.0 2023-06-18 12:36:08,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=158580.0, ans=0.0 2023-06-18 12:36:36,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.73 vs. limit=15.0 2023-06-18 12:36:54,687 INFO [train.py:996] (2/4) Epoch 1, batch 26450, loss[loss=0.346, simple_loss=0.4478, pruned_loss=0.1221, over 21185.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3739, pruned_loss=0.1374, over 4290834.01 frames. ], batch size: 549, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:37:45,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=158760.0, ans=0.125 2023-06-18 12:37:49,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-18 12:38:18,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=158820.0, ans=0.05 2023-06-18 12:38:36,667 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:38:53,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=158880.0, ans=0.0 2023-06-18 12:39:12,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=158880.0, ans=0.125 2023-06-18 12:39:39,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-18 12:40:01,471 INFO [train.py:996] (2/4) Epoch 1, batch 26500, loss[loss=0.2407, simple_loss=0.2912, pruned_loss=0.09512, over 21265.00 frames. ], tot_loss[loss=0.322, simple_loss=0.374, pruned_loss=0.135, over 4280291.17 frames. 
], batch size: 176, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:40:46,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159060.0, ans=0.1 2023-06-18 12:41:57,603 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.526e+02 4.481e+02 5.627e+02 1.219e+03, threshold=8.961e+02, percent-clipped=9.0 2023-06-18 12:42:30,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=159240.0, ans=0.2 2023-06-18 12:42:31,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=159240.0, ans=0.0 2023-06-18 12:42:32,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.47 vs. limit=15.0 2023-06-18 12:42:41,905 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:42:42,862 INFO [train.py:996] (2/4) Epoch 1, batch 26550, loss[loss=0.2234, simple_loss=0.2808, pruned_loss=0.083, over 21239.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3681, pruned_loss=0.1301, over 4266174.96 frames. ], batch size: 176, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:42:44,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=159300.0, ans=0.125 2023-06-18 12:44:05,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=159420.0, ans=0.0 2023-06-18 12:44:06,012 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-18 12:44:26,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.73 vs. limit=10.0 2023-06-18 12:44:29,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=159420.0, ans=0.0 2023-06-18 12:45:44,650 INFO [train.py:996] (2/4) Epoch 1, batch 26600, loss[loss=0.2948, simple_loss=0.3536, pruned_loss=0.118, over 21693.00 frames. ], tot_loss[loss=0.3104, simple_loss=0.368, pruned_loss=0.1264, over 4268699.34 frames. ], batch size: 282, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:46:11,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159600.0, ans=0.1 2023-06-18 12:47:42,435 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.824e+02 3.393e+02 4.003e+02 5.713e+02, threshold=6.786e+02, percent-clipped=0.0 2023-06-18 12:48:24,328 INFO [train.py:996] (2/4) Epoch 1, batch 26650, loss[loss=0.2641, simple_loss=0.3108, pruned_loss=0.1086, over 21569.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3608, pruned_loss=0.1248, over 4272647.06 frames. 
], batch size: 247, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:48:24,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=159900.0, ans=0.2 2023-06-18 12:48:36,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159900.0, ans=0.0 2023-06-18 12:49:30,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=160020.0, ans=0.0 2023-06-18 12:49:46,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.85 vs. limit=6.0 2023-06-18 12:49:48,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=160080.0, ans=0.125 2023-06-18 12:50:54,361 INFO [train.py:996] (2/4) Epoch 1, batch 26700, loss[loss=0.2002, simple_loss=0.2607, pruned_loss=0.0699, over 21145.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3512, pruned_loss=0.1191, over 4278350.45 frames. ], batch size: 176, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:50:57,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=160200.0, ans=0.2 2023-06-18 12:51:29,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=160260.0, ans=0.0 2023-06-18 12:51:43,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=160260.0, ans=0.125 2023-06-18 12:52:36,690 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.796e+02 3.328e+02 4.126e+02 7.819e+02, threshold=6.657e+02, percent-clipped=1.0 2023-06-18 12:53:19,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-18 12:53:22,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-18 12:53:35,173 INFO [train.py:996] (2/4) Epoch 1, batch 26750, loss[loss=0.3632, simple_loss=0.4133, pruned_loss=0.1565, over 21392.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.3511, pruned_loss=0.118, over 4286674.72 frames. 
], batch size: 131, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:53:56,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=160500.0, ans=0.2 2023-06-18 12:54:06,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=160560.0, ans=0.1 2023-06-18 12:55:28,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=160680.0, ans=0.125 2023-06-18 12:55:56,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=160740.0, ans=0.2 2023-06-18 12:56:10,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=160740.0, ans=0.0 2023-06-18 12:56:10,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160740.0, ans=0.125 2023-06-18 12:56:16,281 INFO [train.py:996] (2/4) Epoch 1, batch 26800, loss[loss=0.308, simple_loss=0.3603, pruned_loss=0.1278, over 21322.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3605, pruned_loss=0.1246, over 4286325.84 frames. ], batch size: 176, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 12:56:18,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=160800.0, ans=10.0 2023-06-18 12:56:39,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=160800.0, ans=0.0 2023-06-18 12:56:42,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-18 12:57:52,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-18 12:57:57,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=22.5 2023-06-18 12:58:01,658 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.459e+02 3.380e+02 3.968e+02 4.895e+02 8.004e+02, threshold=7.935e+02, percent-clipped=7.0 2023-06-18 12:58:17,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=161040.0, ans=0.125 2023-06-18 12:58:46,448 INFO [train.py:996] (2/4) Epoch 1, batch 26850, loss[loss=0.2951, simple_loss=0.3391, pruned_loss=0.1256, over 21810.00 frames. ], tot_loss[loss=0.31, simple_loss=0.3634, pruned_loss=0.1283, over 4283467.17 frames. ], batch size: 118, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 12:59:21,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=22.5 2023-06-18 13:00:41,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-18 13:01:17,653 INFO [train.py:996] (2/4) Epoch 1, batch 26900, loss[loss=0.2695, simple_loss=0.321, pruned_loss=0.109, over 21692.00 frames. ], tot_loss[loss=0.3027, simple_loss=0.3531, pruned_loss=0.1262, over 4273915.00 frames. 
], batch size: 124, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 13:01:21,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=161400.0, ans=0.07 2023-06-18 13:02:11,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-06-18 13:02:28,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=161520.0, ans=0.1 2023-06-18 13:02:33,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.42 vs. limit=22.5 2023-06-18 13:02:58,286 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 2.755e+02 3.329e+02 3.765e+02 8.557e+02, threshold=6.658e+02, percent-clipped=2.0 2023-06-18 13:03:37,563 INFO [train.py:996] (2/4) Epoch 1, batch 26950, loss[loss=0.3569, simple_loss=0.3941, pruned_loss=0.1598, over 20009.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3516, pruned_loss=0.1256, over 4273203.18 frames. ], batch size: 702, lr: 2.33e-02, grad_scale: 16.0 2023-06-18 13:04:42,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161820.0, ans=0.1 2023-06-18 13:05:58,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=161940.0, ans=0.1 2023-06-18 13:06:17,429 INFO [train.py:996] (2/4) Epoch 1, batch 27000, loss[loss=0.2606, simple_loss=0.3352, pruned_loss=0.09296, over 21674.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3526, pruned_loss=0.123, over 4276413.09 frames. ], batch size: 247, lr: 2.33e-02, grad_scale: 16.0 2023-06-18 13:06:17,430 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 13:06:56,176 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.2848, simple_loss=0.3741, pruned_loss=0.09774, over 1796401.00 frames. 2023-06-18 13:06:56,178 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 13:07:13,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=162060.0, ans=0.0 2023-06-18 13:08:26,182 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.842e+02 3.323e+02 4.144e+02 6.549e+02, threshold=6.646e+02, percent-clipped=0.0 2023-06-18 13:09:19,638 INFO [train.py:996] (2/4) Epoch 1, batch 27050, loss[loss=0.3646, simple_loss=0.4122, pruned_loss=0.1586, over 21564.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3561, pruned_loss=0.1201, over 4279263.93 frames. ], batch size: 471, lr: 2.33e-02, grad_scale: 16.0 2023-06-18 13:10:11,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=162360.0, ans=0.0 2023-06-18 13:10:15,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=162360.0, ans=0.0 2023-06-18 13:10:22,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=162360.0, ans=0.125 2023-06-18 13:11:14,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.73 vs. 
limit=5.0 2023-06-18 13:11:15,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=162480.0, ans=0.035 2023-06-18 13:11:23,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162540.0, ans=0.1 2023-06-18 13:11:34,184 INFO [train.py:996] (2/4) Epoch 1, batch 27100, loss[loss=0.2747, simple_loss=0.3617, pruned_loss=0.09388, over 21681.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3583, pruned_loss=0.1221, over 4277896.57 frames. ], batch size: 230, lr: 2.32e-02, grad_scale: 16.0 2023-06-18 13:12:04,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=162600.0, ans=0.125 2023-06-18 13:12:06,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=162600.0, ans=0.2 2023-06-18 13:13:45,528 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.912e+02 3.397e+02 4.397e+02 9.217e+02, threshold=6.794e+02, percent-clipped=5.0 2023-06-18 13:14:30,554 INFO [train.py:996] (2/4) Epoch 1, batch 27150, loss[loss=0.3726, simple_loss=0.4299, pruned_loss=0.1577, over 21767.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3695, pruned_loss=0.1256, over 4277015.46 frames. ], batch size: 351, lr: 2.32e-02, grad_scale: 16.0 2023-06-18 13:14:53,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=22.5 2023-06-18 13:14:53,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=162900.0, ans=0.125 2023-06-18 13:16:16,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=163020.0, ans=0.0 2023-06-18 13:16:19,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=163080.0, ans=0.125 2023-06-18 13:17:20,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=163140.0, ans=0.0 2023-06-18 13:17:33,479 INFO [train.py:996] (2/4) Epoch 1, batch 27200, loss[loss=0.3878, simple_loss=0.4253, pruned_loss=0.1752, over 21797.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3806, pruned_loss=0.1296, over 4273847.49 frames. ], batch size: 124, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 13:19:26,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=163380.0, ans=0.125 2023-06-18 13:19:30,609 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 3.450e+02 3.753e+02 4.440e+02 7.540e+02, threshold=7.506e+02, percent-clipped=2.0 2023-06-18 13:19:57,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=163440.0, ans=0.07 2023-06-18 13:20:13,113 INFO [train.py:996] (2/4) Epoch 1, batch 27250, loss[loss=0.3403, simple_loss=0.3853, pruned_loss=0.1477, over 21871.00 frames. ], tot_loss[loss=0.3289, simple_loss=0.385, pruned_loss=0.1364, over 4266972.18 frames. 
], batch size: 371, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 13:20:46,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=163560.0, ans=0.0 2023-06-18 13:22:01,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=163680.0, ans=0.125 2023-06-18 13:22:22,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=163740.0, ans=0.125 2023-06-18 13:23:10,377 INFO [train.py:996] (2/4) Epoch 1, batch 27300, loss[loss=0.343, simple_loss=0.4058, pruned_loss=0.1401, over 21313.00 frames. ], tot_loss[loss=0.3327, simple_loss=0.3878, pruned_loss=0.1388, over 4261606.02 frames. ], batch size: 549, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 13:25:12,796 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.700e+02 4.629e+02 5.941e+02 1.159e+03, threshold=9.258e+02, percent-clipped=7.0 2023-06-18 13:26:07,926 INFO [train.py:996] (2/4) Epoch 1, batch 27350, loss[loss=0.317, simple_loss=0.3698, pruned_loss=0.1321, over 21260.00 frames. ], tot_loss[loss=0.3332, simple_loss=0.389, pruned_loss=0.1387, over 4268441.34 frames. ], batch size: 143, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:26:44,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=164160.0, ans=0.125 2023-06-18 13:27:13,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=164220.0, ans=0.125 2023-06-18 13:27:49,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5 2023-06-18 13:28:02,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=164340.0, ans=0.125 2023-06-18 13:28:07,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=164340.0, ans=0.125 2023-06-18 13:28:34,423 INFO [train.py:996] (2/4) Epoch 1, batch 27400, loss[loss=0.303, simple_loss=0.346, pruned_loss=0.13, over 21338.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.3842, pruned_loss=0.1378, over 4266690.36 frames. ], batch size: 143, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:30:12,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164520.0, ans=0.1 2023-06-18 13:30:21,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=164580.0, ans=0.125 2023-06-18 13:30:34,864 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 2.849e+02 3.387e+02 3.962e+02 7.059e+02, threshold=6.774e+02, percent-clipped=0.0 2023-06-18 13:30:43,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-18 13:30:53,893 INFO [train.py:996] (2/4) Epoch 1, batch 27450, loss[loss=0.295, simple_loss=0.3658, pruned_loss=0.112, over 21621.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.3768, pruned_loss=0.1344, over 4272755.28 frames. 
], batch size: 247, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:31:32,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164760.0, ans=0.1 2023-06-18 13:32:27,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-18 13:32:38,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.38 vs. limit=15.0 2023-06-18 13:32:51,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=164880.0, ans=0.0 2023-06-18 13:33:34,191 INFO [train.py:996] (2/4) Epoch 1, batch 27500, loss[loss=0.2861, simple_loss=0.3373, pruned_loss=0.1175, over 21584.00 frames. ], tot_loss[loss=0.3235, simple_loss=0.3758, pruned_loss=0.1356, over 4282215.83 frames. ], batch size: 212, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:35:25,335 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.982e+02 3.278e+02 3.839e+02 6.377e+02, threshold=6.555e+02, percent-clipped=0.0 2023-06-18 13:36:05,222 INFO [train.py:996] (2/4) Epoch 1, batch 27550, loss[loss=0.2561, simple_loss=0.3155, pruned_loss=0.09834, over 21603.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3672, pruned_loss=0.13, over 4279545.89 frames. ], batch size: 263, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:37:57,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=165540.0, ans=0.05 2023-06-18 13:38:15,287 INFO [train.py:996] (2/4) Epoch 1, batch 27600, loss[loss=0.3334, simple_loss=0.3607, pruned_loss=0.1531, over 21265.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.36, pruned_loss=0.1287, over 4268766.78 frames. ], batch size: 471, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:38:15,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=165600.0, ans=0.125 2023-06-18 13:38:31,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=165600.0, ans=0.125 2023-06-18 13:39:04,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=165660.0, ans=0.125 2023-06-18 13:39:09,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=165660.0, ans=0.95 2023-06-18 13:39:49,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=165780.0, ans=0.2 2023-06-18 13:40:02,821 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 3.185e+02 3.559e+02 4.117e+02 6.813e+02, threshold=7.118e+02, percent-clipped=1.0 2023-06-18 13:40:24,893 INFO [train.py:996] (2/4) Epoch 1, batch 27650, loss[loss=0.3285, simple_loss=0.3704, pruned_loss=0.1433, over 21622.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3524, pruned_loss=0.1263, over 4276758.86 frames. ], batch size: 441, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:41:55,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-18 13:43:02,028 INFO [train.py:996] (2/4) Epoch 1, batch 27700, loss[loss=0.4354, simple_loss=0.461, pruned_loss=0.2049, over 21556.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3525, pruned_loss=0.1234, over 4284564.15 frames. ], batch size: 508, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:43:55,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=166260.0, ans=0.0 2023-06-18 13:44:22,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=166320.0, ans=0.2 2023-06-18 13:44:55,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166380.0, ans=0.1 2023-06-18 13:45:09,539 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.145e+02 3.672e+02 4.225e+02 7.825e+02, threshold=7.344e+02, percent-clipped=2.0 2023-06-18 13:45:14,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166380.0, ans=0.1 2023-06-18 13:45:42,143 INFO [train.py:996] (2/4) Epoch 1, batch 27750, loss[loss=0.2947, simple_loss=0.3547, pruned_loss=0.1173, over 21847.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3552, pruned_loss=0.1224, over 4281038.05 frames. ], batch size: 351, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:46:37,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=166560.0, ans=0.0 2023-06-18 13:48:15,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=166740.0, ans=0.0 2023-06-18 13:48:16,980 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=12.0 2023-06-18 13:48:23,260 INFO [train.py:996] (2/4) Epoch 1, batch 27800, loss[loss=0.2996, simple_loss=0.3524, pruned_loss=0.1234, over 21497.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3549, pruned_loss=0.1241, over 4290132.31 frames. ], batch size: 131, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:48:27,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=166800.0, ans=0.125 2023-06-18 13:48:44,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-18 13:49:09,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.66 vs. 
limit=15.0 2023-06-18 13:49:26,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=166920.0, ans=0.07 2023-06-18 13:49:54,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=166920.0, ans=22.5 2023-06-18 13:50:00,997 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:50:11,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 3.084e+02 3.621e+02 4.481e+02 7.199e+02, threshold=7.242e+02, percent-clipped=0.0 2023-06-18 13:50:24,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=167040.0, ans=0.0 2023-06-18 13:50:45,598 INFO [train.py:996] (2/4) Epoch 1, batch 27850, loss[loss=0.322, simple_loss=0.368, pruned_loss=0.138, over 21367.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3552, pruned_loss=0.126, over 4294254.77 frames. ], batch size: 159, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:52:13,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167220.0, ans=0.125 2023-06-18 13:52:30,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.51 vs. limit=10.0 2023-06-18 13:52:37,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-18 13:53:10,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=167340.0, ans=0.2 2023-06-18 13:53:15,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=167340.0, ans=0.125 2023-06-18 13:53:37,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=167340.0, ans=0.125 2023-06-18 13:53:43,402 INFO [train.py:996] (2/4) Epoch 1, batch 27900, loss[loss=0.3039, simple_loss=0.3807, pruned_loss=0.1135, over 21799.00 frames. ], tot_loss[loss=0.3122, simple_loss=0.3676, pruned_loss=0.1284, over 4297653.05 frames. ], batch size: 316, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:54:33,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167460.0, ans=0.1 2023-06-18 13:54:33,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=167460.0, ans=0.0 2023-06-18 13:54:34,640 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:55:23,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.36 vs. 
limit=15.0 2023-06-18 13:55:58,692 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 3.095e+02 3.904e+02 5.239e+02 9.245e+02, threshold=7.808e+02, percent-clipped=6.0 2023-06-18 13:56:08,283 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:56:23,310 INFO [train.py:996] (2/4) Epoch 1, batch 27950, loss[loss=0.2671, simple_loss=0.3638, pruned_loss=0.08514, over 20751.00 frames. ], tot_loss[loss=0.3028, simple_loss=0.3631, pruned_loss=0.1212, over 4292403.85 frames. ], batch size: 607, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:59:22,419 INFO [train.py:996] (2/4) Epoch 1, batch 28000, loss[loss=0.3435, simple_loss=0.3786, pruned_loss=0.1542, over 21560.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3618, pruned_loss=0.1189, over 4293697.17 frames. ], batch size: 548, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:59:52,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168060.0, ans=0.1 2023-06-18 14:00:13,582 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:00:13,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=168120.0, ans=0.0 2023-06-18 14:01:10,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 2.776e+02 3.408e+02 4.603e+02 6.959e+02, threshold=6.817e+02, percent-clipped=0.0 2023-06-18 14:01:34,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=168240.0, ans=22.5 2023-06-18 14:01:43,200 INFO [train.py:996] (2/4) Epoch 1, batch 28050, loss[loss=0.3015, simple_loss=0.3647, pruned_loss=0.1191, over 21692.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3595, pruned_loss=0.1213, over 4296393.08 frames. ], batch size: 389, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 14:02:01,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168300.0, ans=0.1 2023-06-18 14:02:04,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=168300.0, ans=0.125 2023-06-18 14:02:45,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.06 vs. limit=12.0 2023-06-18 14:02:47,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=168360.0, ans=0.125 2023-06-18 14:02:50,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=168360.0, ans=0.2 2023-06-18 14:02:52,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=168360.0, ans=0.125 2023-06-18 14:04:30,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=168600.0, ans=0.0 2023-06-18 14:04:31,082 INFO [train.py:996] (2/4) Epoch 1, batch 28100, loss[loss=0.2767, simple_loss=0.3264, pruned_loss=0.1134, over 21805.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3568, pruned_loss=0.1213, over 4296283.51 frames. 
], batch size: 124, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 14:04:35,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-18 14:04:35,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-06-18 14:05:31,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=168720.0, ans=0.2 2023-06-18 14:05:47,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168720.0, ans=0.1 2023-06-18 14:06:12,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.173e+02 3.819e+02 4.992e+02 1.067e+03, threshold=7.638e+02, percent-clipped=9.0 2023-06-18 14:06:13,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=168780.0, ans=0.0 2023-06-18 14:06:39,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=168840.0, ans=0.125 2023-06-18 14:06:58,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=168900.0, ans=0.125 2023-06-18 14:06:59,273 INFO [train.py:996] (2/4) Epoch 1, batch 28150, loss[loss=0.2545, simple_loss=0.2998, pruned_loss=0.1046, over 21602.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3509, pruned_loss=0.1214, over 4279089.41 frames. ], batch size: 247, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:06:59,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=168900.0, ans=0.125 2023-06-18 14:07:27,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=168960.0, ans=0.125 2023-06-18 14:08:01,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=169020.0, ans=0.0 2023-06-18 14:08:25,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=169080.0, ans=0.125 2023-06-18 14:09:29,049 INFO [train.py:996] (2/4) Epoch 1, batch 28200, loss[loss=0.3823, simple_loss=0.4867, pruned_loss=0.1389, over 19782.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.352, pruned_loss=0.1244, over 4277779.76 frames. ], batch size: 702, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:09:54,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=169200.0, ans=0.2 2023-06-18 14:09:59,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=169260.0, ans=0.2 2023-06-18 14:10:08,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169260.0, ans=0.1 2023-06-18 14:10:24,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. 
limit=15.0 2023-06-18 14:10:52,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=169320.0, ans=0.0 2023-06-18 14:11:38,005 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.366e+02 3.817e+02 4.750e+02 7.946e+02, threshold=7.633e+02, percent-clipped=3.0 2023-06-18 14:12:07,344 INFO [train.py:996] (2/4) Epoch 1, batch 28250, loss[loss=0.2712, simple_loss=0.3202, pruned_loss=0.1111, over 21205.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3573, pruned_loss=0.1283, over 4271997.89 frames. ], batch size: 159, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:12:13,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169500.0, ans=0.1 2023-06-18 14:12:21,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=169560.0, ans=0.2 2023-06-18 14:12:26,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=169560.0, ans=0.125 2023-06-18 14:12:53,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=169620.0, ans=0.125 2023-06-18 14:13:49,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=169680.0, ans=0.0 2023-06-18 14:14:17,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-18 14:14:23,361 INFO [train.py:996] (2/4) Epoch 1, batch 28300, loss[loss=0.2835, simple_loss=0.3635, pruned_loss=0.1018, over 21672.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3544, pruned_loss=0.1258, over 4273906.80 frames. ], batch size: 441, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:15:08,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=169860.0, ans=0.0 2023-06-18 14:15:35,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-18 14:16:01,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=169920.0, ans=0.2 2023-06-18 14:16:10,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=169920.0, ans=10.0 2023-06-18 14:16:42,455 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.780e+02 3.405e+02 4.225e+02 7.738e+02, threshold=6.811e+02, percent-clipped=1.0 2023-06-18 14:16:55,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-18 14:17:20,903 INFO [train.py:996] (2/4) Epoch 1, batch 28350, loss[loss=0.2129, simple_loss=0.2997, pruned_loss=0.06302, over 21350.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.349, pruned_loss=0.118, over 4277842.62 frames. 
], batch size: 211, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:17:24,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=170100.0, ans=10.0 2023-06-18 14:17:25,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=170100.0, ans=0.125 2023-06-18 14:17:37,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170160.0, ans=0.1 2023-06-18 14:18:19,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-18 14:19:19,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=170340.0, ans=0.2 2023-06-18 14:19:21,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=170340.0, ans=0.125 2023-06-18 14:19:32,445 INFO [train.py:996] (2/4) Epoch 1, batch 28400, loss[loss=0.3272, simple_loss=0.4218, pruned_loss=0.1163, over 19726.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3451, pruned_loss=0.1174, over 4274541.15 frames. ], batch size: 703, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:19:59,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=170400.0, ans=0.125 2023-06-18 14:20:01,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170400.0, ans=0.1 2023-06-18 14:20:01,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=170400.0, ans=0.125 2023-06-18 14:21:26,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-18 14:21:34,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.372e+02 4.044e+02 4.842e+02 8.605e+02, threshold=8.089e+02, percent-clipped=5.0 2023-06-18 14:22:12,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-18 14:22:19,298 INFO [train.py:996] (2/4) Epoch 1, batch 28450, loss[loss=0.3234, simple_loss=0.3823, pruned_loss=0.1323, over 21753.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3539, pruned_loss=0.1244, over 4272309.01 frames. 
], batch size: 112, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:22:42,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170700.0, ans=0.1 2023-06-18 14:22:59,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170760.0, ans=0.1 2023-06-18 14:23:32,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=170760.0, ans=0.5 2023-06-18 14:24:12,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=170880.0, ans=0.0 2023-06-18 14:24:13,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=170880.0, ans=0.125 2023-06-18 14:24:22,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-18 14:24:34,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170940.0, ans=0.1 2023-06-18 14:24:43,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=170940.0, ans=0.0 2023-06-18 14:24:45,895 INFO [train.py:996] (2/4) Epoch 1, batch 28500, loss[loss=0.3586, simple_loss=0.4032, pruned_loss=0.157, over 21248.00 frames. ], tot_loss[loss=0.3082, simple_loss=0.3582, pruned_loss=0.1291, over 4280587.34 frames. ], batch size: 143, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:25:21,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=171060.0, ans=0.125 2023-06-18 14:25:46,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=171060.0, ans=0.0 2023-06-18 14:25:49,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=171060.0, ans=0.2 2023-06-18 14:25:51,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=171060.0, ans=0.125 2023-06-18 14:26:29,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=171180.0, ans=0.1 2023-06-18 14:26:31,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=171180.0, ans=0.125 2023-06-18 14:26:37,756 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.096e+02 3.441e+02 4.526e+02 8.440e+02, threshold=6.881e+02, percent-clipped=1.0 2023-06-18 14:27:38,029 INFO [train.py:996] (2/4) Epoch 1, batch 28550, loss[loss=0.3183, simple_loss=0.3994, pruned_loss=0.1186, over 21724.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3664, pruned_loss=0.1325, over 4283452.58 frames. 
], batch size: 247, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:28:00,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171300.0, ans=0.1 2023-06-18 14:28:00,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=171300.0, ans=0.125 2023-06-18 14:28:43,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=171420.0, ans=0.0 2023-06-18 14:28:46,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=171420.0, ans=0.0 2023-06-18 14:29:59,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=171540.0, ans=0.2 2023-06-18 14:30:17,195 INFO [train.py:996] (2/4) Epoch 1, batch 28600, loss[loss=0.3121, simple_loss=0.3707, pruned_loss=0.1268, over 21374.00 frames. ], tot_loss[loss=0.3221, simple_loss=0.3742, pruned_loss=0.135, over 4283172.64 frames. ], batch size: 211, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:31:17,041 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:31:36,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=171720.0, ans=0.125 2023-06-18 14:32:15,155 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.104e+02 3.641e+02 4.688e+02 7.003e+02, threshold=7.282e+02, percent-clipped=1.0 2023-06-18 14:32:57,302 INFO [train.py:996] (2/4) Epoch 1, batch 28650, loss[loss=0.2803, simple_loss=0.3241, pruned_loss=0.1182, over 21857.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3664, pruned_loss=0.1331, over 4286134.02 frames. ], batch size: 98, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:33:07,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=171900.0, ans=0.125 2023-06-18 14:34:15,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-18 14:34:16,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.68 vs. limit=6.0 2023-06-18 14:34:43,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=172080.0, ans=0.0 2023-06-18 14:34:53,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-18 14:35:35,836 INFO [train.py:996] (2/4) Epoch 1, batch 28700, loss[loss=0.3114, simple_loss=0.3647, pruned_loss=0.129, over 21796.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3659, pruned_loss=0.1345, over 4288305.29 frames. ], batch size: 247, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:37:34,074 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.041e+02 3.554e+02 4.437e+02 9.213e+02, threshold=7.107e+02, percent-clipped=2.0 2023-06-18 14:38:08,591 INFO [train.py:996] (2/4) Epoch 1, batch 28750, loss[loss=0.3073, simple_loss=0.3558, pruned_loss=0.1294, over 21553.00 frames. 
], tot_loss[loss=0.317, simple_loss=0.3645, pruned_loss=0.1347, over 4289649.74 frames. ], batch size: 144, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:38:21,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=172500.0, ans=0.0 2023-06-18 14:39:24,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172620.0, ans=0.1 2023-06-18 14:39:24,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=172620.0, ans=0.2 2023-06-18 14:40:46,665 INFO [train.py:996] (2/4) Epoch 1, batch 28800, loss[loss=0.368, simple_loss=0.4086, pruned_loss=0.1637, over 21571.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.3689, pruned_loss=0.1351, over 4289370.51 frames. ], batch size: 414, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:41:10,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=172800.0, ans=0.0 2023-06-18 14:42:25,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-06-18 14:42:32,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=172980.0, ans=0.0 2023-06-18 14:42:41,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.872e+02 3.537e+02 4.377e+02 5.475e+02 1.000e+03, threshold=8.754e+02, percent-clipped=7.0 2023-06-18 14:43:28,262 INFO [train.py:996] (2/4) Epoch 1, batch 28850, loss[loss=0.291, simple_loss=0.3349, pruned_loss=0.1235, over 20917.00 frames. ], tot_loss[loss=0.3225, simple_loss=0.371, pruned_loss=0.137, over 4290028.63 frames. ], batch size: 607, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:46:14,985 INFO [train.py:996] (2/4) Epoch 1, batch 28900, loss[loss=0.3553, simple_loss=0.4066, pruned_loss=0.152, over 20884.00 frames. ], tot_loss[loss=0.3264, simple_loss=0.3745, pruned_loss=0.1392, over 4289786.65 frames. ], batch size: 608, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:46:57,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=173460.0, ans=0.05 2023-06-18 14:47:31,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.05 vs. limit=10.0 2023-06-18 14:48:09,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=173580.0, ans=0.125 2023-06-18 14:48:16,019 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.311e+02 3.872e+02 4.627e+02 1.015e+03, threshold=7.745e+02, percent-clipped=2.0 2023-06-18 14:49:06,885 INFO [train.py:996] (2/4) Epoch 1, batch 28950, loss[loss=0.2534, simple_loss=0.3092, pruned_loss=0.09878, over 21230.00 frames. ], tot_loss[loss=0.3217, simple_loss=0.372, pruned_loss=0.1357, over 4276100.45 frames. 
], batch size: 176, lr: 2.25e-02, grad_scale: 64.0 2023-06-18 14:49:55,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173760.0, ans=0.1 2023-06-18 14:50:28,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=173820.0, ans=0.2 2023-06-18 14:50:31,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=173820.0, ans=0.0 2023-06-18 14:50:53,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0 2023-06-18 14:51:45,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=173940.0, ans=0.125 2023-06-18 14:51:52,106 INFO [train.py:996] (2/4) Epoch 1, batch 29000, loss[loss=0.3628, simple_loss=0.3953, pruned_loss=0.1652, over 19951.00 frames. ], tot_loss[loss=0.3221, simple_loss=0.3758, pruned_loss=0.1342, over 4263594.95 frames. ], batch size: 703, lr: 2.25e-02, grad_scale: 64.0 2023-06-18 14:52:32,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=174060.0, ans=0.125 2023-06-18 14:52:33,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-18 14:53:19,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=174120.0, ans=0.125 2023-06-18 14:53:33,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=174180.0, ans=0.04949747468305833 2023-06-18 14:53:52,715 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 3.056e+02 3.518e+02 4.354e+02 7.767e+02, threshold=7.035e+02, percent-clipped=1.0 2023-06-18 14:54:20,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=174240.0, ans=0.1 2023-06-18 14:54:47,792 INFO [train.py:996] (2/4) Epoch 1, batch 29050, loss[loss=0.3188, simple_loss=0.3617, pruned_loss=0.1379, over 21565.00 frames. ], tot_loss[loss=0.324, simple_loss=0.3752, pruned_loss=0.1364, over 4275457.49 frames. ], batch size: 548, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 14:56:01,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=174420.0, ans=0.125 2023-06-18 14:56:50,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=174540.0, ans=0.125 2023-06-18 14:57:12,110 INFO [train.py:996] (2/4) Epoch 1, batch 29100, loss[loss=0.2532, simple_loss=0.3041, pruned_loss=0.1012, over 21684.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3632, pruned_loss=0.1316, over 4279918.05 frames. ], batch size: 316, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 14:57:33,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. 
limit=10.0 2023-06-18 14:58:22,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=174720.0, ans=0.125 2023-06-18 14:58:35,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=174720.0, ans=0.05 2023-06-18 14:59:03,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-06-18 14:59:07,693 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 3.175e+02 3.774e+02 4.756e+02 7.058e+02, threshold=7.548e+02, percent-clipped=1.0 2023-06-18 14:59:38,227 INFO [train.py:996] (2/4) Epoch 1, batch 29150, loss[loss=0.267, simple_loss=0.3387, pruned_loss=0.09768, over 21278.00 frames. ], tot_loss[loss=0.3096, simple_loss=0.3619, pruned_loss=0.1287, over 4275836.94 frames. ], batch size: 176, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 15:00:02,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-18 15:00:10,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=174900.0, ans=0.2 2023-06-18 15:01:06,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=175080.0, ans=0.2 2023-06-18 15:01:41,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=175140.0, ans=0.125 2023-06-18 15:02:04,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-18 15:02:04,912 INFO [train.py:996] (2/4) Epoch 1, batch 29200, loss[loss=0.2825, simple_loss=0.3195, pruned_loss=0.1228, over 21275.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3562, pruned_loss=0.1269, over 4275791.95 frames. ], batch size: 144, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:02:11,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-18 15:02:52,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=175260.0, ans=0.125 2023-06-18 15:03:59,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=175380.0, ans=0.125 2023-06-18 15:04:00,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.999e+02 3.311e+02 3.772e+02 7.580e+02, threshold=6.622e+02, percent-clipped=1.0 2023-06-18 15:04:00,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=175380.0, ans=0.125 2023-06-18 15:04:15,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=175440.0, ans=0.125 2023-06-18 15:04:39,751 INFO [train.py:996] (2/4) Epoch 1, batch 29250, loss[loss=0.2856, simple_loss=0.3651, pruned_loss=0.1031, over 21829.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.353, pruned_loss=0.1231, over 4266858.07 frames. 
], batch size: 317, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:05:39,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=175560.0, ans=0.2 2023-06-18 15:05:46,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.44 vs. limit=22.5 2023-06-18 15:06:07,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.81 vs. limit=12.0 2023-06-18 15:07:11,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-18 15:07:21,913 INFO [train.py:996] (2/4) Epoch 1, batch 29300, loss[loss=0.3016, simple_loss=0.376, pruned_loss=0.1137, over 21711.00 frames. ], tot_loss[loss=0.2991, simple_loss=0.3545, pruned_loss=0.1218, over 4269534.68 frames. ], batch size: 332, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:07:35,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=175800.0, ans=0.2 2023-06-18 15:07:39,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175800.0, ans=0.1 2023-06-18 15:09:14,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=175980.0, ans=0.0 2023-06-18 15:09:16,991 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.106e+02 3.713e+02 4.603e+02 7.072e+02, threshold=7.425e+02, percent-clipped=2.0 2023-06-18 15:09:31,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=176040.0, ans=0.125 2023-06-18 15:09:50,099 INFO [train.py:996] (2/4) Epoch 1, batch 29350, loss[loss=0.2928, simple_loss=0.3632, pruned_loss=0.1112, over 21637.00 frames. ], tot_loss[loss=0.298, simple_loss=0.3511, pruned_loss=0.1224, over 4270610.80 frames. ], batch size: 298, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:09:55,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=176100.0, ans=0.025 2023-06-18 15:10:24,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-18 15:10:32,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176160.0, ans=0.1 2023-06-18 15:10:45,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=176220.0, ans=0.2 2023-06-18 15:11:14,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=176220.0, ans=0.125 2023-06-18 15:12:23,753 INFO [train.py:996] (2/4) Epoch 1, batch 29400, loss[loss=0.2652, simple_loss=0.3483, pruned_loss=0.09106, over 21294.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3512, pruned_loss=0.1195, over 4275285.06 frames. 
], batch size: 551, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:12:30,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=176400.0, ans=0.0 2023-06-18 15:13:19,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=176520.0, ans=0.125 2023-06-18 15:13:47,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.15 vs. limit=22.5 2023-06-18 15:14:03,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-18 15:14:09,464 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.984e+02 3.574e+02 4.623e+02 7.193e+02, threshold=7.148e+02, percent-clipped=0.0 2023-06-18 15:15:00,992 INFO [train.py:996] (2/4) Epoch 1, batch 29450, loss[loss=0.3355, simple_loss=0.3874, pruned_loss=0.1418, over 21573.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3511, pruned_loss=0.1194, over 4271133.39 frames. ], batch size: 414, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:15:02,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=176700.0, ans=0.0 2023-06-18 15:16:25,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-18 15:16:56,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=176880.0, ans=0.07 2023-06-18 15:17:41,682 INFO [train.py:996] (2/4) Epoch 1, batch 29500, loss[loss=0.3165, simple_loss=0.3554, pruned_loss=0.1389, over 21267.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3574, pruned_loss=0.1249, over 4276069.55 frames. ], batch size: 159, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:19:12,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=177180.0, ans=0.125 2023-06-18 15:19:20,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=177180.0, ans=0.125 2023-06-18 15:19:28,515 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 3.232e+02 3.818e+02 4.731e+02 1.137e+03, threshold=7.636e+02, percent-clipped=8.0 2023-06-18 15:20:13,110 INFO [train.py:996] (2/4) Epoch 1, batch 29550, loss[loss=0.3187, simple_loss=0.3627, pruned_loss=0.1373, over 21886.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3567, pruned_loss=0.1266, over 4281927.57 frames. ], batch size: 414, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:20:53,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=177360.0, ans=0.125 2023-06-18 15:22:47,024 INFO [train.py:996] (2/4) Epoch 1, batch 29600, loss[loss=0.3346, simple_loss=0.4017, pruned_loss=0.1337, over 21840.00 frames. ], tot_loss[loss=0.3105, simple_loss=0.3629, pruned_loss=0.1291, over 4282218.33 frames. 
], batch size: 316, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:23:08,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=177600.0, ans=0.0 2023-06-18 15:23:55,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.09 vs. limit=15.0 2023-06-18 15:24:31,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.92 vs. limit=15.0 2023-06-18 15:24:54,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.992e+02 3.426e+02 4.261e+02 6.477e+02, threshold=6.851e+02, percent-clipped=0.0 2023-06-18 15:24:56,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=177840.0, ans=0.04949747468305833 2023-06-18 15:24:57,345 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-18 15:25:26,168 INFO [train.py:996] (2/4) Epoch 1, batch 29650, loss[loss=0.2649, simple_loss=0.3221, pruned_loss=0.1039, over 21853.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3608, pruned_loss=0.125, over 4283136.10 frames. ], batch size: 351, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:26:01,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-18 15:26:44,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=178020.0, ans=0.125 2023-06-18 15:27:05,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-18 15:28:08,842 INFO [train.py:996] (2/4) Epoch 1, batch 29700, loss[loss=0.2276, simple_loss=0.3001, pruned_loss=0.07754, over 21700.00 frames. ], tot_loss[loss=0.3073, simple_loss=0.3622, pruned_loss=0.1262, over 4288991.71 frames. ], batch size: 298, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:28:10,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=178200.0, ans=0.0 2023-06-18 15:29:00,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=178260.0, ans=0.2 2023-06-18 15:30:08,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=178380.0, ans=0.1 2023-06-18 15:30:14,474 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.216e+02 3.981e+02 5.488e+02 8.524e+02, threshold=7.962e+02, percent-clipped=7.0 2023-06-18 15:30:31,206 INFO [train.py:996] (2/4) Epoch 1, batch 29750, loss[loss=0.2622, simple_loss=0.3223, pruned_loss=0.1011, over 21463.00 frames. ], tot_loss[loss=0.3082, simple_loss=0.3659, pruned_loss=0.1252, over 4285915.07 frames. 
], batch size: 131, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:31:14,737 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:32:15,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=178680.0, ans=0.125 2023-06-18 15:32:43,825 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:32:51,685 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2023-06-18 15:33:17,750 INFO [train.py:996] (2/4) Epoch 1, batch 29800, loss[loss=0.3093, simple_loss=0.3616, pruned_loss=0.1285, over 21752.00 frames. ], tot_loss[loss=0.3104, simple_loss=0.368, pruned_loss=0.1264, over 4290150.97 frames. ], batch size: 112, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:34:09,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=178860.0, ans=0.2 2023-06-18 15:34:47,068 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:35:01,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=178980.0, ans=0.0 2023-06-18 15:35:23,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 3.012e+02 3.384e+02 3.968e+02 6.046e+02, threshold=6.767e+02, percent-clipped=0.0 2023-06-18 15:35:33,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=179040.0, ans=0.125 2023-06-18 15:35:46,774 INFO [train.py:996] (2/4) Epoch 1, batch 29850, loss[loss=0.2566, simple_loss=0.3225, pruned_loss=0.0953, over 21796.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3635, pruned_loss=0.1236, over 4291362.76 frames. ], batch size: 282, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:36:55,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=179220.0, ans=0.0 2023-06-18 15:38:20,889 INFO [train.py:996] (2/4) Epoch 1, batch 29900, loss[loss=0.3406, simple_loss=0.3794, pruned_loss=0.1509, over 21732.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3629, pruned_loss=0.1261, over 4294584.04 frames. ], batch size: 351, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:38:46,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. 
limit=15.0 2023-06-18 15:39:31,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=179520.0, ans=0.0 2023-06-18 15:39:39,805 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:39:41,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=179520.0, ans=10.0 2023-06-18 15:40:07,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=179580.0, ans=0.05 2023-06-18 15:40:12,442 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.008e+02 3.612e+02 4.366e+02 7.864e+02, threshold=7.225e+02, percent-clipped=1.0 2023-06-18 15:40:49,207 INFO [train.py:996] (2/4) Epoch 1, batch 29950, loss[loss=0.3409, simple_loss=0.3856, pruned_loss=0.1481, over 21735.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3667, pruned_loss=0.1301, over 4289789.07 frames. ], batch size: 298, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:41:42,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=179760.0, ans=0.125 2023-06-18 15:42:11,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=179820.0, ans=0.2 2023-06-18 15:42:45,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-18 15:43:19,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=179940.0, ans=0.2 2023-06-18 15:43:22,198 INFO [train.py:996] (2/4) Epoch 1, batch 30000, loss[loss=0.2957, simple_loss=0.3792, pruned_loss=0.1061, over 21662.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3693, pruned_loss=0.1308, over 4285313.57 frames. ], batch size: 389, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:43:22,199 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 15:44:17,110 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.2715, simple_loss=0.3724, pruned_loss=0.08526, over 1796401.00 frames. 2023-06-18 15:44:17,110 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 15:44:57,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=180060.0, ans=0.0 2023-06-18 15:45:06,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-18 15:45:09,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=180120.0, ans=0.125 2023-06-18 15:46:21,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.874e+02 3.529e+02 4.809e+02 8.528e+02, threshold=7.059e+02, percent-clipped=3.0 2023-06-18 15:46:33,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. limit=6.0 2023-06-18 15:46:49,221 INFO [train.py:996] (2/4) Epoch 1, batch 30050, loss[loss=0.3608, simple_loss=0.4653, pruned_loss=0.1282, over 21211.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3713, pruned_loss=0.1255, over 4268422.78 frames. 
], batch size: 549, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:47:08,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=180300.0, ans=0.0 2023-06-18 15:47:27,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-18 15:47:29,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=180360.0, ans=0.2 2023-06-18 15:48:18,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=180480.0, ans=0.125 2023-06-18 15:49:30,835 INFO [train.py:996] (2/4) Epoch 1, batch 30100, loss[loss=0.3174, simple_loss=0.3465, pruned_loss=0.1441, over 21513.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3719, pruned_loss=0.1264, over 4264822.65 frames. ], batch size: 414, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:50:23,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=180720.0, ans=0.0 2023-06-18 15:50:38,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=180720.0, ans=0.125 2023-06-18 15:51:05,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=180780.0, ans=0.125 2023-06-18 15:51:12,531 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 3.465e+02 4.064e+02 4.791e+02 7.822e+02, threshold=8.128e+02, percent-clipped=3.0 2023-06-18 15:51:56,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=180840.0, ans=0.2 2023-06-18 15:51:59,481 INFO [train.py:996] (2/4) Epoch 1, batch 30150, loss[loss=0.3284, simple_loss=0.3703, pruned_loss=0.1433, over 21567.00 frames. ], tot_loss[loss=0.3117, simple_loss=0.3666, pruned_loss=0.1284, over 4263662.98 frames. ], batch size: 230, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:53:12,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=181020.0, ans=0.125 2023-06-18 15:53:12,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=181020.0, ans=0.0 2023-06-18 15:54:17,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.16 vs. limit=15.0 2023-06-18 15:54:27,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=181140.0, ans=0.125 2023-06-18 15:54:31,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=181200.0, ans=0.0 2023-06-18 15:54:32,748 INFO [train.py:996] (2/4) Epoch 1, batch 30200, loss[loss=0.3233, simple_loss=0.3868, pruned_loss=0.1299, over 21184.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3685, pruned_loss=0.1265, over 4267010.76 frames. 
], batch size: 143, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:55:07,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=181200.0, ans=0.125 2023-06-18 15:56:44,951 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 2.883e+02 3.649e+02 4.824e+02 8.627e+02, threshold=7.297e+02, percent-clipped=2.0 2023-06-18 15:57:14,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=181440.0, ans=0.125 2023-06-18 15:57:37,559 INFO [train.py:996] (2/4) Epoch 1, batch 30250, loss[loss=0.4105, simple_loss=0.487, pruned_loss=0.167, over 21648.00 frames. ], tot_loss[loss=0.3184, simple_loss=0.3778, pruned_loss=0.1295, over 4272781.53 frames. ], batch size: 414, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:57:39,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=181500.0, ans=0.125 2023-06-18 15:58:17,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=181560.0, ans=0.125 2023-06-18 15:58:58,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=181620.0, ans=0.125 2023-06-18 15:59:42,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=181740.0, ans=0.125 2023-06-18 16:00:04,190 INFO [train.py:996] (2/4) Epoch 1, batch 30300, loss[loss=0.2823, simple_loss=0.3272, pruned_loss=0.1187, over 21806.00 frames. ], tot_loss[loss=0.3148, simple_loss=0.3726, pruned_loss=0.1285, over 4277605.11 frames. ], batch size: 98, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 16:00:17,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-18 16:00:33,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=181860.0, ans=6.0 2023-06-18 16:02:11,991 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.225e+02 4.059e+02 5.033e+02 9.732e+02, threshold=8.118e+02, percent-clipped=4.0 2023-06-18 16:02:58,931 INFO [train.py:996] (2/4) Epoch 1, batch 30350, loss[loss=0.4394, simple_loss=0.4654, pruned_loss=0.2067, over 21514.00 frames. ], tot_loss[loss=0.3183, simple_loss=0.3749, pruned_loss=0.1308, over 4270045.26 frames. ], batch size: 509, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 16:04:02,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=182220.0, ans=0.125 2023-06-18 16:04:09,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=182220.0, ans=0.125 2023-06-18 16:04:31,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=182220.0, ans=0.09899494936611666 2023-06-18 16:05:31,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.77 vs. limit=10.0 2023-06-18 16:05:57,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.23 vs. 
limit=15.0 2023-06-18 16:05:57,808 INFO [train.py:996] (2/4) Epoch 1, batch 30400, loss[loss=0.3083, simple_loss=0.3181, pruned_loss=0.1493, over 20352.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3664, pruned_loss=0.1277, over 4261824.23 frames. ], batch size: 703, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 16:06:10,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.74 vs. limit=15.0 2023-06-18 16:06:32,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=182400.0, ans=0.1 2023-06-18 16:07:00,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-18 16:08:37,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=182520.0, ans=0.125 2023-06-18 16:09:14,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=182580.0, ans=0.2 2023-06-18 16:09:50,257 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.733e+02 4.742e+02 5.817e+02 2.279e+03, threshold=9.485e+02, percent-clipped=8.0 2023-06-18 16:10:31,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=182640.0, ans=0.2 2023-06-18 16:11:10,019 INFO [train.py:996] (2/4) Epoch 1, batch 30450, loss[loss=0.4066, simple_loss=0.4978, pruned_loss=0.1577, over 19917.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3705, pruned_loss=0.1308, over 4202760.99 frames. ], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 16:11:29,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=182700.0, ans=0.125 2023-06-18 16:11:45,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=182700.0, ans=0.125 2023-06-18 16:12:04,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=182760.0, ans=0.125 2023-06-18 16:12:11,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=182760.0, ans=0.1 2023-06-18 16:12:55,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=182760.0, ans=0.0 2023-06-18 16:13:50,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=182820.0, ans=0.0 2023-06-18 16:13:51,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=182820.0, ans=0.125 2023-06-18 16:17:54,935 INFO [train.py:996] (2/4) Epoch 2, batch 0, loss[loss=0.3843, simple_loss=0.4043, pruned_loss=0.1821, over 21770.00 frames. ], tot_loss[loss=0.3843, simple_loss=0.4043, pruned_loss=0.1821, over 21770.00 frames. ], batch size: 102, lr: 2.01e-02, grad_scale: 32.0 2023-06-18 16:17:54,935 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 16:18:53,132 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2985, simple_loss=0.394, pruned_loss=0.1016, over 1796401.00 frames. 
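
The recurring [optim.py:471] "Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=..." entries in this log summarise recent gradient norms for the optimizer's clipping logic: the five numbers read as the min / 25% / median / 75% / max of the norms collected since the previous report, the logged threshold matches clipping_scale times the median (2.0 * 3.426e+02 ~= 6.851e+02 in the first such entry of this section), and percent-clipped is evidently the share of recent batches whose norm exceeded that threshold. The snippet below is only a minimal sketch of that bookkeeping under this reading of the log fields; the function name, the synthetic norms, and the exact quantile convention are illustrative assumptions, not the icefall optim.py implementation.

    # Hedged sketch (not the icefall source): reproduces the arithmetic behind the
    # "Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=..."
    # log lines above. Names (clipping_summary, grad_norm_history) are illustrative
    # assumptions, not identifiers from the real optim.py.
    import numpy as np

    def clipping_summary(grad_norm_history, clipping_scale=2.0):
        """Summarise recent gradient norms the way the log entries report them.

        grad_norm_history: 1-D sequence of total gradient norms from recent batches.
        Returns (quartiles, threshold, percent_clipped).
        """
        norms = np.asarray(grad_norm_history, dtype=np.float64)
        # min / 25% / median / 75% / max, matching the five logged numbers.
        quartiles = np.quantile(norms, [0.0, 0.25, 0.5, 0.75, 1.0])
        # The logged threshold equals clipping_scale times the median grad-norm
        # (e.g. 2.0 * 3.426e+02 ~= 6.851e+02 in the first entry of this section).
        threshold = clipping_scale * quartiles[2]
        # Percentage of recent batches whose grad-norm exceeded that threshold.
        percent_clipped = 100.0 * float((norms > threshold).mean())
        return quartiles, threshold, percent_clipped

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        fake_norms = rng.gamma(shape=4.0, scale=90.0, size=128)  # synthetic stand-in
        q, thr, pct = clipping_summary(fake_norms)
        print("grad-norm quartiles", " ".join(f"{v:.3e}" for v in q),
              f"threshold={thr:.3e}, percent-clipped={pct:.1f}")
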
2023-06-18 16:18:53,133 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 16:19:08,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-18 16:19:49,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.71 vs. limit=15.0 2023-06-18 16:19:57,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=183150.0, ans=0.0 2023-06-18 16:19:57,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=183150.0, ans=0.125 2023-06-18 16:19:58,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=183150.0, ans=0.0 2023-06-18 16:20:34,006 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.662e+02 5.243e+02 7.950e+02 2.244e+03, threshold=1.049e+03, percent-clipped=17.0 2023-06-18 16:20:34,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=183210.0, ans=0.125 2023-06-18 16:20:43,983 INFO [train.py:996] (2/4) Epoch 2, batch 50, loss[loss=0.2724, simple_loss=0.3435, pruned_loss=0.1007, over 21235.00 frames. ], tot_loss[loss=0.3098, simple_loss=0.3637, pruned_loss=0.128, over 952371.44 frames. ], batch size: 176, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:22:05,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=183390.0, ans=0.2 2023-06-18 16:22:07,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=183390.0, ans=0.125 2023-06-18 16:22:08,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-18 16:22:53,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=183510.0, ans=0.0 2023-06-18 16:22:55,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.61 vs. limit=10.0 2023-06-18 16:22:56,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-18 16:23:20,717 INFO [train.py:996] (2/4) Epoch 2, batch 100, loss[loss=0.3293, simple_loss=0.421, pruned_loss=0.1188, over 21742.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.3855, pruned_loss=0.1276, over 1690134.81 frames. 
], batch size: 332, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:23:25,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=183570.0, ans=0.125 2023-06-18 16:25:03,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=183750.0, ans=0.0 2023-06-18 16:25:24,687 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.827e+02 3.490e+02 4.158e+02 9.308e+02, threshold=6.980e+02, percent-clipped=0.0 2023-06-18 16:25:29,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=183810.0, ans=0.125 2023-06-18 16:25:44,218 INFO [train.py:996] (2/4) Epoch 2, batch 150, loss[loss=0.3133, simple_loss=0.3781, pruned_loss=0.1242, over 21470.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.3864, pruned_loss=0.1286, over 2249611.81 frames. ], batch size: 131, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:27:21,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-18 16:27:26,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=184050.0, ans=0.125 2023-06-18 16:28:08,797 INFO [train.py:996] (2/4) Epoch 2, batch 200, loss[loss=0.3221, simple_loss=0.3707, pruned_loss=0.1367, over 21866.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3824, pruned_loss=0.1267, over 2697163.76 frames. ], batch size: 371, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:28:27,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-18 16:28:28,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=184170.0, ans=0.125 2023-06-18 16:29:01,728 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:30:03,483 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.883e+02 3.671e+02 4.524e+02 7.455e+02, threshold=7.342e+02, percent-clipped=3.0 2023-06-18 16:30:03,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184410.0, ans=0.1 2023-06-18 16:30:29,153 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:30:31,430 INFO [train.py:996] (2/4) Epoch 2, batch 250, loss[loss=0.2875, simple_loss=0.3422, pruned_loss=0.1164, over 21748.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3767, pruned_loss=0.126, over 3053041.77 frames. 
], batch size: 371, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:30:34,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184470.0, ans=0.1 2023-06-18 16:30:36,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=184470.0, ans=0.125 2023-06-18 16:31:24,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=184590.0, ans=0.0 2023-06-18 16:32:43,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=184710.0, ans=0.125 2023-06-18 16:33:00,017 INFO [train.py:996] (2/4) Epoch 2, batch 300, loss[loss=0.2786, simple_loss=0.3353, pruned_loss=0.111, over 21341.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3727, pruned_loss=0.1269, over 3322958.65 frames. ], batch size: 159, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:33:24,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=184770.0, ans=0.125 2023-06-18 16:33:26,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.08 vs. limit=6.0 2023-06-18 16:35:16,557 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.885e+02 3.453e+02 4.241e+02 7.673e+02, threshold=6.906e+02, percent-clipped=1.0 2023-06-18 16:35:39,462 INFO [train.py:996] (2/4) Epoch 2, batch 350, loss[loss=0.3204, simple_loss=0.4022, pruned_loss=0.1193, over 21334.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.3662, pruned_loss=0.1237, over 3530180.23 frames. ], batch size: 548, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:36:25,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=185130.0, ans=0.125 2023-06-18 16:37:02,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=185190.0, ans=0.0 2023-06-18 16:37:05,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=185190.0, ans=0.1 2023-06-18 16:37:07,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=185190.0, ans=0.125 2023-06-18 16:37:50,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.67 vs. limit=22.5 2023-06-18 16:37:52,071 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:38:03,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=185310.0, ans=0.125 2023-06-18 16:38:05,896 INFO [train.py:996] (2/4) Epoch 2, batch 400, loss[loss=0.3289, simple_loss=0.38, pruned_loss=0.1389, over 21355.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3581, pruned_loss=0.1221, over 3694127.01 frames. ], batch size: 471, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:39:12,074 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=23.51 vs. 
limit=15.0 2023-06-18 16:40:04,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=185550.0, ans=0.0 2023-06-18 16:40:25,133 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.835e+02 3.626e+02 4.923e+02 9.458e+02, threshold=7.251e+02, percent-clipped=2.0 2023-06-18 16:40:42,461 INFO [train.py:996] (2/4) Epoch 2, batch 450, loss[loss=0.3855, simple_loss=0.447, pruned_loss=0.162, over 21535.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3537, pruned_loss=0.1205, over 3822890.59 frames. ], batch size: 508, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:41:26,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=185730.0, ans=0.025 2023-06-18 16:42:44,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-18 16:42:48,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=185910.0, ans=0.1 2023-06-18 16:43:08,790 INFO [train.py:996] (2/4) Epoch 2, batch 500, loss[loss=0.3067, simple_loss=0.3911, pruned_loss=0.1112, over 21756.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3558, pruned_loss=0.119, over 3920358.05 frames. ], batch size: 298, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:43:51,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=186030.0, ans=0.1 2023-06-18 16:45:18,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.995e+02 3.727e+02 5.025e+02 8.561e+02, threshold=7.454e+02, percent-clipped=5.0 2023-06-18 16:45:43,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=186210.0, ans=0.125 2023-06-18 16:45:44,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=186270.0, ans=0.125 2023-06-18 16:45:49,756 INFO [train.py:996] (2/4) Epoch 2, batch 550, loss[loss=0.306, simple_loss=0.3556, pruned_loss=0.1282, over 21881.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3573, pruned_loss=0.1187, over 4002250.79 frames. ], batch size: 351, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:46:11,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.77 vs. limit=15.0 2023-06-18 16:46:16,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=186330.0, ans=15.0 2023-06-18 16:46:38,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=186390.0, ans=0.2 2023-06-18 16:48:05,027 INFO [train.py:996] (2/4) Epoch 2, batch 600, loss[loss=0.3713, simple_loss=0.4491, pruned_loss=0.1467, over 21523.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3608, pruned_loss=0.1195, over 4064836.92 frames. 
], batch size: 471, lr: 1.99e-02, grad_scale: 64.0 2023-06-18 16:49:07,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=186630.0, ans=0.2 2023-06-18 16:49:17,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=186690.0, ans=0.05 2023-06-18 16:49:53,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=186750.0, ans=0.0 2023-06-18 16:50:13,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=186810.0, ans=0.125 2023-06-18 16:50:17,217 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.275e+02 3.817e+02 5.405e+02 1.141e+03, threshold=7.634e+02, percent-clipped=7.0 2023-06-18 16:50:26,072 INFO [train.py:996] (2/4) Epoch 2, batch 650, loss[loss=0.2776, simple_loss=0.3287, pruned_loss=0.1132, over 21223.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.3629, pruned_loss=0.1202, over 4119671.47 frames. ], batch size: 143, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:50:29,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=186870.0, ans=0.125 2023-06-18 16:50:50,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=186870.0, ans=0.0 2023-06-18 16:51:08,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=186930.0, ans=0.125 2023-06-18 16:51:15,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=186930.0, ans=0.1 2023-06-18 16:51:42,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=186990.0, ans=0.125 2023-06-18 16:51:56,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=186990.0, ans=0.0 2023-06-18 16:52:01,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=187050.0, ans=0.0 2023-06-18 16:52:38,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=187110.0, ans=0.125 2023-06-18 16:53:00,525 INFO [train.py:996] (2/4) Epoch 2, batch 700, loss[loss=0.3673, simple_loss=0.4323, pruned_loss=0.1512, over 21689.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3637, pruned_loss=0.1204, over 4152293.03 frames. ], batch size: 389, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:54:36,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=187350.0, ans=0.0 2023-06-18 16:55:08,915 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 3.018e+02 3.846e+02 4.638e+02 8.103e+02, threshold=7.692e+02, percent-clipped=2.0 2023-06-18 16:55:22,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=187470.0, ans=0.125 2023-06-18 16:55:22,984 INFO [train.py:996] (2/4) Epoch 2, batch 750, loss[loss=0.3089, simple_loss=0.424, pruned_loss=0.09697, over 20786.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3648, pruned_loss=0.1226, over 4184551.81 frames. 
], batch size: 607, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:56:09,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=187530.0, ans=0.1 2023-06-18 16:56:14,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-18 16:57:31,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=187710.0, ans=0.125 2023-06-18 16:57:32,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=187710.0, ans=0.0 2023-06-18 16:57:49,836 INFO [train.py:996] (2/4) Epoch 2, batch 800, loss[loss=0.2696, simple_loss=0.3271, pruned_loss=0.1061, over 21592.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3632, pruned_loss=0.123, over 4201336.53 frames. ], batch size: 263, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:57:53,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-18 16:58:42,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.53 vs. limit=12.0 2023-06-18 16:58:49,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=187890.0, ans=0.0 2023-06-18 16:59:24,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.112e+02 3.737e+02 4.664e+02 8.129e+02, threshold=7.474e+02, percent-clipped=2.0 2023-06-18 16:59:45,759 INFO [train.py:996] (2/4) Epoch 2, batch 850, loss[loss=0.316, simple_loss=0.3456, pruned_loss=0.1432, over 21572.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3596, pruned_loss=0.1222, over 4221976.24 frames. ], batch size: 441, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:59:48,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-18 16:59:49,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=188070.0, ans=0.125 2023-06-18 16:59:53,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188070.0, ans=0.1 2023-06-18 17:00:42,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=188190.0, ans=0.125 2023-06-18 17:00:59,921 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:01:08,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=188250.0, ans=0.1 2023-06-18 17:01:33,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=188310.0, ans=0.125 2023-06-18 17:01:33,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=188310.0, ans=0.125 2023-06-18 17:02:00,150 INFO [train.py:996] (2/4) Epoch 2, batch 900, loss[loss=0.336, simple_loss=0.3772, pruned_loss=0.1474, over 21730.00 frames. 
], tot_loss[loss=0.2996, simple_loss=0.356, pruned_loss=0.1216, over 4242110.58 frames. ], batch size: 389, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 17:02:05,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=12.0 2023-06-18 17:02:10,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=188370.0, ans=0.125 2023-06-18 17:02:55,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-18 17:02:57,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=188490.0, ans=0.125 2023-06-18 17:03:04,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=188490.0, ans=0.2 2023-06-18 17:03:44,625 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.912e+02 3.443e+02 3.950e+02 7.749e+02, threshold=6.886e+02, percent-clipped=2.0 2023-06-18 17:03:56,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=188610.0, ans=0.125 2023-06-18 17:04:02,593 INFO [train.py:996] (2/4) Epoch 2, batch 950, loss[loss=0.2252, simple_loss=0.2949, pruned_loss=0.07777, over 21219.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3519, pruned_loss=0.1201, over 4253398.64 frames. ], batch size: 159, lr: 1.98e-02, grad_scale: 16.0 2023-06-18 17:04:05,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=188670.0, ans=0.0 2023-06-18 17:04:08,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=188670.0, ans=0.04949747468305833 2023-06-18 17:05:57,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=188910.0, ans=0.125 2023-06-18 17:06:18,724 INFO [train.py:996] (2/4) Epoch 2, batch 1000, loss[loss=0.3091, simple_loss=0.3658, pruned_loss=0.1262, over 21787.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3525, pruned_loss=0.1201, over 4269660.46 frames. ], batch size: 124, lr: 1.98e-02, grad_scale: 16.0 2023-06-18 17:08:17,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189210.0, ans=0.1 2023-06-18 17:08:20,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.046e+02 3.757e+02 4.775e+02 8.015e+02, threshold=7.515e+02, percent-clipped=4.0 2023-06-18 17:08:27,771 INFO [train.py:996] (2/4) Epoch 2, batch 1050, loss[loss=0.3071, simple_loss=0.3667, pruned_loss=0.1237, over 21545.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3541, pruned_loss=0.1214, over 4274208.24 frames. 
], batch size: 471, lr: 1.97e-02, grad_scale: 16.0 2023-06-18 17:08:43,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=189270.0, ans=0.125 2023-06-18 17:09:04,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=189330.0, ans=0.125 2023-06-18 17:09:50,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-18 17:10:05,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-18 17:10:14,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=189510.0, ans=0.0 2023-06-18 17:10:44,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=189510.0, ans=0.0 2023-06-18 17:10:46,340 INFO [train.py:996] (2/4) Epoch 2, batch 1100, loss[loss=0.2668, simple_loss=0.2985, pruned_loss=0.1175, over 21213.00 frames. ], tot_loss[loss=0.2975, simple_loss=0.3536, pruned_loss=0.1207, over 4274894.86 frames. ], batch size: 548, lr: 1.97e-02, grad_scale: 16.0 2023-06-18 17:11:05,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=189570.0, ans=0.125 2023-06-18 17:12:30,855 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.384e+02 3.965e+02 4.947e+02 9.506e+02, threshold=7.929e+02, percent-clipped=4.0 2023-06-18 17:12:31,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=189810.0, ans=0.0 2023-06-18 17:12:48,332 INFO [train.py:996] (2/4) Epoch 2, batch 1150, loss[loss=0.2366, simple_loss=0.2706, pruned_loss=0.1013, over 16315.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.354, pruned_loss=0.1198, over 4278029.23 frames. ], batch size: 60, lr: 1.97e-02, grad_scale: 16.0 2023-06-18 17:13:32,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=189930.0, ans=0.125 2023-06-18 17:13:36,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=189930.0, ans=0.0 2023-06-18 17:13:39,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.48 vs. limit=6.0 2023-06-18 17:13:42,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2023-06-18 17:13:46,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=189990.0, ans=0.035 2023-06-18 17:14:49,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=190110.0, ans=0.125 2023-06-18 17:15:15,717 INFO [train.py:996] (2/4) Epoch 2, batch 1200, loss[loss=0.2941, simple_loss=0.3574, pruned_loss=0.1154, over 21530.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3527, pruned_loss=0.1186, over 4278027.83 frames. 
], batch size: 230, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:16:02,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=190290.0, ans=0.125 2023-06-18 17:17:04,008 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.111e+02 4.019e+02 4.816e+02 7.916e+02, threshold=8.038e+02, percent-clipped=0.0 2023-06-18 17:17:25,287 INFO [train.py:996] (2/4) Epoch 2, batch 1250, loss[loss=0.3117, simple_loss=0.3595, pruned_loss=0.1319, over 21958.00 frames. ], tot_loss[loss=0.298, simple_loss=0.3554, pruned_loss=0.1203, over 4284124.00 frames. ], batch size: 316, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:17:49,391 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:18:06,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=190530.0, ans=0.125 2023-06-18 17:19:41,851 INFO [train.py:996] (2/4) Epoch 2, batch 1300, loss[loss=0.2556, simple_loss=0.3328, pruned_loss=0.08917, over 21634.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3567, pruned_loss=0.1207, over 4285743.64 frames. ], batch size: 230, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:19:53,475 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:20:13,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=190830.0, ans=0.125 2023-06-18 17:20:14,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=22.5 2023-06-18 17:20:52,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=190890.0, ans=0.0 2023-06-18 17:21:41,132 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.359e+02 4.196e+02 5.340e+02 8.417e+02, threshold=8.392e+02, percent-clipped=1.0 2023-06-18 17:21:58,796 INFO [train.py:996] (2/4) Epoch 2, batch 1350, loss[loss=0.2878, simple_loss=0.3547, pruned_loss=0.1105, over 21821.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3576, pruned_loss=0.1213, over 4288453.89 frames. ], batch size: 107, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:22:03,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=191070.0, ans=0.0 2023-06-18 17:23:08,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=191250.0, ans=0.125 2023-06-18 17:23:15,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=191250.0, ans=0.2 2023-06-18 17:23:46,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=191310.0, ans=0.0 2023-06-18 17:23:50,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=191310.0, ans=0.125 2023-06-18 17:24:00,734 INFO [train.py:996] (2/4) Epoch 2, batch 1400, loss[loss=0.2703, simple_loss=0.3138, pruned_loss=0.1133, over 21733.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3566, pruned_loss=0.1213, over 4283104.02 frames. 
], batch size: 316, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:25:16,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=191550.0, ans=0.2 2023-06-18 17:26:00,684 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.213e+02 3.659e+02 4.697e+02 8.447e+02, threshold=7.317e+02, percent-clipped=1.0 2023-06-18 17:26:06,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=191670.0, ans=0.2 2023-06-18 17:26:07,941 INFO [train.py:996] (2/4) Epoch 2, batch 1450, loss[loss=0.3124, simple_loss=0.3635, pruned_loss=0.1307, over 21292.00 frames. ], tot_loss[loss=0.3023, simple_loss=0.3587, pruned_loss=0.1229, over 4285598.30 frames. ], batch size: 176, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:26:22,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2023-06-18 17:26:35,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=191730.0, ans=0.05 2023-06-18 17:26:39,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=191730.0, ans=0.02 2023-06-18 17:26:44,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=191730.0, ans=0.125 2023-06-18 17:28:12,509 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:28:22,319 INFO [train.py:996] (2/4) Epoch 2, batch 1500, loss[loss=0.3085, simple_loss=0.3561, pruned_loss=0.1305, over 21890.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3606, pruned_loss=0.125, over 4291539.81 frames. ], batch size: 371, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:30:25,612 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 3.308e+02 4.250e+02 5.444e+02 1.048e+03, threshold=8.500e+02, percent-clipped=7.0 2023-06-18 17:30:39,390 INFO [train.py:996] (2/4) Epoch 2, batch 1550, loss[loss=0.2438, simple_loss=0.3022, pruned_loss=0.09275, over 21784.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3564, pruned_loss=0.1222, over 4285034.27 frames. ], batch size: 124, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:32:07,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192450.0, ans=0.1 2023-06-18 17:32:53,940 INFO [train.py:996] (2/4) Epoch 2, batch 1600, loss[loss=0.2739, simple_loss=0.3251, pruned_loss=0.1114, over 21810.00 frames. ], tot_loss[loss=0.2976, simple_loss=0.3537, pruned_loss=0.1207, over 4288881.52 frames. ], batch size: 282, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:34:47,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 3.115e+02 3.710e+02 4.975e+02 8.119e+02, threshold=7.421e+02, percent-clipped=0.0 2023-06-18 17:34:54,917 INFO [train.py:996] (2/4) Epoch 2, batch 1650, loss[loss=0.2902, simple_loss=0.3528, pruned_loss=0.1138, over 21442.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3529, pruned_loss=0.1191, over 4288369.81 frames. 
], batch size: 194, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:34:58,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=192870.0, ans=0.125 2023-06-18 17:36:11,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=192990.0, ans=0.125 2023-06-18 17:36:57,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=193110.0, ans=0.2 2023-06-18 17:37:12,716 INFO [train.py:996] (2/4) Epoch 2, batch 1700, loss[loss=0.414, simple_loss=0.4513, pruned_loss=0.1884, over 21418.00 frames. ], tot_loss[loss=0.301, simple_loss=0.358, pruned_loss=0.122, over 4286853.38 frames. ], batch size: 507, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:37:13,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=193170.0, ans=0.125 2023-06-18 17:37:16,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=193170.0, ans=0.2 2023-06-18 17:38:00,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=193230.0, ans=0.125 2023-06-18 17:38:30,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=193290.0, ans=0.125 2023-06-18 17:39:04,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=193350.0, ans=0.125 2023-06-18 17:39:16,040 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 3.411e+02 4.217e+02 5.262e+02 9.170e+02, threshold=8.435e+02, percent-clipped=4.0 2023-06-18 17:39:17,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=193410.0, ans=0.125 2023-06-18 17:39:23,247 INFO [train.py:996] (2/4) Epoch 2, batch 1750, loss[loss=0.2359, simple_loss=0.3227, pruned_loss=0.07462, over 21815.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.359, pruned_loss=0.1199, over 4286586.24 frames. ], batch size: 316, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:39:48,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=193470.0, ans=0.125 2023-06-18 17:39:53,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=193470.0, ans=0.125 2023-06-18 17:40:00,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=193470.0, ans=0.125 2023-06-18 17:41:02,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=193650.0, ans=0.125 2023-06-18 17:41:04,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-18 17:41:59,279 INFO [train.py:996] (2/4) Epoch 2, batch 1800, loss[loss=0.2492, simple_loss=0.3269, pruned_loss=0.08569, over 21641.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3559, pruned_loss=0.1178, over 4280393.58 frames. 
], batch size: 263, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:42:29,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=193830.0, ans=0.0 2023-06-18 17:43:40,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-18 17:44:08,481 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.686e+02 3.358e+02 3.965e+02 7.229e+02, threshold=6.717e+02, percent-clipped=0.0 2023-06-18 17:44:12,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=194010.0, ans=0.2 2023-06-18 17:44:15,777 INFO [train.py:996] (2/4) Epoch 2, batch 1850, loss[loss=0.2763, simple_loss=0.338, pruned_loss=0.1073, over 21232.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3533, pruned_loss=0.1136, over 4278535.50 frames. ], batch size: 143, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:44:43,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194070.0, ans=0.1 2023-06-18 17:44:48,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=194130.0, ans=0.0 2023-06-18 17:45:52,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=194310.0, ans=0.125 2023-06-18 17:46:25,751 INFO [train.py:996] (2/4) Epoch 2, batch 1900, loss[loss=0.2696, simple_loss=0.3179, pruned_loss=0.1107, over 21875.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3535, pruned_loss=0.1139, over 4280177.69 frames. ], batch size: 118, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:46:29,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=194370.0, ans=0.07 2023-06-18 17:46:36,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194370.0, ans=0.1 2023-06-18 17:47:21,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=194490.0, ans=0.125 2023-06-18 17:47:30,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=194490.0, ans=0.125 2023-06-18 17:47:35,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-18 17:47:39,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=194550.0, ans=0.125 2023-06-18 17:47:39,962 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.72 vs. 
limit=15.0 2023-06-18 17:47:43,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=194550.0, ans=0.125 2023-06-18 17:48:18,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 3.031e+02 3.569e+02 4.348e+02 9.339e+02, threshold=7.139e+02, percent-clipped=5.0 2023-06-18 17:48:21,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=194610.0, ans=0.125 2023-06-18 17:48:25,697 INFO [train.py:996] (2/4) Epoch 2, batch 1950, loss[loss=0.2111, simple_loss=0.2954, pruned_loss=0.06342, over 21640.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3486, pruned_loss=0.1142, over 4274335.08 frames. ], batch size: 263, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:48:48,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=194670.0, ans=0.125 2023-06-18 17:48:50,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=194670.0, ans=0.125 2023-06-18 17:48:58,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-18 17:49:02,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=194730.0, ans=0.0 2023-06-18 17:49:10,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194730.0, ans=0.1 2023-06-18 17:49:25,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=194790.0, ans=0.0 2023-06-18 17:49:44,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=15.0 2023-06-18 17:49:48,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194850.0, ans=0.1 2023-06-18 17:50:19,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=194910.0, ans=0.125 2023-06-18 17:50:23,685 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.71 vs. limit=15.0 2023-06-18 17:50:31,142 INFO [train.py:996] (2/4) Epoch 2, batch 2000, loss[loss=0.399, simple_loss=0.4581, pruned_loss=0.17, over 21508.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3433, pruned_loss=0.1126, over 4270503.29 frames. 
], batch size: 471, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:50:50,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=194970.0, ans=0.125 2023-06-18 17:51:33,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=195030.0, ans=0.0 2023-06-18 17:51:35,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=195090.0, ans=0.125 2023-06-18 17:51:51,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=195150.0, ans=0.125 2023-06-18 17:51:53,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=195150.0, ans=0.125 2023-06-18 17:51:57,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195150.0, ans=0.1 2023-06-18 17:52:30,335 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.996e+02 3.369e+02 4.467e+02 7.434e+02, threshold=6.738e+02, percent-clipped=1.0 2023-06-18 17:52:36,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=195270.0, ans=0.125 2023-06-18 17:52:37,738 INFO [train.py:996] (2/4) Epoch 2, batch 2050, loss[loss=0.3888, simple_loss=0.4184, pruned_loss=0.1796, over 21614.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3441, pruned_loss=0.1121, over 4262992.86 frames. ], batch size: 471, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:52:38,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=195270.0, ans=0.125 2023-06-18 17:52:40,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=195270.0, ans=0.125 2023-06-18 17:52:42,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=195270.0, ans=0.0 2023-06-18 17:52:42,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=195270.0, ans=0.125 2023-06-18 17:53:15,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=195330.0, ans=0.05 2023-06-18 17:53:56,565 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-18 17:54:07,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=195450.0, ans=0.0 2023-06-18 17:54:52,684 INFO [train.py:996] (2/4) Epoch 2, batch 2100, loss[loss=0.2889, simple_loss=0.3809, pruned_loss=0.09847, over 21567.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3489, pruned_loss=0.1142, over 4266509.90 frames. ], batch size: 441, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 17:55:18,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. limit=8.0 2023-06-18 17:55:24,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. 
limit=15.0 2023-06-18 17:55:33,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195630.0, ans=0.1 2023-06-18 17:55:39,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=195690.0, ans=0.0 2023-06-18 17:56:01,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=195690.0, ans=0.2 2023-06-18 17:56:06,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195690.0, ans=0.1 2023-06-18 17:56:38,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=195810.0, ans=0.125 2023-06-18 17:56:46,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=195810.0, ans=0.125 2023-06-18 17:56:47,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.911e+02 3.457e+02 4.182e+02 7.593e+02, threshold=6.915e+02, percent-clipped=4.0 2023-06-18 17:56:54,803 INFO [train.py:996] (2/4) Epoch 2, batch 2150, loss[loss=0.2744, simple_loss=0.3276, pruned_loss=0.1106, over 21811.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.3523, pruned_loss=0.1174, over 4261827.54 frames. ], batch size: 317, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 17:57:25,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=195870.0, ans=0.125 2023-06-18 17:59:06,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=196110.0, ans=0.0 2023-06-18 17:59:17,874 INFO [train.py:996] (2/4) Epoch 2, batch 2200, loss[loss=0.3294, simple_loss=0.3969, pruned_loss=0.131, over 21710.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3535, pruned_loss=0.117, over 4266546.67 frames. ], batch size: 414, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 17:59:32,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=196170.0, ans=0.0 2023-06-18 17:59:33,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=196170.0, ans=0.2 2023-06-18 17:59:35,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.58 vs. 
limit=15.0 2023-06-18 18:00:20,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=196290.0, ans=0.125 2023-06-18 18:00:44,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=196350.0, ans=0.04949747468305833 2023-06-18 18:00:56,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=196410.0, ans=0.125 2023-06-18 18:00:57,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=196410.0, ans=0.125 2023-06-18 18:01:21,551 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.132e+02 3.798e+02 4.923e+02 7.724e+02, threshold=7.595e+02, percent-clipped=3.0 2023-06-18 18:01:29,000 INFO [train.py:996] (2/4) Epoch 2, batch 2250, loss[loss=0.2578, simple_loss=0.3316, pruned_loss=0.09203, over 21728.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3505, pruned_loss=0.1138, over 4268574.15 frames. ], batch size: 332, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 18:02:40,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=196590.0, ans=0.0 2023-06-18 18:03:28,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=196710.0, ans=0.125 2023-06-18 18:03:31,111 INFO [train.py:996] (2/4) Epoch 2, batch 2300, loss[loss=0.2872, simple_loss=0.3283, pruned_loss=0.123, over 21864.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3469, pruned_loss=0.1136, over 4269262.62 frames. ], batch size: 373, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 18:04:28,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=196890.0, ans=0.1 2023-06-18 18:05:16,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.103e+02 3.931e+02 4.901e+02 7.209e+02, threshold=7.862e+02, percent-clipped=0.0 2023-06-18 18:05:29,588 INFO [train.py:996] (2/4) Epoch 2, batch 2350, loss[loss=0.2967, simple_loss=0.3343, pruned_loss=0.1295, over 21506.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3422, pruned_loss=0.1143, over 4272265.35 frames. ], batch size: 391, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 18:05:34,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=197070.0, ans=0.125 2023-06-18 18:06:07,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-18 18:06:36,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.35 vs. 
limit=10.0 2023-06-18 18:06:39,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=197190.0, ans=0.0 2023-06-18 18:07:10,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=197250.0, ans=0.0 2023-06-18 18:07:13,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=197250.0, ans=0.125 2023-06-18 18:07:30,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=197310.0, ans=0.0 2023-06-18 18:07:48,755 INFO [train.py:996] (2/4) Epoch 2, batch 2400, loss[loss=0.2715, simple_loss=0.3085, pruned_loss=0.1172, over 21245.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3452, pruned_loss=0.1176, over 4276100.79 frames. ], batch size: 548, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 18:09:14,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=197490.0, ans=0.125 2023-06-18 18:09:34,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-18 18:09:47,276 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.219e+02 3.807e+02 5.185e+02 8.292e+02, threshold=7.615e+02, percent-clipped=0.0 2023-06-18 18:09:49,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=197610.0, ans=0.125 2023-06-18 18:10:11,964 INFO [train.py:996] (2/4) Epoch 2, batch 2450, loss[loss=0.3323, simple_loss=0.39, pruned_loss=0.1374, over 21277.00 frames. ], tot_loss[loss=0.2958, simple_loss=0.3521, pruned_loss=0.1198, over 4269793.43 frames. ], batch size: 143, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:10:19,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=197670.0, ans=0.125 2023-06-18 18:10:28,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-18 18:10:53,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.86 vs. limit=15.0 2023-06-18 18:11:05,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=197790.0, ans=0.0 2023-06-18 18:11:27,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=197850.0, ans=0.1 2023-06-18 18:11:59,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-06-18 18:12:02,916 INFO [train.py:996] (2/4) Epoch 2, batch 2500, loss[loss=0.293, simple_loss=0.3714, pruned_loss=0.1073, over 21404.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3518, pruned_loss=0.1209, over 4272403.46 frames. 
], batch size: 194, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:12:19,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=197970.0, ans=0.125 2023-06-18 18:12:34,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=198030.0, ans=0.125 2023-06-18 18:12:42,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5 2023-06-18 18:13:19,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=198090.0, ans=0.0 2023-06-18 18:13:39,600 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:13:44,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=198150.0, ans=0.0 2023-06-18 18:14:03,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=198210.0, ans=0.0 2023-06-18 18:14:05,425 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.12 vs. limit=22.5 2023-06-18 18:14:05,620 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 3.063e+02 3.765e+02 4.527e+02 8.351e+02, threshold=7.530e+02, percent-clipped=1.0 2023-06-18 18:14:20,536 INFO [train.py:996] (2/4) Epoch 2, batch 2550, loss[loss=0.2661, simple_loss=0.3191, pruned_loss=0.1066, over 21415.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3507, pruned_loss=0.1197, over 4267518.35 frames. ], batch size: 194, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:14:59,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.04 vs. limit=10.0 2023-06-18 18:15:06,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=198330.0, ans=0.125 2023-06-18 18:15:48,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=198450.0, ans=0.125 2023-06-18 18:16:01,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=198450.0, ans=0.0 2023-06-18 18:16:35,994 INFO [train.py:996] (2/4) Epoch 2, batch 2600, loss[loss=0.3087, simple_loss=0.3516, pruned_loss=0.1329, over 21571.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3516, pruned_loss=0.1216, over 4263713.18 frames. ], batch size: 230, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:17:08,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.65 vs. limit=10.0 2023-06-18 18:18:35,135 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.099e+02 3.578e+02 4.381e+02 8.745e+02, threshold=7.157e+02, percent-clipped=2.0 2023-06-18 18:18:47,910 INFO [train.py:996] (2/4) Epoch 2, batch 2650, loss[loss=0.2773, simple_loss=0.3522, pruned_loss=0.1012, over 21610.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3543, pruned_loss=0.1237, over 4273646.53 frames. 
], batch size: 230, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:19:05,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=198870.0, ans=0.125 2023-06-18 18:20:29,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=199110.0, ans=0.0 2023-06-18 18:20:56,623 INFO [train.py:996] (2/4) Epoch 2, batch 2700, loss[loss=0.3089, simple_loss=0.4161, pruned_loss=0.1009, over 19765.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3518, pruned_loss=0.1214, over 4264411.36 frames. ], batch size: 703, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:21:14,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.91 vs. limit=22.5 2023-06-18 18:21:25,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199290.0, ans=0.1 2023-06-18 18:21:38,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=199290.0, ans=0.125 2023-06-18 18:21:55,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.27 vs. limit=15.0 2023-06-18 18:22:39,834 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.312e+02 2.957e+02 3.708e+02 4.749e+02 8.354e+02, threshold=7.415e+02, percent-clipped=2.0 2023-06-18 18:22:53,013 INFO [train.py:996] (2/4) Epoch 2, batch 2750, loss[loss=0.3279, simple_loss=0.3792, pruned_loss=0.1383, over 21882.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3497, pruned_loss=0.1208, over 4271962.56 frames. ], batch size: 107, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 18:22:56,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=199470.0, ans=0.125 2023-06-18 18:24:56,955 INFO [train.py:996] (2/4) Epoch 2, batch 2800, loss[loss=0.2705, simple_loss=0.3148, pruned_loss=0.1131, over 21729.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3528, pruned_loss=0.1209, over 4272729.39 frames. ], batch size: 124, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:25:58,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-18 18:27:04,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=200010.0, ans=0.125 2023-06-18 18:27:05,419 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.554e+02 3.576e+02 4.220e+02 5.445e+02 9.845e+02, threshold=8.440e+02, percent-clipped=5.0 2023-06-18 18:27:13,002 INFO [train.py:996] (2/4) Epoch 2, batch 2850, loss[loss=0.2482, simple_loss=0.299, pruned_loss=0.09866, over 21627.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3573, pruned_loss=0.1227, over 4276782.66 frames. ], batch size: 230, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:27:56,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=12.03 vs. 
limit=15.0 2023-06-18 18:29:01,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=200310.0, ans=0.2 2023-06-18 18:29:22,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=200370.0, ans=0.025 2023-06-18 18:29:23,631 INFO [train.py:996] (2/4) Epoch 2, batch 2900, loss[loss=0.2848, simple_loss=0.3363, pruned_loss=0.1167, over 21368.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3519, pruned_loss=0.1204, over 4271188.31 frames. ], batch size: 176, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:29:24,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200370.0, ans=0.1 2023-06-18 18:30:01,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200430.0, ans=0.1 2023-06-18 18:30:35,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=200490.0, ans=0.0 2023-06-18 18:31:12,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-18 18:31:26,824 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.217e+02 3.802e+02 4.716e+02 7.016e+02, threshold=7.604e+02, percent-clipped=0.0 2023-06-18 18:31:34,050 INFO [train.py:996] (2/4) Epoch 2, batch 2950, loss[loss=0.267, simple_loss=0.3279, pruned_loss=0.1031, over 21637.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3551, pruned_loss=0.1213, over 4279148.29 frames. ], batch size: 263, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:31:58,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=200670.0, ans=0.0 2023-06-18 18:32:30,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=200730.0, ans=0.04949747468305833 2023-06-18 18:32:33,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=200730.0, ans=0.125 2023-06-18 18:32:34,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.41 vs. 
limit=15.0 2023-06-18 18:32:39,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=200730.0, ans=10.0 2023-06-18 18:32:49,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=200790.0, ans=0.0 2023-06-18 18:32:51,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=200790.0, ans=0.125 2023-06-18 18:33:24,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=200850.0, ans=0.2 2023-06-18 18:33:25,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=200850.0, ans=0.125 2023-06-18 18:33:35,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=200910.0, ans=0.2 2023-06-18 18:34:01,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-18 18:34:01,567 INFO [train.py:996] (2/4) Epoch 2, batch 3000, loss[loss=0.2687, simple_loss=0.3037, pruned_loss=0.1169, over 20008.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3568, pruned_loss=0.1209, over 4275372.47 frames. ], batch size: 702, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:34:01,568 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 18:34:48,148 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7014, 2.5294, 2.7168, 2.7821, 2.3395, 2.2695, 2.8915, 2.7752], device='cuda:2') 2023-06-18 18:34:49,996 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.277, simple_loss=0.3697, pruned_loss=0.09215, over 1796401.00 frames. 2023-06-18 18:34:49,997 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 18:34:52,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=200970.0, ans=0.125 2023-06-18 18:34:52,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-18 18:34:54,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=200970.0, ans=0.125 2023-06-18 18:35:06,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=200970.0, ans=0.0 2023-06-18 18:35:14,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=201030.0, ans=0.0 2023-06-18 18:35:16,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=201030.0, ans=0.0 2023-06-18 18:35:57,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=201150.0, ans=0.0 2023-06-18 18:36:50,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.746e+02 3.199e+02 3.819e+02 7.191e+02, threshold=6.398e+02, percent-clipped=0.0 2023-06-18 18:37:02,541 INFO [train.py:996] (2/4) Epoch 2, batch 3050, loss[loss=0.3591, simple_loss=0.4015, pruned_loss=0.1584, over 21765.00 frames. 
], tot_loss[loss=0.2986, simple_loss=0.358, pruned_loss=0.1196, over 4273923.44 frames. ], batch size: 441, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:37:33,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=201330.0, ans=0.125 2023-06-18 18:38:13,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=201450.0, ans=0.025 2023-06-18 18:38:59,847 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:39:10,278 INFO [train.py:996] (2/4) Epoch 2, batch 3100, loss[loss=0.3716, simple_loss=0.4252, pruned_loss=0.159, over 21555.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.3568, pruned_loss=0.1184, over 4273370.23 frames. ], batch size: 508, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 18:39:12,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=201570.0, ans=0.2 2023-06-18 18:41:08,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.803e+02 3.511e+02 4.276e+02 6.392e+02, threshold=7.021e+02, percent-clipped=0.0 2023-06-18 18:41:28,769 INFO [train.py:996] (2/4) Epoch 2, batch 3150, loss[loss=0.366, simple_loss=0.4098, pruned_loss=0.1611, over 21600.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3557, pruned_loss=0.1178, over 4273796.91 frames. ], batch size: 415, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:41:46,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=201870.0, ans=0.125 2023-06-18 18:41:54,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=201930.0, ans=0.2 2023-06-18 18:42:04,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-18 18:42:13,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=201930.0, ans=0.125 2023-06-18 18:42:13,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=201930.0, ans=0.1 2023-06-18 18:42:14,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=201930.0, ans=0.125 2023-06-18 18:42:28,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=201930.0, ans=0.125 2023-06-18 18:42:48,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=202050.0, ans=0.035 2023-06-18 18:42:48,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=202050.0, ans=0.125 2023-06-18 18:43:03,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=202050.0, ans=0.125 2023-06-18 18:43:03,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=202050.0, ans=0.0 2023-06-18 18:43:03,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. 
limit=15.0 2023-06-18 18:43:14,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202110.0, ans=0.1 2023-06-18 18:43:38,667 INFO [train.py:996] (2/4) Epoch 2, batch 3200, loss[loss=0.2867, simple_loss=0.3529, pruned_loss=0.1102, over 21660.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3576, pruned_loss=0.118, over 4280140.86 frames. ], batch size: 298, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:43:58,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=202170.0, ans=0.04949747468305833 2023-06-18 18:44:42,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=202290.0, ans=0.2 2023-06-18 18:45:20,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=202350.0, ans=0.125 2023-06-18 18:45:50,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 3.485e+02 4.090e+02 5.140e+02 8.255e+02, threshold=8.180e+02, percent-clipped=4.0 2023-06-18 18:46:02,310 INFO [train.py:996] (2/4) Epoch 2, batch 3250, loss[loss=0.3254, simple_loss=0.4053, pruned_loss=0.1228, over 20914.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3595, pruned_loss=0.1202, over 4283075.74 frames. ], batch size: 607, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:46:06,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-18 18:46:09,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202470.0, ans=0.1 2023-06-18 18:47:46,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202710.0, ans=0.1 2023-06-18 18:48:10,960 INFO [train.py:996] (2/4) Epoch 2, batch 3300, loss[loss=0.2442, simple_loss=0.3119, pruned_loss=0.0883, over 21332.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3574, pruned_loss=0.1212, over 4283778.12 frames. ], batch size: 131, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:48:15,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202770.0, ans=0.1 2023-06-18 18:49:42,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=202890.0, ans=0.125 2023-06-18 18:49:43,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.78 vs. 
limit=15.0 2023-06-18 18:49:47,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=202950.0, ans=0.125 2023-06-18 18:50:11,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=203010.0, ans=0.125 2023-06-18 18:50:26,647 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.159e+02 3.765e+02 4.410e+02 7.992e+02, threshold=7.530e+02, percent-clipped=0.0 2023-06-18 18:50:30,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=203010.0, ans=0.125 2023-06-18 18:50:31,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=203070.0, ans=0.09899494936611666 2023-06-18 18:50:32,590 INFO [train.py:996] (2/4) Epoch 2, batch 3350, loss[loss=0.3027, simple_loss=0.3445, pruned_loss=0.1305, over 21240.00 frames. ], tot_loss[loss=0.3023, simple_loss=0.361, pruned_loss=0.1218, over 4276652.53 frames. ], batch size: 608, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:50:32,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=203070.0, ans=0.0 2023-06-18 18:52:47,707 INFO [train.py:996] (2/4) Epoch 2, batch 3400, loss[loss=0.2648, simple_loss=0.3383, pruned_loss=0.09571, over 16629.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3606, pruned_loss=0.1223, over 4278542.00 frames. ], batch size: 60, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:53:04,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=203370.0, ans=0.0 2023-06-18 18:53:41,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=203490.0, ans=0.125 2023-06-18 18:53:41,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=203490.0, ans=0.0 2023-06-18 18:53:42,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=203490.0, ans=0.125 2023-06-18 18:53:52,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203490.0, ans=0.1 2023-06-18 18:53:59,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=203490.0, ans=0.125 2023-06-18 18:54:40,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=203550.0, ans=0.1 2023-06-18 18:54:52,078 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.921e+02 3.464e+02 4.240e+02 8.326e+02, threshold=6.928e+02, percent-clipped=2.0 2023-06-18 18:54:55,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=203610.0, ans=0.0 2023-06-18 18:54:58,313 INFO [train.py:996] (2/4) Epoch 2, batch 3450, loss[loss=0.3385, simple_loss=0.395, pruned_loss=0.141, over 21698.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3552, pruned_loss=0.1211, over 4282782.17 frames. 
], batch size: 332, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:55:21,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=203670.0, ans=0.5 2023-06-18 18:56:21,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=22.5 2023-06-18 18:56:22,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=203790.0, ans=0.0 2023-06-18 18:57:28,384 INFO [train.py:996] (2/4) Epoch 2, batch 3500, loss[loss=0.4311, simple_loss=0.4643, pruned_loss=0.1989, over 21469.00 frames. ], tot_loss[loss=0.3074, simple_loss=0.3638, pruned_loss=0.1255, over 4284988.54 frames. ], batch size: 471, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 18:58:50,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204150.0, ans=0.1 2023-06-18 18:59:25,792 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.105e+02 3.610e+02 4.528e+02 8.050e+02, threshold=7.220e+02, percent-clipped=2.0 2023-06-18 18:59:46,327 INFO [train.py:996] (2/4) Epoch 2, batch 3550, loss[loss=0.2727, simple_loss=0.3316, pruned_loss=0.1068, over 21639.00 frames. ], tot_loss[loss=0.3099, simple_loss=0.3669, pruned_loss=0.1265, over 4283969.32 frames. ], batch size: 247, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:01:43,303 INFO [train.py:996] (2/4) Epoch 2, batch 3600, loss[loss=0.3449, simple_loss=0.3963, pruned_loss=0.1468, over 21350.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.36, pruned_loss=0.1256, over 4280529.35 frames. ], batch size: 549, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:02:07,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=204570.0, ans=0.125 2023-06-18 19:03:01,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=204690.0, ans=0.125 2023-06-18 19:03:28,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=204750.0, ans=0.125 2023-06-18 19:03:58,753 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.966e+02 3.675e+02 4.582e+02 9.024e+02, threshold=7.350e+02, percent-clipped=3.0 2023-06-18 19:04:11,489 INFO [train.py:996] (2/4) Epoch 2, batch 3650, loss[loss=0.2618, simple_loss=0.33, pruned_loss=0.09683, over 21600.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3608, pruned_loss=0.1259, over 4271759.23 frames. ], batch size: 230, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:05:23,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=204990.0, ans=0.125 2023-06-18 19:06:10,724 INFO [train.py:996] (2/4) Epoch 2, batch 3700, loss[loss=0.3276, simple_loss=0.3748, pruned_loss=0.1402, over 21400.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.3583, pruned_loss=0.1238, over 4281496.52 frames. 
], batch size: 549, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:06:50,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=205230.0, ans=0.125 2023-06-18 19:06:53,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=205230.0, ans=0.2 2023-06-18 19:06:58,817 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:07:26,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0 2023-06-18 19:07:27,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205290.0, ans=0.1 2023-06-18 19:07:33,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=205350.0, ans=0.0 2023-06-18 19:07:49,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=205350.0, ans=0.125 2023-06-18 19:07:50,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.30 vs. limit=15.0 2023-06-18 19:07:50,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=205350.0, ans=0.0 2023-06-18 19:07:52,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=205350.0, ans=0.125 2023-06-18 19:08:22,243 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.719e+02 3.274e+02 3.792e+02 6.740e+02, threshold=6.549e+02, percent-clipped=0.0 2023-06-18 19:08:33,923 INFO [train.py:996] (2/4) Epoch 2, batch 3750, loss[loss=0.224, simple_loss=0.2967, pruned_loss=0.07563, over 21785.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3551, pruned_loss=0.1224, over 4287824.50 frames. ], batch size: 282, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:08:43,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205470.0, ans=0.1 2023-06-18 19:09:14,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=205530.0, ans=0.125 2023-06-18 19:09:32,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=205590.0, ans=0.125 2023-06-18 19:10:24,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-18 19:10:33,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-18 19:11:06,933 INFO [train.py:996] (2/4) Epoch 2, batch 3800, loss[loss=0.3103, simple_loss=0.3663, pruned_loss=0.1271, over 21684.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.352, pruned_loss=0.1201, over 4277976.63 frames. 
], batch size: 351, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 19:11:35,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=205830.0, ans=0.2 2023-06-18 19:12:41,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=206010.0, ans=0.1 2023-06-18 19:12:49,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 3.000e+02 3.869e+02 4.812e+02 8.338e+02, threshold=7.738e+02, percent-clipped=10.0 2023-06-18 19:12:49,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=206010.0, ans=0.125 2023-06-18 19:12:59,571 INFO [train.py:996] (2/4) Epoch 2, batch 3850, loss[loss=0.3963, simple_loss=0.4878, pruned_loss=0.1524, over 19777.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3517, pruned_loss=0.1204, over 4279216.26 frames. ], batch size: 702, lr: 1.90e-02, grad_scale: 16.0 2023-06-18 19:13:44,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=206130.0, ans=0.125 2023-06-18 19:14:53,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=206370.0, ans=0.2 2023-06-18 19:14:54,902 INFO [train.py:996] (2/4) Epoch 2, batch 3900, loss[loss=0.2877, simple_loss=0.3572, pruned_loss=0.1091, over 16744.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3492, pruned_loss=0.1206, over 4281108.26 frames. ], batch size: 60, lr: 1.89e-02, grad_scale: 16.0 2023-06-18 19:15:10,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=206370.0, ans=0.125 2023-06-18 19:15:24,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=206370.0, ans=0.125 2023-06-18 19:15:38,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=22.5 2023-06-18 19:16:38,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=206550.0, ans=0.0 2023-06-18 19:17:07,616 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.842e+02 3.227e+02 4.001e+02 6.107e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-18 19:17:23,774 INFO [train.py:996] (2/4) Epoch 2, batch 3950, loss[loss=0.2988, simple_loss=0.3536, pruned_loss=0.122, over 19899.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.354, pruned_loss=0.1208, over 4279379.66 frames. ], batch size: 703, lr: 1.89e-02, grad_scale: 16.0 2023-06-18 19:17:25,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=206670.0, ans=0.125 2023-06-18 19:17:27,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.35 vs. 
limit=15.0 2023-06-18 19:19:07,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=206910.0, ans=0.125 2023-06-18 19:19:09,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=206910.0, ans=0.2 2023-06-18 19:19:28,696 INFO [train.py:996] (2/4) Epoch 2, batch 4000, loss[loss=0.2716, simple_loss=0.3157, pruned_loss=0.1137, over 21876.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3433, pruned_loss=0.1155, over 4272498.14 frames. ], batch size: 373, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:19:43,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=206970.0, ans=0.2 2023-06-18 19:19:43,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=206970.0, ans=0.07 2023-06-18 19:20:15,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=207090.0, ans=0.2 2023-06-18 19:20:52,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=207150.0, ans=0.2 2023-06-18 19:20:56,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=207150.0, ans=0.0 2023-06-18 19:21:10,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=207150.0, ans=0.125 2023-06-18 19:21:38,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.725e+02 3.231e+02 3.855e+02 7.242e+02, threshold=6.463e+02, percent-clipped=2.0 2023-06-18 19:21:42,615 INFO [train.py:996] (2/4) Epoch 2, batch 4050, loss[loss=0.3314, simple_loss=0.3815, pruned_loss=0.1406, over 21527.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3427, pruned_loss=0.1141, over 4263562.75 frames. ], batch size: 507, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:21:56,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=207270.0, ans=0.0 2023-06-18 19:22:06,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.27 vs. limit=10.0 2023-06-18 19:22:18,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=207330.0, ans=0.125 2023-06-18 19:23:53,143 INFO [train.py:996] (2/4) Epoch 2, batch 4100, loss[loss=0.2868, simple_loss=0.3288, pruned_loss=0.1224, over 21269.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3448, pruned_loss=0.1155, over 4265794.90 frames. ], batch size: 159, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:24:19,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=207630.0, ans=0.125 2023-06-18 19:24:33,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=207630.0, ans=0.0 2023-06-18 19:25:02,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.53 vs. 
limit=15.0 2023-06-18 19:25:27,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=207750.0, ans=0.125 2023-06-18 19:25:31,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=207750.0, ans=0.07 2023-06-18 19:25:49,501 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.985e+02 3.648e+02 4.743e+02 7.155e+02, threshold=7.295e+02, percent-clipped=4.0 2023-06-18 19:26:12,885 INFO [train.py:996] (2/4) Epoch 2, batch 4150, loss[loss=0.3158, simple_loss=0.3619, pruned_loss=0.1349, over 21421.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3441, pruned_loss=0.111, over 4274909.80 frames. ], batch size: 508, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:26:29,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.92 vs. limit=22.5 2023-06-18 19:26:29,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=207930.0, ans=0.125 2023-06-18 19:27:42,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=208110.0, ans=0.0 2023-06-18 19:27:49,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=208110.0, ans=0.2 2023-06-18 19:27:52,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=208110.0, ans=0.125 2023-06-18 19:28:13,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.08 vs. limit=6.0 2023-06-18 19:28:14,150 INFO [train.py:996] (2/4) Epoch 2, batch 4200, loss[loss=0.225, simple_loss=0.2938, pruned_loss=0.07813, over 21222.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3411, pruned_loss=0.109, over 4273404.31 frames. ], batch size: 159, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:28:58,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-18 19:29:20,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=208290.0, ans=0.2 2023-06-18 19:30:28,000 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 3.201e+02 3.847e+02 4.698e+02 6.915e+02, threshold=7.694e+02, percent-clipped=0.0 2023-06-18 19:30:32,568 INFO [train.py:996] (2/4) Epoch 2, batch 4250, loss[loss=0.3507, simple_loss=0.3895, pruned_loss=0.156, over 21340.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3528, pruned_loss=0.1139, over 4270795.01 frames. ], batch size: 548, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 19:30:34,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=208470.0, ans=0.0 2023-06-18 19:30:39,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. 
limit=15.0 2023-06-18 19:31:13,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=208530.0, ans=0.125 2023-06-18 19:31:16,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=208530.0, ans=0.125 2023-06-18 19:32:36,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=208710.0, ans=0.2 2023-06-18 19:32:38,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=208710.0, ans=0.125 2023-06-18 19:32:47,715 INFO [train.py:996] (2/4) Epoch 2, batch 4300, loss[loss=0.2653, simple_loss=0.337, pruned_loss=0.09679, over 21442.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3589, pruned_loss=0.1174, over 4271404.99 frames. ], batch size: 211, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:34:21,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=208890.0, ans=0.125 2023-06-18 19:34:54,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=209010.0, ans=0.0 2023-06-18 19:34:58,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=209010.0, ans=0.125 2023-06-18 19:35:14,046 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 3.172e+02 3.778e+02 4.372e+02 7.557e+02, threshold=7.556e+02, percent-clipped=0.0 2023-06-18 19:35:25,682 INFO [train.py:996] (2/4) Epoch 2, batch 4350, loss[loss=0.257, simple_loss=0.3049, pruned_loss=0.1046, over 21362.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3556, pruned_loss=0.1159, over 4263511.06 frames. ], batch size: 177, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:35:54,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=209130.0, ans=0.0 2023-06-18 19:36:13,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209130.0, ans=0.1 2023-06-18 19:36:28,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=209190.0, ans=0.0 2023-06-18 19:36:40,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209250.0, ans=0.1 2023-06-18 19:36:47,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=209250.0, ans=0.125 2023-06-18 19:37:21,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=209310.0, ans=0.025 2023-06-18 19:37:23,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=209310.0, ans=0.2 2023-06-18 19:37:38,253 INFO [train.py:996] (2/4) Epoch 2, batch 4400, loss[loss=0.2698, simple_loss=0.3468, pruned_loss=0.09641, over 21705.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3522, pruned_loss=0.1159, over 4265889.71 frames. ], batch size: 298, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:38:11,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. 
limit=22.5 2023-06-18 19:38:50,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=209550.0, ans=0.125 2023-06-18 19:39:47,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.882e+02 3.350e+02 4.158e+02 7.095e+02, threshold=6.699e+02, percent-clipped=0.0 2023-06-18 19:39:57,952 INFO [train.py:996] (2/4) Epoch 2, batch 4450, loss[loss=0.3258, simple_loss=0.4041, pruned_loss=0.1238, over 21839.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.356, pruned_loss=0.1159, over 4267387.18 frames. ], batch size: 316, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:40:27,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=209670.0, ans=0.2 2023-06-18 19:40:32,826 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.20 vs. limit=22.5 2023-06-18 19:40:42,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=209730.0, ans=0.0 2023-06-18 19:40:46,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=209730.0, ans=0.125 2023-06-18 19:41:30,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=209850.0, ans=0.05 2023-06-18 19:41:37,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-06-18 19:41:39,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0 2023-06-18 19:42:15,033 INFO [train.py:996] (2/4) Epoch 2, batch 4500, loss[loss=0.2819, simple_loss=0.3421, pruned_loss=0.1108, over 21890.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.359, pruned_loss=0.1186, over 4270645.63 frames. ], batch size: 118, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:42:30,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=209970.0, ans=0.125 2023-06-18 19:43:32,692 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.66 vs. limit=15.0 2023-06-18 19:43:51,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=210150.0, ans=0.05 2023-06-18 19:43:54,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=210150.0, ans=0.125 2023-06-18 19:44:15,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=210210.0, ans=0.125 2023-06-18 19:44:16,218 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.865e+02 3.524e+02 4.358e+02 7.119e+02, threshold=7.048e+02, percent-clipped=2.0 2023-06-18 19:44:38,592 INFO [train.py:996] (2/4) Epoch 2, batch 4550, loss[loss=0.3403, simple_loss=0.3947, pruned_loss=0.1429, over 21740.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3617, pruned_loss=0.1187, over 4278900.95 frames. 
], batch size: 298, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:45:42,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=210390.0, ans=0.125 2023-06-18 19:46:49,063 INFO [train.py:996] (2/4) Epoch 2, batch 4600, loss[loss=0.2536, simple_loss=0.3177, pruned_loss=0.09472, over 21511.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3647, pruned_loss=0.1201, over 4281226.80 frames. ], batch size: 195, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:48:35,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=210750.0, ans=0.0 2023-06-18 19:48:57,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=210810.0, ans=0.125 2023-06-18 19:48:59,910 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.115e+02 3.942e+02 4.901e+02 8.215e+02, threshold=7.883e+02, percent-clipped=5.0 2023-06-18 19:49:07,880 INFO [train.py:996] (2/4) Epoch 2, batch 4650, loss[loss=0.1596, simple_loss=0.2157, pruned_loss=0.05171, over 16181.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3608, pruned_loss=0.1177, over 4274546.59 frames. ], batch size: 60, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 19:49:40,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.36 vs. limit=10.0 2023-06-18 19:50:14,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=210990.0, ans=0.035 2023-06-18 19:51:14,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=211110.0, ans=0.0 2023-06-18 19:51:18,631 INFO [train.py:996] (2/4) Epoch 2, batch 4700, loss[loss=0.244, simple_loss=0.3018, pruned_loss=0.09304, over 21406.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3483, pruned_loss=0.1136, over 4274121.22 frames. ], batch size: 131, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 19:51:25,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=211170.0, ans=10.0 2023-06-18 19:51:25,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=211170.0, ans=0.125 2023-06-18 19:51:33,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=211230.0, ans=0.07 2023-06-18 19:51:40,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. 
limit=15.0 2023-06-18 19:52:57,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=211410.0, ans=0.0 2023-06-18 19:53:11,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=211410.0, ans=0.5 2023-06-18 19:53:11,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=211410.0, ans=0.125 2023-06-18 19:53:12,756 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.814e+02 3.366e+02 4.246e+02 7.056e+02, threshold=6.733e+02, percent-clipped=0.0 2023-06-18 19:53:13,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=211410.0, ans=0.05 2023-06-18 19:53:17,074 INFO [train.py:996] (2/4) Epoch 2, batch 4750, loss[loss=0.2571, simple_loss=0.3124, pruned_loss=0.1009, over 21655.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3438, pruned_loss=0.1144, over 4281329.62 frames. ], batch size: 230, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 19:53:58,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211530.0, ans=0.1 2023-06-18 19:54:26,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=211590.0, ans=0.125 2023-06-18 19:55:31,822 INFO [train.py:996] (2/4) Epoch 2, batch 4800, loss[loss=0.2459, simple_loss=0.32, pruned_loss=0.08591, over 21533.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3461, pruned_loss=0.1157, over 4280526.06 frames. ], batch size: 230, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 19:55:42,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211770.0, ans=0.1 2023-06-18 19:56:59,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-18 19:57:11,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212010.0, ans=0.1 2023-06-18 19:57:14,611 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:57:24,066 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.885e+02 3.329e+02 4.193e+02 7.094e+02, threshold=6.657e+02, percent-clipped=1.0 2023-06-18 19:57:28,261 INFO [train.py:996] (2/4) Epoch 2, batch 4850, loss[loss=0.3561, simple_loss=0.3948, pruned_loss=0.1587, over 21372.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3453, pruned_loss=0.1149, over 4276029.20 frames. ], batch size: 507, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 19:58:13,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-18 19:59:03,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=212310.0, ans=0.015 2023-06-18 19:59:36,624 INFO [train.py:996] (2/4) Epoch 2, batch 4900, loss[loss=0.2833, simple_loss=0.3637, pruned_loss=0.1014, over 21582.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3477, pruned_loss=0.1168, over 4281358.02 frames. 
], batch size: 230, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 20:00:02,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=212430.0, ans=0.0 2023-06-18 20:00:37,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=212490.0, ans=0.125 2023-06-18 20:00:37,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=212490.0, ans=0.07 2023-06-18 20:00:40,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=212490.0, ans=0.2 2023-06-18 20:01:48,727 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.116e+02 3.732e+02 4.478e+02 7.040e+02, threshold=7.463e+02, percent-clipped=1.0 2023-06-18 20:01:53,123 INFO [train.py:996] (2/4) Epoch 2, batch 4950, loss[loss=0.3294, simple_loss=0.3945, pruned_loss=0.1322, over 21450.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3514, pruned_loss=0.1149, over 4280221.74 frames. ], batch size: 507, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 20:02:52,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=212730.0, ans=0.125 2023-06-18 20:03:16,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=212850.0, ans=0.125 2023-06-18 20:03:59,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.17 vs. limit=10.0 2023-06-18 20:04:10,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=212910.0, ans=6.0 2023-06-18 20:04:13,524 INFO [train.py:996] (2/4) Epoch 2, batch 5000, loss[loss=0.2475, simple_loss=0.326, pruned_loss=0.08451, over 21403.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3499, pruned_loss=0.1112, over 4283376.77 frames. ], batch size: 131, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 20:04:47,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=213030.0, ans=0.0 2023-06-18 20:05:11,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=213030.0, ans=0.09899494936611666 2023-06-18 20:05:18,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=213090.0, ans=0.2 2023-06-18 20:05:43,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=213150.0, ans=0.0 2023-06-18 20:06:04,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=213210.0, ans=0.125 2023-06-18 20:06:09,534 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.689e+02 3.169e+02 3.752e+02 6.029e+02, threshold=6.337e+02, percent-clipped=0.0 2023-06-18 20:06:20,640 INFO [train.py:996] (2/4) Epoch 2, batch 5050, loss[loss=0.3043, simple_loss=0.3614, pruned_loss=0.1236, over 21851.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3494, pruned_loss=0.1137, over 4286951.26 frames. 
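The many scaling.py entries of the form "ScheduledFloat: name=..., batch_count=..., ans=..." record hyperparameters (dropout probabilities, skip rates, balancer limits, whitening limits) whose current value is resolved from the global batch count. The small class below sketches one way such a batch-count-keyed schedule can work, assuming piecewise-linear interpolation between breakpoints; it is a simplified stand-in for illustration, not the scaling.py implementation.

class PiecewiseLinearSchedule:
    """Value that depends on the global batch count, e.g. a skip rate that
    decays as training progresses. Breakpoints are (batch_count, value) pairs;
    values are interpolated linearly in between and held constant outside
    the given range."""

    def __init__(self, *points):
        self.points = sorted(points)  # e.g. (0, 0.5), (20000, 0.0)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# Hypothetical usage: a skip rate that starts at 0.5 and settles at 0.0 by batch 20000.
skip_rate = PiecewiseLinearSchedule((0, 0.5), (20000, 0.0))
print(skip_rate(209670.0))  # -> 0.0 once past the last breakpoint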
], batch size: 118, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 20:06:22,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=213270.0, ans=0.1 2023-06-18 20:06:58,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=213330.0, ans=0.0 2023-06-18 20:08:24,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2023-06-18 20:08:31,988 INFO [train.py:996] (2/4) Epoch 2, batch 5100, loss[loss=0.2324, simple_loss=0.3055, pruned_loss=0.0797, over 21803.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3474, pruned_loss=0.1145, over 4287120.54 frames. ], batch size: 298, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:08:55,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213630.0, ans=0.1 2023-06-18 20:09:27,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=15.0 2023-06-18 20:10:25,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=213810.0, ans=0.125 2023-06-18 20:10:29,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213810.0, ans=0.1 2023-06-18 20:10:34,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.968e+02 3.545e+02 4.685e+02 7.236e+02, threshold=7.090e+02, percent-clipped=4.0 2023-06-18 20:10:43,516 INFO [train.py:996] (2/4) Epoch 2, batch 5150, loss[loss=0.3702, simple_loss=0.4035, pruned_loss=0.1685, over 21562.00 frames. ], tot_loss[loss=0.289, simple_loss=0.347, pruned_loss=0.1155, over 4284351.68 frames. ], batch size: 471, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:11:10,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=213930.0, ans=0.2 2023-06-18 20:11:53,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=213990.0, ans=0.125 2023-06-18 20:13:01,834 INFO [train.py:996] (2/4) Epoch 2, batch 5200, loss[loss=0.3491, simple_loss=0.4235, pruned_loss=0.1373, over 21223.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3492, pruned_loss=0.1162, over 4283473.54 frames. ], batch size: 548, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:13:36,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.48 vs. 
limit=22.5 2023-06-18 20:13:42,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=214230.0, ans=0.125 2023-06-18 20:13:58,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=214290.0, ans=0.125 2023-06-18 20:14:09,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=214290.0, ans=0.0 2023-06-18 20:14:59,701 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.964e+02 3.541e+02 4.269e+02 6.155e+02, threshold=7.082e+02, percent-clipped=0.0 2023-06-18 20:15:04,634 INFO [train.py:996] (2/4) Epoch 2, batch 5250, loss[loss=0.3428, simple_loss=0.3926, pruned_loss=0.1465, over 21718.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3515, pruned_loss=0.113, over 4281680.39 frames. ], batch size: 441, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:15:05,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-18 20:15:12,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=214470.0, ans=0.125 2023-06-18 20:16:18,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=8.0 2023-06-18 20:16:20,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-18 20:16:37,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.15 vs. limit=22.5 2023-06-18 20:17:18,946 INFO [train.py:996] (2/4) Epoch 2, batch 5300, loss[loss=0.3373, simple_loss=0.3727, pruned_loss=0.151, over 21766.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3515, pruned_loss=0.115, over 4283637.34 frames. ], batch size: 441, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:17:27,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-18 20:17:54,376 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-18 20:18:19,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=214830.0, ans=12.0 2023-06-18 20:18:44,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=214890.0, ans=0.2 2023-06-18 20:19:09,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=215010.0, ans=0.125 2023-06-18 20:19:17,122 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.854e+02 3.243e+02 3.685e+02 6.916e+02, threshold=6.486e+02, percent-clipped=0.0 2023-06-18 20:19:21,305 INFO [train.py:996] (2/4) Epoch 2, batch 5350, loss[loss=0.2766, simple_loss=0.3273, pruned_loss=0.113, over 21859.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3508, pruned_loss=0.1169, over 4285377.32 frames. 
], batch size: 124, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:20:04,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=215130.0, ans=0.125 2023-06-18 20:20:38,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=215190.0, ans=0.125 2023-06-18 20:21:41,611 INFO [train.py:996] (2/4) Epoch 2, batch 5400, loss[loss=0.2772, simple_loss=0.3322, pruned_loss=0.1111, over 21480.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3519, pruned_loss=0.1189, over 4285113.25 frames. ], batch size: 131, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:22:03,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=215370.0, ans=0.0 2023-06-18 20:23:23,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=215550.0, ans=0.125 2023-06-18 20:23:43,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 2.870e+02 3.526e+02 4.433e+02 7.880e+02, threshold=7.051e+02, percent-clipped=2.0 2023-06-18 20:24:07,770 INFO [train.py:996] (2/4) Epoch 2, batch 5450, loss[loss=0.3787, simple_loss=0.4974, pruned_loss=0.13, over 19687.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.3536, pruned_loss=0.1165, over 4288355.61 frames. ], batch size: 702, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:24:17,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=215670.0, ans=0.0 2023-06-18 20:24:36,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=215730.0, ans=0.125 2023-06-18 20:26:08,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=215910.0, ans=0.0 2023-06-18 20:26:09,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=215910.0, ans=0.125 2023-06-18 20:26:20,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=215910.0, ans=0.125 2023-06-18 20:26:26,985 INFO [train.py:996] (2/4) Epoch 2, batch 5500, loss[loss=0.3094, simple_loss=0.3908, pruned_loss=0.114, over 21660.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3581, pruned_loss=0.1125, over 4287666.16 frames. 
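Each per-batch entry carries both a fresh "loss[... over M frames]" value and a "tot_loss[... over N frames]" aggregate; the aggregate's frame count hovers around a few million rather than growing without bound, which is consistent with a frame-weighted statistic that decays older batches. The class below keeps such a running aggregate; the decay rule is an illustrative assumption, not the exact bookkeeping in train.py.

class RunningLoss:
    """Frame-weighted running average of training losses (illustrative sketch).

    update() folds in one batch; the stored frame count plays the role of the
    'over N frames' field in the tot_loss entries. The exponential decay keeps
    the statistic biased toward recent batches (an assumption for this sketch)."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0   # frame-weighted sum of per-frame losses
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)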
], batch size: 389, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:26:30,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=215970.0, ans=0.125 2023-06-18 20:26:32,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=215970.0, ans=0.0 2023-06-18 20:26:33,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=215970.0, ans=0.0 2023-06-18 20:28:27,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=216210.0, ans=0.125 2023-06-18 20:28:51,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.788e+02 3.266e+02 4.093e+02 7.420e+02, threshold=6.532e+02, percent-clipped=2.0 2023-06-18 20:29:02,090 INFO [train.py:996] (2/4) Epoch 2, batch 5550, loss[loss=0.2546, simple_loss=0.3432, pruned_loss=0.08298, over 21670.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.3548, pruned_loss=0.1095, over 4284146.54 frames. ], batch size: 298, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:30:17,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-18 20:31:13,796 INFO [train.py:996] (2/4) Epoch 2, batch 5600, loss[loss=0.3699, simple_loss=0.4402, pruned_loss=0.1498, over 21645.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3527, pruned_loss=0.1063, over 4280940.40 frames. ], batch size: 441, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:31:19,435 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.08 vs. limit=22.5 2023-06-18 20:32:12,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=216690.0, ans=0.125 2023-06-18 20:32:38,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=216750.0, ans=0.0 2023-06-18 20:32:49,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=216810.0, ans=6.0 2023-06-18 20:33:05,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.046e+02 3.574e+02 4.352e+02 8.183e+02, threshold=7.147e+02, percent-clipped=1.0 2023-06-18 20:33:20,524 INFO [train.py:996] (2/4) Epoch 2, batch 5650, loss[loss=0.3154, simple_loss=0.3594, pruned_loss=0.1357, over 21851.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3552, pruned_loss=0.1089, over 4288499.40 frames. ], batch size: 371, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:33:52,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-18 20:34:13,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=216990.0, ans=0.125 2023-06-18 20:35:31,735 INFO [train.py:996] (2/4) Epoch 2, batch 5700, loss[loss=0.3369, simple_loss=0.4222, pruned_loss=0.1258, over 20796.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3553, pruned_loss=0.1115, over 4286688.42 frames. 
], batch size: 608, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:36:42,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=217290.0, ans=10.0 2023-06-18 20:37:00,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=217290.0, ans=0.125 2023-06-18 20:37:51,401 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 3.053e+02 4.055e+02 5.336e+02 8.968e+02, threshold=8.109e+02, percent-clipped=6.0 2023-06-18 20:37:55,691 INFO [train.py:996] (2/4) Epoch 2, batch 5750, loss[loss=0.3084, simple_loss=0.3774, pruned_loss=0.1198, over 21482.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3517, pruned_loss=0.1075, over 4288038.31 frames. ], batch size: 508, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:38:13,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=217470.0, ans=0.05 2023-06-18 20:39:33,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=217650.0, ans=0.125 2023-06-18 20:39:41,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=217650.0, ans=0.125 2023-06-18 20:40:27,369 INFO [train.py:996] (2/4) Epoch 2, batch 5800, loss[loss=0.3231, simple_loss=0.3968, pruned_loss=0.1247, over 21607.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3489, pruned_loss=0.1045, over 4281792.14 frames. ], batch size: 441, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:42:37,689 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 2.376e+02 3.080e+02 4.315e+02 9.402e+02, threshold=6.161e+02, percent-clipped=2.0 2023-06-18 20:42:42,061 INFO [train.py:996] (2/4) Epoch 2, batch 5850, loss[loss=0.2122, simple_loss=0.3161, pruned_loss=0.05412, over 21770.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3462, pruned_loss=0.1002, over 4285933.06 frames. ], batch size: 282, lr: 1.85e-02, grad_scale: 64.0 2023-06-18 20:43:09,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=218130.0, ans=0.1 2023-06-18 20:44:52,892 INFO [train.py:996] (2/4) Epoch 2, batch 5900, loss[loss=0.2597, simple_loss=0.3156, pruned_loss=0.1019, over 21206.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.338, pruned_loss=0.09324, over 4286251.00 frames. ], batch size: 143, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 20:45:20,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=218370.0, ans=0.0 2023-06-18 20:45:28,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=218430.0, ans=0.0 2023-06-18 20:46:03,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-06-18 20:46:13,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. 
limit=15.0 2023-06-18 20:46:44,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=218610.0, ans=0.0 2023-06-18 20:46:52,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 2.321e+02 3.028e+02 3.868e+02 8.968e+02, threshold=6.057e+02, percent-clipped=3.0 2023-06-18 20:46:57,508 INFO [train.py:996] (2/4) Epoch 2, batch 5950, loss[loss=0.2535, simple_loss=0.3087, pruned_loss=0.09918, over 21565.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3391, pruned_loss=0.09851, over 4288168.49 frames. ], batch size: 195, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 20:48:12,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=218850.0, ans=10.0 2023-06-18 20:48:43,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.66 vs. limit=10.0 2023-06-18 20:48:52,202 INFO [train.py:996] (2/4) Epoch 2, batch 6000, loss[loss=0.2755, simple_loss=0.3194, pruned_loss=0.1158, over 21763.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3363, pruned_loss=0.1026, over 4291994.54 frames. ], batch size: 371, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 20:48:52,203 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 20:49:42,897 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.5958, 2.1972, 2.5746, 2.6200, 2.1057, 2.1964, 2.6672, 2.5422], device='cuda:2') 2023-06-18 20:49:47,782 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2855, simple_loss=0.3796, pruned_loss=0.09574, over 1796401.00 frames. 2023-06-18 20:49:47,787 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 20:50:21,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=219030.0, ans=0.5 2023-06-18 20:50:35,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=219090.0, ans=0.125 2023-06-18 20:51:21,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=219210.0, ans=0.2 2023-06-18 20:51:42,978 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.988e+02 3.378e+02 4.160e+02 8.273e+02, threshold=6.755e+02, percent-clipped=6.0 2023-06-18 20:51:45,878 INFO [train.py:996] (2/4) Epoch 2, batch 6050, loss[loss=0.2452, simple_loss=0.3093, pruned_loss=0.09053, over 21602.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3313, pruned_loss=0.1038, over 4282070.40 frames. ], batch size: 391, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 20:51:52,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=219270.0, ans=0.2 2023-06-18 20:52:19,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=219330.0, ans=0.125 2023-06-18 20:52:48,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=219390.0, ans=0.0 2023-06-18 20:53:45,888 INFO [train.py:996] (2/4) Epoch 2, batch 6100, loss[loss=0.2448, simple_loss=0.2784, pruned_loss=0.1056, over 20030.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.329, pruned_loss=0.1019, over 4281133.17 frames. 
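The "Computing validation loss" / "validation: loss=..., over 1796401.00 frames" pair at batch 6000 corresponds to a periodic pass over the dev set with the model in eval mode, accumulating a frame-weighted loss (the peak-memory line is printed right after it). A generic sketch of that pattern follows; the function and the compute_loss signature are assumptions for illustration, not the actual train.py code.

import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader, compute_loss):
    """Run one pass over the dev loader and return the frame-weighted average
    loss, mirroring the 'validation: loss=... over N frames' entries.

    compute_loss(model, batch) is assumed to return (loss_sum, num_frames),
    where loss_sum is the loss summed over the batch's frames."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in dev_loader:
        loss_sum, num_frames = compute_loss(model, batch)
        tot_loss += float(loss_sum)
        tot_frames += float(num_frames)
    model.train()
    return tot_loss / max(tot_frames, 1.0)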
], batch size: 703, lr: 1.84e-02, grad_scale: 16.0 2023-06-18 20:54:52,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219690.0, ans=0.1 2023-06-18 20:55:08,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=219750.0, ans=0.125 2023-06-18 20:55:45,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219810.0, ans=0.1 2023-06-18 20:55:54,860 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 3.086e+02 4.066e+02 5.102e+02 8.027e+02, threshold=8.133e+02, percent-clipped=8.0 2023-06-18 20:55:56,245 INFO [train.py:996] (2/4) Epoch 2, batch 6150, loss[loss=0.2421, simple_loss=0.3104, pruned_loss=0.08691, over 21614.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3337, pruned_loss=0.106, over 4285608.97 frames. ], batch size: 230, lr: 1.84e-02, grad_scale: 16.0 2023-06-18 20:56:52,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=219990.0, ans=0.02 2023-06-18 20:57:23,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220050.0, ans=0.1 2023-06-18 20:57:23,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=220050.0, ans=0.125 2023-06-18 20:58:07,513 INFO [train.py:996] (2/4) Epoch 2, batch 6200, loss[loss=0.2951, simple_loss=0.378, pruned_loss=0.1061, over 19888.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3372, pruned_loss=0.1059, over 4277445.74 frames. ], batch size: 702, lr: 1.84e-02, grad_scale: 16.0 2023-06-18 20:58:42,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=220230.0, ans=0.125 2023-06-18 20:59:43,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=22.5 2023-06-18 21:00:26,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=220410.0, ans=0.125 2023-06-18 21:00:27,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.590e+02 3.175e+02 3.801e+02 7.451e+02, threshold=6.350e+02, percent-clipped=0.0 2023-06-18 21:00:28,868 INFO [train.py:996] (2/4) Epoch 2, batch 6250, loss[loss=0.2637, simple_loss=0.3598, pruned_loss=0.0838, over 21623.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3411, pruned_loss=0.1047, over 4278035.50 frames. ], batch size: 263, lr: 1.84e-02, grad_scale: 16.0 2023-06-18 21:01:23,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-18 21:01:44,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=220590.0, ans=0.2 2023-06-18 21:02:41,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=220710.0, ans=0.0 2023-06-18 21:02:43,590 INFO [train.py:996] (2/4) Epoch 2, batch 6300, loss[loss=0.3059, simple_loss=0.346, pruned_loss=0.1329, over 21582.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3439, pruned_loss=0.1031, over 4282553.17 frames. 
], batch size: 548, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:03:51,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=220890.0, ans=0.125 2023-06-18 21:03:59,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-18 21:04:05,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=220950.0, ans=0.2 2023-06-18 21:04:37,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2023-06-18 21:04:38,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=221010.0, ans=0.125 2023-06-18 21:04:39,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.847e+02 3.430e+02 4.458e+02 8.729e+02, threshold=6.860e+02, percent-clipped=4.0 2023-06-18 21:04:40,736 INFO [train.py:996] (2/4) Epoch 2, batch 6350, loss[loss=0.2912, simple_loss=0.3608, pruned_loss=0.1108, over 21462.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3498, pruned_loss=0.1081, over 4281814.84 frames. ], batch size: 194, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:05:17,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-18 21:05:53,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=221190.0, ans=0.0 2023-06-18 21:06:15,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=221250.0, ans=0.125 2023-06-18 21:06:29,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=221250.0, ans=0.125 2023-06-18 21:06:36,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=221310.0, ans=0.125 2023-06-18 21:07:00,535 INFO [train.py:996] (2/4) Epoch 2, batch 6400, loss[loss=0.3485, simple_loss=0.3972, pruned_loss=0.1499, over 21841.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3558, pruned_loss=0.1138, over 4285806.02 frames. 
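The grad_scale field in the batch entries moves in powers of two (32.0 through most of this stretch, 64.0 at batch 5850, back to 32.0 at batch 5900, then 16.0 by batch 6100), which is the signature of dynamic loss scaling for fp16 training: the scale grows after a run of finite gradients and is halved when an overflow is detected. A generic mixed-precision step using torch.cuda.amp is sketched below for orientation; it shows the standard PyTorch pattern and is not a claim about how train.py structures its loop.

import torch

def fp16_train_step(model, optimizer, scaler, compute_loss, batch, clip=5.0):
    """One mixed-precision step with dynamic loss scaling (generic sketch).

    scaler is a torch.cuda.amp.GradScaler; scaler.get_scale() is the kind of
    value that appears as grad_scale in the entries above."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                        # so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    scaler.step(optimizer)                            # skipped if gradients overflowed
    scaler.update()                                   # grow or back off the scale
    return loss.detach(), scaler.get_scale()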
], batch size: 124, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 21:07:38,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=221430.0, ans=0.125 2023-06-18 21:07:40,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=221430.0, ans=0.125 2023-06-18 21:07:59,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=221490.0, ans=0.125 2023-06-18 21:08:01,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=221490.0, ans=0.0 2023-06-18 21:08:10,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=221490.0, ans=0.125 2023-06-18 21:08:27,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=221550.0, ans=0.0 2023-06-18 21:09:22,170 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.900e+02 3.291e+02 3.845e+02 6.146e+02, threshold=6.583e+02, percent-clipped=0.0 2023-06-18 21:09:22,200 INFO [train.py:996] (2/4) Epoch 2, batch 6450, loss[loss=0.267, simple_loss=0.3377, pruned_loss=0.09817, over 21724.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.3578, pruned_loss=0.1129, over 4290144.06 frames. ], batch size: 332, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:11:01,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=221910.0, ans=0.125 2023-06-18 21:11:10,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=221910.0, ans=0.2 2023-06-18 21:11:21,741 INFO [train.py:996] (2/4) Epoch 2, batch 6500, loss[loss=0.2389, simple_loss=0.2874, pruned_loss=0.09521, over 14950.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3485, pruned_loss=0.1107, over 4285994.43 frames. ], batch size: 61, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:11:30,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-18 21:11:47,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=222030.0, ans=0.0 2023-06-18 21:12:15,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=222090.0, ans=0.125 2023-06-18 21:12:40,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=222150.0, ans=0.125 2023-06-18 21:12:45,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-06-18 21:13:14,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222210.0, ans=0.1 2023-06-18 21:13:40,649 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.666e+02 3.386e+02 4.253e+02 7.478e+02, threshold=6.772e+02, percent-clipped=2.0 2023-06-18 21:13:40,673 INFO [train.py:996] (2/4) Epoch 2, batch 6550, loss[loss=0.2557, simple_loss=0.3303, pruned_loss=0.09056, over 21761.00 frames. 
], tot_loss[loss=0.2827, simple_loss=0.346, pruned_loss=0.1097, over 4273611.71 frames. ], batch size: 298, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:13:42,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=222270.0, ans=10.0 2023-06-18 21:13:57,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=222270.0, ans=22.5 2023-06-18 21:14:42,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.07 vs. limit=15.0 2023-06-18 21:15:44,877 INFO [train.py:996] (2/4) Epoch 2, batch 6600, loss[loss=0.2447, simple_loss=0.2939, pruned_loss=0.09778, over 21150.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3412, pruned_loss=0.1099, over 4278891.80 frames. ], batch size: 159, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:16:24,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=222630.0, ans=0.5 2023-06-18 21:16:45,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=222690.0, ans=0.125 2023-06-18 21:17:23,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-18 21:17:42,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.521e+02 3.069e+02 3.605e+02 6.340e+02, threshold=6.138e+02, percent-clipped=0.0 2023-06-18 21:17:42,559 INFO [train.py:996] (2/4) Epoch 2, batch 6650, loss[loss=0.2578, simple_loss=0.3082, pruned_loss=0.1037, over 21686.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3342, pruned_loss=0.1071, over 4272590.40 frames. ], batch size: 299, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:18:14,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=222870.0, ans=0.05 2023-06-18 21:18:43,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-18 21:18:54,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=222990.0, ans=0.0 2023-06-18 21:19:16,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=223050.0, ans=0.125 2023-06-18 21:19:53,889 INFO [train.py:996] (2/4) Epoch 2, batch 6700, loss[loss=0.2564, simple_loss=0.3035, pruned_loss=0.1046, over 21771.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.331, pruned_loss=0.1078, over 4262940.52 frames. ], batch size: 317, lr: 1.82e-02, grad_scale: 16.0 2023-06-18 21:20:09,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=223170.0, ans=0.125 2023-06-18 21:21:06,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=223350.0, ans=0.04949747468305833 2023-06-18 21:21:10,844 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. 
limit=22.5 2023-06-18 21:21:32,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=223410.0, ans=0.0 2023-06-18 21:21:59,657 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.168e+02 3.639e+02 4.397e+02 6.633e+02, threshold=7.278e+02, percent-clipped=4.0 2023-06-18 21:21:59,681 INFO [train.py:996] (2/4) Epoch 2, batch 6750, loss[loss=0.2438, simple_loss=0.2944, pruned_loss=0.09663, over 21442.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3295, pruned_loss=0.1074, over 4252484.77 frames. ], batch size: 212, lr: 1.82e-02, grad_scale: 16.0 2023-06-18 21:23:06,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-18 21:23:43,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=223710.0, ans=0.0 2023-06-18 21:23:46,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=223710.0, ans=0.125 2023-06-18 21:24:15,893 INFO [train.py:996] (2/4) Epoch 2, batch 6800, loss[loss=0.2841, simple_loss=0.3339, pruned_loss=0.1172, over 21245.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3331, pruned_loss=0.1122, over 4256200.17 frames. ], batch size: 176, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:24:44,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=223830.0, ans=0.125 2023-06-18 21:24:51,432 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:24:56,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=223890.0, ans=0.5 2023-06-18 21:25:14,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=223890.0, ans=10.0 2023-06-18 21:25:18,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-18 21:25:21,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=223950.0, ans=0.0 2023-06-18 21:26:06,243 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.622e+02 3.115e+02 3.857e+02 6.286e+02, threshold=6.230e+02, percent-clipped=0.0 2023-06-18 21:26:06,266 INFO [train.py:996] (2/4) Epoch 2, batch 6850, loss[loss=0.2809, simple_loss=0.3209, pruned_loss=0.1205, over 21508.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3317, pruned_loss=0.1135, over 4266084.83 frames. ], batch size: 548, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:28:11,364 INFO [train.py:996] (2/4) Epoch 2, batch 6900, loss[loss=0.2151, simple_loss=0.2894, pruned_loss=0.07045, over 21298.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3316, pruned_loss=0.113, over 4276116.83 frames. ], batch size: 176, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:29:28,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=224490.0, ans=0.0 2023-06-18 21:30:21,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.93 vs. 
limit=8.0 2023-06-18 21:30:45,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=224610.0, ans=0.2 2023-06-18 21:30:47,617 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.868e+02 2.531e+02 2.951e+02 3.498e+02 5.383e+02, threshold=5.901e+02, percent-clipped=0.0 2023-06-18 21:30:47,639 INFO [train.py:996] (2/4) Epoch 2, batch 6950, loss[loss=0.2109, simple_loss=0.3041, pruned_loss=0.05887, over 21639.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3321, pruned_loss=0.1089, over 4265077.49 frames. ], batch size: 263, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:32:49,129 INFO [train.py:996] (2/4) Epoch 2, batch 7000, loss[loss=0.2668, simple_loss=0.3163, pruned_loss=0.1086, over 21752.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3385, pruned_loss=0.1128, over 4268015.30 frames. ], batch size: 112, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:32:49,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=224970.0, ans=0.0 2023-06-18 21:32:57,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=224970.0, ans=0.125 2023-06-18 21:33:47,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=225090.0, ans=0.125 2023-06-18 21:35:00,447 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.871e+02 3.500e+02 4.345e+02 7.832e+02, threshold=7.000e+02, percent-clipped=4.0 2023-06-18 21:35:00,471 INFO [train.py:996] (2/4) Epoch 2, batch 7050, loss[loss=0.2427, simple_loss=0.3424, pruned_loss=0.07157, over 21225.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3365, pruned_loss=0.1099, over 4267324.70 frames. ], batch size: 548, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:36:04,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=225390.0, ans=0.125 2023-06-18 21:36:32,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=225450.0, ans=0.0 2023-06-18 21:37:14,080 INFO [train.py:996] (2/4) Epoch 2, batch 7100, loss[loss=0.2666, simple_loss=0.332, pruned_loss=0.1005, over 21254.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.341, pruned_loss=0.1108, over 4267420.46 frames. ], batch size: 159, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:37:36,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=225630.0, ans=0.0 2023-06-18 21:37:37,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=225630.0, ans=0.125 2023-06-18 21:38:14,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=225690.0, ans=0.035 2023-06-18 21:38:58,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=225750.0, ans=0.0 2023-06-18 21:39:17,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.803e+02 3.580e+02 4.596e+02 1.041e+03, threshold=7.161e+02, percent-clipped=7.0 2023-06-18 21:39:17,212 INFO [train.py:996] (2/4) Epoch 2, batch 7150, loss[loss=0.3152, simple_loss=0.3714, pruned_loss=0.1295, over 21591.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.339, pruned_loss=0.108, over 4261489.49 frames. 
], batch size: 389, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:39:30,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=225870.0, ans=15.0 2023-06-18 21:39:53,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=225930.0, ans=0.05 2023-06-18 21:40:25,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=225990.0, ans=0.0 2023-06-18 21:41:22,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=226170.0, ans=0.125 2023-06-18 21:41:23,738 INFO [train.py:996] (2/4) Epoch 2, batch 7200, loss[loss=0.2976, simple_loss=0.3363, pruned_loss=0.1294, over 21806.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3423, pruned_loss=0.1119, over 4263796.12 frames. ], batch size: 352, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:41:40,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=226170.0, ans=0.125 2023-06-18 21:41:56,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=226170.0, ans=0.125 2023-06-18 21:42:51,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=226350.0, ans=0.125 2023-06-18 21:43:04,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=226410.0, ans=0.1 2023-06-18 21:43:24,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-18 21:43:32,287 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.074e+02 3.664e+02 4.641e+02 7.244e+02, threshold=7.329e+02, percent-clipped=2.0 2023-06-18 21:43:32,310 INFO [train.py:996] (2/4) Epoch 2, batch 7250, loss[loss=0.3105, simple_loss=0.3269, pruned_loss=0.1471, over 21426.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3363, pruned_loss=0.1122, over 4265527.44 frames. ], batch size: 509, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:43:48,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=226470.0, ans=0.07 2023-06-18 21:44:57,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=226650.0, ans=0.0 2023-06-18 21:45:00,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=226650.0, ans=0.125 2023-06-18 21:45:22,353 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:45:26,298 INFO [train.py:996] (2/4) Epoch 2, batch 7300, loss[loss=0.2569, simple_loss=0.3017, pruned_loss=0.1061, over 21685.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3301, pruned_loss=0.1107, over 4260741.66 frames. 
], batch size: 417, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:45:35,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=226770.0, ans=0.1 2023-06-18 21:45:40,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=12.0 2023-06-18 21:46:06,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=226830.0, ans=0.125 2023-06-18 21:46:26,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=226890.0, ans=0.125 2023-06-18 21:47:36,821 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.896e+02 3.515e+02 4.271e+02 6.710e+02, threshold=7.030e+02, percent-clipped=0.0 2023-06-18 21:47:36,845 INFO [train.py:996] (2/4) Epoch 2, batch 7350, loss[loss=0.2832, simple_loss=0.3196, pruned_loss=0.1234, over 21200.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3268, pruned_loss=0.1103, over 4259004.55 frames. ], batch size: 608, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:47:43,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=227070.0, ans=0.0 2023-06-18 21:49:49,002 INFO [train.py:996] (2/4) Epoch 2, batch 7400, loss[loss=0.24, simple_loss=0.2926, pruned_loss=0.0937, over 21797.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3358, pruned_loss=0.1138, over 4261525.28 frames. ], batch size: 107, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:50:57,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-18 21:51:01,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=227490.0, ans=0.0 2023-06-18 21:51:04,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=227490.0, ans=0.125 2023-06-18 21:51:16,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=227550.0, ans=0.125 2023-06-18 21:51:19,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=227550.0, ans=0.0 2023-06-18 21:51:40,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.56 vs. limit=15.0 2023-06-18 21:52:10,485 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.891e+02 3.433e+02 4.412e+02 7.257e+02, threshold=6.867e+02, percent-clipped=1.0 2023-06-18 21:52:10,513 INFO [train.py:996] (2/4) Epoch 2, batch 7450, loss[loss=0.2767, simple_loss=0.3138, pruned_loss=0.1198, over 21246.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3356, pruned_loss=0.1119, over 4253607.67 frames. ], batch size: 144, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:52:17,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0 2023-06-18 21:54:12,223 INFO [train.py:996] (2/4) Epoch 2, batch 7500, loss[loss=0.2812, simple_loss=0.3588, pruned_loss=0.1018, over 21346.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3409, pruned_loss=0.1137, over 4256617.59 frames. 
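The scaling.py "Whitening" entries compare a per-module statistic ("metric") against a scheduled limit, flagging activations whose channel covariance has drifted far from white (isotropic); the exact statistic is defined in scaling.py. As an illustration only, the sketch below uses one common whiteness measure, the ratio of the mean squared covariance eigenvalue to the squared mean eigenvalue, which equals 1.0 for perfectly white features and grows as a few directions dominate; it is an assumed stand-in, not the formula behind the logged metric.

import torch

def whiteness_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations.

    Returns a covariance-anisotropy score: 1.0 when the channel covariance is
    proportional to the identity, larger when energy concentrates in a few
    directions (a hypothetical stand-in for the 'metric' in the log)."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)              # real eigenvalues, ascending
    return float((eigs ** 2).mean() / eigs.mean().clamp(min=1e-20) ** 2)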
], batch size: 211, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:54:39,962 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0 2023-06-18 21:55:17,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=228090.0, ans=0.125 2023-06-18 21:56:10,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=228210.0, ans=0.125 2023-06-18 21:56:22,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=228210.0, ans=0.2 2023-06-18 21:56:33,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.064e+02 3.636e+02 4.254e+02 7.923e+02, threshold=7.272e+02, percent-clipped=2.0 2023-06-18 21:56:33,971 INFO [train.py:996] (2/4) Epoch 2, batch 7550, loss[loss=0.2434, simple_loss=0.3322, pruned_loss=0.07725, over 21816.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3454, pruned_loss=0.1109, over 4254931.00 frames. ], batch size: 282, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:57:29,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=228390.0, ans=0.125 2023-06-18 21:58:35,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228510.0, ans=0.1 2023-06-18 21:58:51,190 INFO [train.py:996] (2/4) Epoch 2, batch 7600, loss[loss=0.2089, simple_loss=0.2855, pruned_loss=0.06616, over 21206.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3428, pruned_loss=0.1085, over 4259271.63 frames. ], batch size: 176, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:00:42,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=228750.0, ans=0.125 2023-06-18 22:00:43,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=228750.0, ans=0.04949747468305833 2023-06-18 22:01:06,612 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.201e+02 4.085e+02 5.282e+02 1.069e+03, threshold=8.169e+02, percent-clipped=6.0 2023-06-18 22:01:06,636 INFO [train.py:996] (2/4) Epoch 2, batch 7650, loss[loss=0.3086, simple_loss=0.3522, pruned_loss=0.1324, over 21883.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3432, pruned_loss=0.1113, over 4272146.82 frames. ], batch size: 332, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:01:09,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. 
limit=15.0 2023-06-18 22:01:30,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228930.0, ans=0.1 2023-06-18 22:01:31,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=228930.0, ans=0.0 2023-06-18 22:01:56,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=228990.0, ans=0.2 2023-06-18 22:02:42,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=229050.0, ans=0.125 2023-06-18 22:03:09,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=229110.0, ans=0.125 2023-06-18 22:03:09,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=229110.0, ans=0.125 2023-06-18 22:03:09,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=229110.0, ans=0.125 2023-06-18 22:03:23,246 INFO [train.py:996] (2/4) Epoch 2, batch 7700, loss[loss=0.3141, simple_loss=0.3657, pruned_loss=0.1312, over 21368.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3475, pruned_loss=0.1159, over 4282296.98 frames. ], batch size: 159, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:03:32,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.39 vs. limit=22.5 2023-06-18 22:03:36,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=229170.0, ans=0.125 2023-06-18 22:03:36,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.15 vs. limit=6.0 2023-06-18 22:04:09,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=229290.0, ans=0.125 2023-06-18 22:04:43,216 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=12.0 2023-06-18 22:05:04,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=229350.0, ans=0.025 2023-06-18 22:05:39,895 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.910e+02 3.418e+02 4.204e+02 6.924e+02, threshold=6.836e+02, percent-clipped=0.0 2023-06-18 22:05:39,918 INFO [train.py:996] (2/4) Epoch 2, batch 7750, loss[loss=0.369, simple_loss=0.4628, pruned_loss=0.1377, over 21205.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3528, pruned_loss=0.1173, over 4279291.82 frames. 
], batch size: 549, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:06:12,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=229470.0, ans=0.95 2023-06-18 22:06:29,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=229530.0, ans=0.0 2023-06-18 22:06:29,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=229530.0, ans=0.2 2023-06-18 22:06:50,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=229590.0, ans=0.125 2023-06-18 22:07:02,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=229650.0, ans=0.125 2023-06-18 22:07:03,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-18 22:07:18,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=229710.0, ans=0.125 2023-06-18 22:07:28,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=229710.0, ans=0.125 2023-06-18 22:07:38,205 INFO [train.py:996] (2/4) Epoch 2, batch 7800, loss[loss=0.3314, simple_loss=0.3916, pruned_loss=0.1356, over 21579.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3555, pruned_loss=0.1185, over 4272558.86 frames. ], batch size: 441, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:08:07,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-18 22:08:47,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229890.0, ans=0.1 2023-06-18 22:09:46,309 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.917e+02 3.531e+02 4.288e+02 7.064e+02, threshold=7.063e+02, percent-clipped=1.0 2023-06-18 22:09:46,333 INFO [train.py:996] (2/4) Epoch 2, batch 7850, loss[loss=0.2734, simple_loss=0.3427, pruned_loss=0.102, over 21794.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3477, pruned_loss=0.117, over 4278470.98 frames. ], batch size: 352, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:09:46,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=230070.0, ans=0.125 2023-06-18 22:09:55,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=230070.0, ans=0.0 2023-06-18 22:10:02,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=230130.0, ans=0.035 2023-06-18 22:10:38,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=230130.0, ans=0.0 2023-06-18 22:10:41,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=230190.0, ans=0.035 2023-06-18 22:10:42,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.25 vs. 
limit=22.5 2023-06-18 22:11:01,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=230250.0, ans=22.5 2023-06-18 22:11:53,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=230310.0, ans=0.125 2023-06-18 22:12:06,689 INFO [train.py:996] (2/4) Epoch 2, batch 7900, loss[loss=0.2446, simple_loss=0.2906, pruned_loss=0.09928, over 21878.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3423, pruned_loss=0.1151, over 4277511.99 frames. ], batch size: 98, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:12:38,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-18 22:13:26,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=230490.0, ans=0.2 2023-06-18 22:14:33,511 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.975e+02 3.296e+02 3.925e+02 7.503e+02, threshold=6.593e+02, percent-clipped=2.0 2023-06-18 22:14:33,535 INFO [train.py:996] (2/4) Epoch 2, batch 7950, loss[loss=0.2587, simple_loss=0.3306, pruned_loss=0.09343, over 21431.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3479, pruned_loss=0.1149, over 4278513.07 frames. ], batch size: 176, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:14:46,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=230670.0, ans=0.125 2023-06-18 22:15:33,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.76 vs. limit=22.5 2023-06-18 22:16:10,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2023-06-18 22:16:12,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=230850.0, ans=0.2 2023-06-18 22:16:56,562 INFO [train.py:996] (2/4) Epoch 2, batch 8000, loss[loss=0.3485, simple_loss=0.4012, pruned_loss=0.1479, over 21776.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3543, pruned_loss=0.1179, over 4266647.90 frames. ], batch size: 441, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:17:12,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0 2023-06-18 22:17:19,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-18 22:19:34,456 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.162e+02 3.821e+02 4.807e+02 7.380e+02, threshold=7.642e+02, percent-clipped=7.0 2023-06-18 22:19:34,482 INFO [train.py:996] (2/4) Epoch 2, batch 8050, loss[loss=0.4247, simple_loss=0.4684, pruned_loss=0.1906, over 21461.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3562, pruned_loss=0.1169, over 4264586.94 frames. 
], batch size: 507, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 22:19:36,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=231270.0, ans=0.125 2023-06-18 22:19:38,620 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.41 vs. limit=8.0 2023-06-18 22:20:32,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231330.0, ans=0.1 2023-06-18 22:20:42,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=231390.0, ans=0.125 2023-06-18 22:21:06,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=231450.0, ans=0.125 2023-06-18 22:21:21,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=231450.0, ans=0.125 2023-06-18 22:21:49,800 INFO [train.py:996] (2/4) Epoch 2, batch 8100, loss[loss=0.2889, simple_loss=0.3493, pruned_loss=0.1143, over 21793.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3549, pruned_loss=0.118, over 4270372.46 frames. ], batch size: 124, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 22:24:08,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=231810.0, ans=0.1 2023-06-18 22:24:25,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231810.0, ans=0.1 2023-06-18 22:24:25,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.92 vs. limit=10.0 2023-06-18 22:24:27,847 INFO [train.py:996] (2/4) Epoch 2, batch 8150, loss[loss=0.1635, simple_loss=0.2179, pruned_loss=0.05457, over 16694.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3599, pruned_loss=0.1195, over 4255304.78 frames. ], batch size: 60, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:24:33,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=231870.0, ans=0.04949747468305833 2023-06-18 22:24:34,095 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.988e+02 3.821e+02 5.220e+02 8.604e+02, threshold=7.643e+02, percent-clipped=3.0 2023-06-18 22:24:53,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0 2023-06-18 22:25:42,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=231990.0, ans=0.125 2023-06-18 22:25:57,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.87 vs. limit=15.0 2023-06-18 22:26:23,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=232110.0, ans=0.5 2023-06-18 22:26:27,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.55 vs. 
limit=15.0 2023-06-18 22:26:32,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232110.0, ans=0.1 2023-06-18 22:26:32,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.63 vs. limit=6.0 2023-06-18 22:26:36,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=232110.0, ans=0.0 2023-06-18 22:26:44,094 INFO [train.py:996] (2/4) Epoch 2, batch 8200, loss[loss=0.2837, simple_loss=0.3256, pruned_loss=0.1209, over 21612.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3501, pruned_loss=0.1149, over 4258076.63 frames. ], batch size: 415, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:27:21,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=232230.0, ans=0.125 2023-06-18 22:27:21,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=232230.0, ans=0.2 2023-06-18 22:27:53,944 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-18 22:28:14,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=15.0 2023-06-18 22:28:15,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=232350.0, ans=0.125 2023-06-18 22:28:33,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=232410.0, ans=0.125 2023-06-18 22:28:40,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=232410.0, ans=0.125 2023-06-18 22:28:43,248 INFO [train.py:996] (2/4) Epoch 2, batch 8250, loss[loss=0.2812, simple_loss=0.3638, pruned_loss=0.09934, over 21699.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3503, pruned_loss=0.1142, over 4255578.02 frames. ], batch size: 247, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:28:44,763 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.255e+02 3.894e+02 4.730e+02 9.572e+02, threshold=7.788e+02, percent-clipped=3.0 2023-06-18 22:28:59,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232470.0, ans=0.1 2023-06-18 22:30:02,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=232590.0, ans=0.125 2023-06-18 22:30:55,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=232710.0, ans=0.125 2023-06-18 22:31:07,390 INFO [train.py:996] (2/4) Epoch 2, batch 8300, loss[loss=0.3112, simple_loss=0.3855, pruned_loss=0.1185, over 21201.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.347, pruned_loss=0.1111, over 4257357.74 frames. 
], batch size: 548, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:31:23,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=232770.0, ans=0.125 2023-06-18 22:31:42,153 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:31:58,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=232830.0, ans=0.09899494936611666 2023-06-18 22:32:29,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.29 vs. limit=22.5 2023-06-18 22:32:32,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=232950.0, ans=0.0 2023-06-18 22:32:35,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=232950.0, ans=0.125 2023-06-18 22:33:23,339 INFO [train.py:996] (2/4) Epoch 2, batch 8350, loss[loss=0.2554, simple_loss=0.3221, pruned_loss=0.09432, over 21544.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3456, pruned_loss=0.1084, over 4253515.39 frames. ], batch size: 230, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:33:30,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.718e+02 3.193e+02 3.748e+02 6.520e+02, threshold=6.386e+02, percent-clipped=0.0 2023-06-18 22:33:36,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=233070.0, ans=0.0 2023-06-18 22:34:10,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=233130.0, ans=10.0 2023-06-18 22:35:52,759 INFO [train.py:996] (2/4) Epoch 2, batch 8400, loss[loss=0.2154, simple_loss=0.3, pruned_loss=0.06536, over 21375.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3446, pruned_loss=0.1055, over 4257194.83 frames. ], batch size: 194, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 22:38:00,540 INFO [train.py:996] (2/4) Epoch 2, batch 8450, loss[loss=0.325, simple_loss=0.3727, pruned_loss=0.1386, over 20771.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3427, pruned_loss=0.1053, over 4271192.18 frames. ], batch size: 608, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:38:02,007 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.629e+02 3.179e+02 4.022e+02 7.095e+02, threshold=6.359e+02, percent-clipped=3.0 2023-06-18 22:38:03,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=233670.0, ans=0.0 2023-06-18 22:39:18,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=233850.0, ans=0.1 2023-06-18 22:39:19,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-18 22:39:53,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=233910.0, ans=0.2 2023-06-18 22:39:55,618 INFO [train.py:996] (2/4) Epoch 2, batch 8500, loss[loss=0.262, simple_loss=0.3093, pruned_loss=0.1073, over 21767.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.339, pruned_loss=0.1082, over 4267493.34 frames. 
], batch size: 351, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:40:18,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=233970.0, ans=0.125 2023-06-18 22:40:54,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234030.0, ans=0.1 2023-06-18 22:40:58,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=234090.0, ans=0.125 2023-06-18 22:41:06,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-18 22:41:21,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=234150.0, ans=0.125 2023-06-18 22:41:23,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. limit=15.0 2023-06-18 22:41:24,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=234150.0, ans=0.2 2023-06-18 22:42:11,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=234210.0, ans=0.0 2023-06-18 22:42:28,184 INFO [train.py:996] (2/4) Epoch 2, batch 8550, loss[loss=0.3339, simple_loss=0.4087, pruned_loss=0.1295, over 21684.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3455, pruned_loss=0.1129, over 4269097.03 frames. ], batch size: 414, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:42:29,585 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.313e+02 3.994e+02 5.022e+02 7.456e+02, threshold=7.988e+02, percent-clipped=4.0 2023-06-18 22:43:37,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-18 22:43:41,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=234450.0, ans=0.035 2023-06-18 22:44:28,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=234510.0, ans=0.125 2023-06-18 22:44:41,773 INFO [train.py:996] (2/4) Epoch 2, batch 8600, loss[loss=0.2879, simple_loss=0.3331, pruned_loss=0.1213, over 21083.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.353, pruned_loss=0.1168, over 4272135.75 frames. ], batch size: 607, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:45:09,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=234570.0, ans=0.125 2023-06-18 22:45:19,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. 
limit=15.0 2023-06-18 22:45:34,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=234690.0, ans=0.125 2023-06-18 22:46:06,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=234750.0, ans=0.125 2023-06-18 22:46:06,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=234750.0, ans=0.125 2023-06-18 22:46:31,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=234750.0, ans=0.125 2023-06-18 22:47:05,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=234810.0, ans=0.125 2023-06-18 22:47:07,837 INFO [train.py:996] (2/4) Epoch 2, batch 8650, loss[loss=0.2447, simple_loss=0.3144, pruned_loss=0.0875, over 21796.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3585, pruned_loss=0.1163, over 4271210.50 frames. ], batch size: 124, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:47:14,948 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 3.220e+02 3.691e+02 4.621e+02 7.023e+02, threshold=7.382e+02, percent-clipped=0.0 2023-06-18 22:47:21,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=234870.0, ans=0.0 2023-06-18 22:47:50,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=234990.0, ans=0.0 2023-06-18 22:48:36,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=235110.0, ans=0.1 2023-06-18 22:48:49,042 INFO [train.py:996] (2/4) Epoch 2, batch 8700, loss[loss=0.2447, simple_loss=0.2972, pruned_loss=0.09606, over 21461.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3506, pruned_loss=0.113, over 4272787.87 frames. ], batch size: 131, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:49:41,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=235230.0, ans=0.04949747468305833 2023-06-18 22:51:10,603 INFO [train.py:996] (2/4) Epoch 2, batch 8750, loss[loss=0.2994, simple_loss=0.3428, pruned_loss=0.128, over 21242.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.349, pruned_loss=0.1137, over 4266210.79 frames. ], batch size: 159, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:51:12,168 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.975e+02 3.407e+02 4.208e+02 7.792e+02, threshold=6.814e+02, percent-clipped=1.0 2023-06-18 22:51:30,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.19 vs. limit=15.0 2023-06-18 22:52:47,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=235650.0, ans=0.0 2023-06-18 22:53:34,727 INFO [train.py:996] (2/4) Epoch 2, batch 8800, loss[loss=0.3119, simple_loss=0.4024, pruned_loss=0.1107, over 19855.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.356, pruned_loss=0.1169, over 4270407.54 frames. 
], batch size: 702, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:54:28,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=235890.0, ans=0.0 2023-06-18 22:55:52,235 INFO [train.py:996] (2/4) Epoch 2, batch 8850, loss[loss=0.2983, simple_loss=0.3685, pruned_loss=0.1141, over 21850.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3637, pruned_loss=0.1192, over 4271216.40 frames. ], batch size: 107, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:55:53,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.077e+02 3.659e+02 4.478e+02 7.244e+02, threshold=7.318e+02, percent-clipped=2.0 2023-06-18 22:55:58,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=236070.0, ans=0.125 2023-06-18 22:56:01,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.96 vs. limit=10.0 2023-06-18 22:56:26,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=236130.0, ans=0.125 2023-06-18 22:57:40,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=236310.0, ans=0.0 2023-06-18 22:57:57,912 INFO [train.py:996] (2/4) Epoch 2, batch 8900, loss[loss=0.2875, simple_loss=0.3584, pruned_loss=0.1083, over 21247.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3569, pruned_loss=0.1174, over 4272710.25 frames. ], batch size: 548, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:58:46,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=236490.0, ans=0.2 2023-06-18 22:59:59,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=236610.0, ans=0.0 2023-06-18 23:00:09,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=236610.0, ans=0.05 2023-06-18 23:00:24,382 INFO [train.py:996] (2/4) Epoch 2, batch 8950, loss[loss=0.3139, simple_loss=0.4121, pruned_loss=0.1079, over 21214.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3556, pruned_loss=0.1152, over 4273213.24 frames. ], batch size: 549, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:00:31,489 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.983e+02 3.644e+02 4.576e+02 8.464e+02, threshold=7.288e+02, percent-clipped=6.0 2023-06-18 23:00:58,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=236730.0, ans=0.125 2023-06-18 23:01:51,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=236910.0, ans=0.07 2023-06-18 23:01:53,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=236910.0, ans=0.125 2023-06-18 23:02:28,410 INFO [train.py:996] (2/4) Epoch 2, batch 9000, loss[loss=0.4004, simple_loss=0.4965, pruned_loss=0.1522, over 19801.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3489, pruned_loss=0.1144, over 4276446.86 frames. 
], batch size: 702, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:02:28,410 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 23:03:37,456 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2827, simple_loss=0.3814, pruned_loss=0.09199, over 1796401.00 frames. 2023-06-18 23:03:37,457 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-18 23:03:37,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=236970.0, ans=0.0 2023-06-18 23:04:09,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=237030.0, ans=0.0 2023-06-18 23:04:27,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=237090.0, ans=0.125 2023-06-18 23:05:10,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=237210.0, ans=0.125 2023-06-18 23:05:34,972 INFO [train.py:996] (2/4) Epoch 2, batch 9050, loss[loss=0.2366, simple_loss=0.2768, pruned_loss=0.09815, over 20776.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3436, pruned_loss=0.111, over 4269862.43 frames. ], batch size: 608, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:05:36,490 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.756e+02 3.216e+02 4.039e+02 6.653e+02, threshold=6.431e+02, percent-clipped=0.0 2023-06-18 23:06:01,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-06-18 23:06:25,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.84 vs. limit=22.5 2023-06-18 23:06:47,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-18 23:06:52,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=237390.0, ans=0.0 2023-06-18 23:07:03,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=237450.0, ans=0.125 2023-06-18 23:07:28,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=237510.0, ans=0.125 2023-06-18 23:07:47,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=237570.0, ans=0.0 2023-06-18 23:07:48,317 INFO [train.py:996] (2/4) Epoch 2, batch 9100, loss[loss=0.3167, simple_loss=0.3802, pruned_loss=0.1266, over 19869.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.352, pruned_loss=0.1155, over 4272310.83 frames. 
], batch size: 703, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:08:46,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=237690.0, ans=0.125 2023-06-18 23:08:48,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=237690.0, ans=0.0 2023-06-18 23:08:57,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=237690.0, ans=0.125 2023-06-18 23:09:02,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=237750.0, ans=0.0 2023-06-18 23:09:58,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=237810.0, ans=0.125 2023-06-18 23:10:08,454 INFO [train.py:996] (2/4) Epoch 2, batch 9150, loss[loss=0.2809, simple_loss=0.3622, pruned_loss=0.09976, over 21807.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3561, pruned_loss=0.112, over 4274056.77 frames. ], batch size: 282, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:10:18,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.926e+02 3.668e+02 4.524e+02 1.019e+03, threshold=7.337e+02, percent-clipped=2.0 2023-06-18 23:10:59,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=237930.0, ans=0.125 2023-06-18 23:11:09,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=237990.0, ans=0.07 2023-06-18 23:11:35,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-18 23:11:51,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=238050.0, ans=0.0 2023-06-18 23:12:05,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=238050.0, ans=0.2 2023-06-18 23:12:29,969 INFO [train.py:996] (2/4) Epoch 2, batch 9200, loss[loss=0.3092, simple_loss=0.3796, pruned_loss=0.1194, over 21741.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3585, pruned_loss=0.1103, over 4271349.50 frames. ], batch size: 351, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:12:32,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=12.0 2023-06-18 23:13:01,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=238230.0, ans=0.125 2023-06-18 23:14:40,579 INFO [train.py:996] (2/4) Epoch 2, batch 9250, loss[loss=0.2824, simple_loss=0.3444, pruned_loss=0.1102, over 21502.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.3592, pruned_loss=0.1137, over 4263203.73 frames. 
], batch size: 131, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:14:41,995 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.826e+02 3.250e+02 3.675e+02 5.270e+02, threshold=6.500e+02, percent-clipped=0.0 2023-06-18 23:14:45,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=238470.0, ans=0.0 2023-06-18 23:14:55,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=238470.0, ans=0.0 2023-06-18 23:15:07,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238530.0, ans=0.1 2023-06-18 23:15:41,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=238590.0, ans=0.125 2023-06-18 23:15:58,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-06-18 23:17:00,735 INFO [train.py:996] (2/4) Epoch 2, batch 9300, loss[loss=0.3083, simple_loss=0.355, pruned_loss=0.1308, over 21812.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3542, pruned_loss=0.1143, over 4267576.19 frames. ], batch size: 372, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:17:07,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238770.0, ans=0.1 2023-06-18 23:17:42,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=238830.0, ans=0.125 2023-06-18 23:17:48,593 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:17:49,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=238890.0, ans=0.025 2023-06-18 23:18:56,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=239010.0, ans=0.125 2023-06-18 23:19:09,906 INFO [train.py:996] (2/4) Epoch 2, batch 9350, loss[loss=0.3151, simple_loss=0.3788, pruned_loss=0.1257, over 21873.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3622, pruned_loss=0.1153, over 4268346.44 frames. ], batch size: 118, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:19:11,410 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 3.045e+02 3.493e+02 4.279e+02 7.856e+02, threshold=6.986e+02, percent-clipped=1.0 2023-06-18 23:20:00,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=239130.0, ans=0.0 2023-06-18 23:20:23,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=239190.0, ans=0.125 2023-06-18 23:20:55,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=239250.0, ans=10.0 2023-06-18 23:21:04,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=239250.0, ans=0.2 2023-06-18 23:21:29,682 INFO [train.py:996] (2/4) Epoch 2, batch 9400, loss[loss=0.2946, simple_loss=0.3365, pruned_loss=0.1263, over 20110.00 frames. ], tot_loss[loss=0.2976, simple_loss=0.3633, pruned_loss=0.1159, over 4267679.12 frames. 
], batch size: 702, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:22:10,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=239430.0, ans=0.2 2023-06-18 23:22:38,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=239490.0, ans=0.125 2023-06-18 23:23:21,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=239610.0, ans=0.0 2023-06-18 23:23:40,462 INFO [train.py:996] (2/4) Epoch 2, batch 9450, loss[loss=0.2369, simple_loss=0.2926, pruned_loss=0.09059, over 21149.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.353, pruned_loss=0.1138, over 4256198.60 frames. ], batch size: 176, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:23:41,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.730e+02 3.234e+02 3.774e+02 7.408e+02, threshold=6.469e+02, percent-clipped=1.0 2023-06-18 23:23:44,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-06-18 23:23:51,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=239670.0, ans=0.125 2023-06-18 23:24:11,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=239730.0, ans=0.125 2023-06-18 23:24:55,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239850.0, ans=0.1 2023-06-18 23:25:56,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=239910.0, ans=0.0 2023-06-18 23:26:05,437 INFO [train.py:996] (2/4) Epoch 2, batch 9500, loss[loss=0.2769, simple_loss=0.3373, pruned_loss=0.1083, over 21162.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3451, pruned_loss=0.1116, over 4260326.85 frames. ], batch size: 143, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:26:12,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=239970.0, ans=0.125 2023-06-18 23:27:19,307 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:27:58,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=240210.0, ans=0.125 2023-06-18 23:28:01,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=240210.0, ans=0.125 2023-06-18 23:28:19,626 INFO [train.py:996] (2/4) Epoch 2, batch 9550, loss[loss=0.3408, simple_loss=0.4014, pruned_loss=0.1402, over 21782.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3504, pruned_loss=0.1139, over 4267576.11 frames. ], batch size: 124, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:28:20,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.70 vs. 
limit=15.0 2023-06-18 23:28:21,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.886e+02 3.595e+02 5.008e+02 8.398e+02, threshold=7.190e+02, percent-clipped=11.0 2023-06-18 23:28:37,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=240270.0, ans=0.125 2023-06-18 23:29:33,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=240390.0, ans=0.0 2023-06-18 23:30:10,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=240510.0, ans=0.125 2023-06-18 23:30:15,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=240510.0, ans=0.125 2023-06-18 23:30:36,135 INFO [train.py:996] (2/4) Epoch 2, batch 9600, loss[loss=0.2574, simple_loss=0.3209, pruned_loss=0.09698, over 21658.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.353, pruned_loss=0.1158, over 4265403.22 frames. ], batch size: 230, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:30:38,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=240570.0, ans=0.125 2023-06-18 23:30:39,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=240570.0, ans=0.125 2023-06-18 23:32:13,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=240810.0, ans=0.2 2023-06-18 23:32:16,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=240810.0, ans=0.0 2023-06-18 23:32:43,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=240810.0, ans=0.0 2023-06-18 23:32:45,695 INFO [train.py:996] (2/4) Epoch 2, batch 9650, loss[loss=0.2815, simple_loss=0.3466, pruned_loss=0.1082, over 21941.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3531, pruned_loss=0.1163, over 4268433.27 frames. ], batch size: 316, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:33:00,241 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.899e+02 3.277e+02 4.400e+02 6.787e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-18 23:33:06,361 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-18 23:33:50,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-18 23:35:17,955 INFO [train.py:996] (2/4) Epoch 2, batch 9700, loss[loss=0.3203, simple_loss=0.4146, pruned_loss=0.113, over 20851.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3575, pruned_loss=0.1172, over 4275698.04 frames. ], batch size: 608, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:35:31,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=241170.0, ans=0.07 2023-06-18 23:36:11,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. 
limit=15.0 2023-06-18 23:36:46,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=241350.0, ans=0.2 2023-06-18 23:37:09,757 INFO [train.py:996] (2/4) Epoch 2, batch 9750, loss[loss=0.2994, simple_loss=0.3253, pruned_loss=0.1367, over 21333.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3499, pruned_loss=0.1149, over 4273128.04 frames. ], batch size: 473, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:37:11,135 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.764e+02 3.183e+02 3.672e+02 6.481e+02, threshold=6.367e+02, percent-clipped=0.0 2023-06-18 23:37:59,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=241590.0, ans=0.125 2023-06-18 23:38:02,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-18 23:38:49,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=241710.0, ans=0.0 2023-06-18 23:39:09,306 INFO [train.py:996] (2/4) Epoch 2, batch 9800, loss[loss=0.3073, simple_loss=0.3559, pruned_loss=0.1294, over 21427.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3496, pruned_loss=0.1153, over 4260600.16 frames. ], batch size: 194, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:39:58,724 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:40:17,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=241890.0, ans=0.2 2023-06-18 23:40:33,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=241950.0, ans=0.0 2023-06-18 23:40:37,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=242010.0, ans=0.0 2023-06-18 23:40:51,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=242010.0, ans=0.125 2023-06-18 23:41:00,960 INFO [train.py:996] (2/4) Epoch 2, batch 9850, loss[loss=0.2986, simple_loss=0.3407, pruned_loss=0.1282, over 21834.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3451, pruned_loss=0.1152, over 4264439.60 frames. ], batch size: 414, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:41:02,240 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 2.994e+02 3.598e+02 4.495e+02 7.134e+02, threshold=7.196e+02, percent-clipped=3.0 2023-06-18 23:41:50,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=242130.0, ans=0.125 2023-06-18 23:42:26,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=242250.0, ans=0.125 2023-06-18 23:43:09,483 INFO [train.py:996] (2/4) Epoch 2, batch 9900, loss[loss=0.2605, simple_loss=0.3021, pruned_loss=0.1094, over 21322.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3408, pruned_loss=0.1138, over 4260216.65 frames. 
], batch size: 144, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:43:35,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=242370.0, ans=0.125 2023-06-18 23:43:37,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=22.5 2023-06-18 23:43:56,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=242430.0, ans=0.0 2023-06-18 23:45:24,185 INFO [train.py:996] (2/4) Epoch 2, batch 9950, loss[loss=0.3134, simple_loss=0.3584, pruned_loss=0.1342, over 19926.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3445, pruned_loss=0.1177, over 4257393.57 frames. ], batch size: 702, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:45:25,636 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.032e+02 3.636e+02 4.504e+02 8.608e+02, threshold=7.273e+02, percent-clipped=3.0 2023-06-18 23:46:27,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=242850.0, ans=0.07 2023-06-18 23:46:59,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=242910.0, ans=0.04949747468305833 2023-06-18 23:47:25,780 INFO [train.py:996] (2/4) Epoch 2, batch 10000, loss[loss=0.3381, simple_loss=0.3853, pruned_loss=0.1455, over 21673.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3427, pruned_loss=0.1174, over 4261906.86 frames. ], batch size: 441, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:48:33,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243090.0, ans=0.1 2023-06-18 23:48:54,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=243150.0, ans=0.0 2023-06-18 23:49:46,145 INFO [train.py:996] (2/4) Epoch 2, batch 10050, loss[loss=0.2242, simple_loss=0.2875, pruned_loss=0.08042, over 21416.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3438, pruned_loss=0.1174, over 4262540.63 frames. ], batch size: 211, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:49:47,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.747e+02 3.338e+02 4.174e+02 6.778e+02, threshold=6.677e+02, percent-clipped=0.0 2023-06-18 23:50:15,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. limit=6.0 2023-06-18 23:50:37,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=243390.0, ans=0.2 2023-06-18 23:51:16,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=243510.0, ans=10.0 2023-06-18 23:51:29,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=243510.0, ans=0.07 2023-06-18 23:51:45,344 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.09 vs. 
limit=6.0 2023-06-18 23:51:50,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=243510.0, ans=0.0 2023-06-18 23:51:58,745 INFO [train.py:996] (2/4) Epoch 2, batch 10100, loss[loss=0.2397, simple_loss=0.2802, pruned_loss=0.09959, over 20881.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3398, pruned_loss=0.1145, over 4264520.63 frames. ], batch size: 613, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:52:36,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=243630.0, ans=0.2 2023-06-18 23:53:13,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=243690.0, ans=0.1 2023-06-18 23:53:34,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=22.5 2023-06-18 23:53:48,348 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:54:11,448 INFO [train.py:996] (2/4) Epoch 2, batch 10150, loss[loss=0.322, simple_loss=0.3897, pruned_loss=0.1272, over 19973.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3464, pruned_loss=0.1166, over 4267007.62 frames. ], batch size: 702, lr: 1.75e-02, grad_scale: 64.0 2023-06-18 23:54:12,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.705e+02 3.448e+02 4.239e+02 1.060e+03, threshold=6.897e+02, percent-clipped=5.0 2023-06-18 23:56:12,735 INFO [train.py:996] (2/4) Epoch 2, batch 10200, loss[loss=0.258, simple_loss=0.3393, pruned_loss=0.08838, over 21619.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3464, pruned_loss=0.1145, over 4271185.67 frames. ], batch size: 389, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:56:25,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-18 23:57:28,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=244290.0, ans=0.125 2023-06-18 23:58:24,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=244410.0, ans=0.125 2023-06-18 23:58:30,099 INFO [train.py:996] (2/4) Epoch 2, batch 10250, loss[loss=0.3118, simple_loss=0.3749, pruned_loss=0.1243, over 21315.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3402, pruned_loss=0.1078, over 4273044.54 frames. ], batch size: 143, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:58:31,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=244470.0, ans=0.0 2023-06-18 23:58:32,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.447e+02 2.831e+02 3.418e+02 6.746e+02, threshold=5.661e+02, percent-clipped=0.0 2023-06-18 23:58:33,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=244470.0, ans=0.0 2023-06-18 23:58:44,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=244530.0, ans=0.125 2023-06-19 00:00:27,730 INFO [train.py:996] (2/4) Epoch 2, batch 10300, loss[loss=0.2736, simple_loss=0.3611, pruned_loss=0.09305, over 21735.00 frames. 
], tot_loss[loss=0.278, simple_loss=0.342, pruned_loss=0.107, over 4277746.16 frames. ], batch size: 247, lr: 1.75e-02, grad_scale: 32.0 2023-06-19 00:00:40,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=244770.0, ans=0.125 2023-06-19 00:00:50,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=244830.0, ans=0.015 2023-06-19 00:01:00,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=244830.0, ans=0.0 2023-06-19 00:01:00,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=244830.0, ans=0.125 2023-06-19 00:01:22,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-19 00:02:13,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-19 00:02:37,381 INFO [train.py:996] (2/4) Epoch 2, batch 10350, loss[loss=0.3234, simple_loss=0.4306, pruned_loss=0.108, over 19732.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3444, pruned_loss=0.1064, over 4271268.77 frames. ], batch size: 702, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:02:38,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-06-19 00:02:45,450 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 2.780e+02 3.474e+02 4.368e+02 7.573e+02, threshold=6.948e+02, percent-clipped=10.0 2023-06-19 00:03:25,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=245190.0, ans=0.125 2023-06-19 00:04:35,181 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.35 vs. limit=15.0 2023-06-19 00:04:49,421 INFO [train.py:996] (2/4) Epoch 2, batch 10400, loss[loss=0.1888, simple_loss=0.239, pruned_loss=0.06927, over 21200.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3366, pruned_loss=0.1043, over 4266421.88 frames. ], batch size: 176, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:05:38,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=245490.0, ans=0.2 2023-06-19 00:05:51,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=245490.0, ans=0.0 2023-06-19 00:05:58,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=245550.0, ans=0.125 2023-06-19 00:06:53,247 INFO [train.py:996] (2/4) Epoch 2, batch 10450, loss[loss=0.2926, simple_loss=0.3682, pruned_loss=0.1085, over 21845.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3411, pruned_loss=0.1089, over 4265478.25 frames. 
], batch size: 316, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:07:07,319 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 3.343e+02 4.131e+02 5.392e+02 9.378e+02, threshold=8.262e+02, percent-clipped=3.0 2023-06-19 00:07:24,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-19 00:07:53,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=245730.0, ans=0.125 2023-06-19 00:08:05,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=245790.0, ans=0.125 2023-06-19 00:08:49,888 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-19 00:09:11,793 INFO [train.py:996] (2/4) Epoch 2, batch 10500, loss[loss=0.2611, simple_loss=0.3172, pruned_loss=0.1025, over 21820.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3391, pruned_loss=0.1074, over 4264375.66 frames. ], batch size: 102, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:10:15,366 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:10:28,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=246150.0, ans=0.125 2023-06-19 00:11:20,154 INFO [train.py:996] (2/4) Epoch 2, batch 10550, loss[loss=0.2744, simple_loss=0.3148, pruned_loss=0.117, over 21913.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3339, pruned_loss=0.1072, over 4254628.49 frames. ], batch size: 373, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:11:23,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.402e+02 2.798e+02 3.186e+02 5.857e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-19 00:11:38,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=246270.0, ans=0.5 2023-06-19 00:12:41,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=246450.0, ans=0.035 2023-06-19 00:12:41,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-19 00:13:15,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=246510.0, ans=0.2 2023-06-19 00:13:18,269 INFO [train.py:996] (2/4) Epoch 2, batch 10600, loss[loss=0.2215, simple_loss=0.3065, pruned_loss=0.06826, over 21723.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3296, pruned_loss=0.1051, over 4261072.26 frames. ], batch size: 247, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:13:26,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. 
limit=15.0 2023-06-19 00:14:03,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=246630.0, ans=0.125 2023-06-19 00:14:16,118 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:15:45,042 INFO [train.py:996] (2/4) Epoch 2, batch 10650, loss[loss=0.3375, simple_loss=0.3887, pruned_loss=0.1431, over 21451.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3337, pruned_loss=0.1041, over 4255918.45 frames. ], batch size: 507, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:15:55,451 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.840e+02 3.386e+02 4.217e+02 7.399e+02, threshold=6.773e+02, percent-clipped=7.0 2023-06-19 00:16:16,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=246870.0, ans=0.035 2023-06-19 00:16:43,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=246990.0, ans=0.125 2023-06-19 00:17:04,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=247050.0, ans=0.0 2023-06-19 00:17:22,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-19 00:18:07,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=247170.0, ans=0.125 2023-06-19 00:18:08,604 INFO [train.py:996] (2/4) Epoch 2, batch 10700, loss[loss=0.3309, simple_loss=0.3673, pruned_loss=0.1472, over 19864.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3324, pruned_loss=0.1054, over 4262968.64 frames. ], batch size: 702, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:19:19,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-19 00:19:28,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=247350.0, ans=0.0 2023-06-19 00:19:47,457 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.44 vs. limit=15.0 2023-06-19 00:20:08,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-19 00:20:09,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=247410.0, ans=0.125 2023-06-19 00:20:10,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=247410.0, ans=0.0 2023-06-19 00:20:15,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=15.0 2023-06-19 00:20:29,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=247470.0, ans=0.0 2023-06-19 00:20:30,657 INFO [train.py:996] (2/4) Epoch 2, batch 10750, loss[loss=0.3013, simple_loss=0.3815, pruned_loss=0.1106, over 21766.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3446, pruned_loss=0.1113, over 4262553.44 frames. 
], batch size: 332, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:20:33,615 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.858e+02 3.262e+02 4.061e+02 7.112e+02, threshold=6.525e+02, percent-clipped=1.0 2023-06-19 00:21:37,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=247590.0, ans=0.125 2023-06-19 00:21:56,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=247650.0, ans=0.125 2023-06-19 00:22:27,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247650.0, ans=0.1 2023-06-19 00:22:56,670 INFO [train.py:996] (2/4) Epoch 2, batch 10800, loss[loss=0.3736, simple_loss=0.4137, pruned_loss=0.1668, over 21416.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3513, pruned_loss=0.1125, over 4271622.94 frames. ], batch size: 471, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:22:57,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-19 00:22:58,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=247770.0, ans=0.125 2023-06-19 00:23:37,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=247830.0, ans=0.0 2023-06-19 00:24:03,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=247890.0, ans=0.0 2023-06-19 00:24:05,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-19 00:24:09,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=247890.0, ans=0.05 2023-06-19 00:24:19,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=247950.0, ans=0.1 2023-06-19 00:25:16,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=248070.0, ans=0.2 2023-06-19 00:25:17,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-19 00:25:17,896 INFO [train.py:996] (2/4) Epoch 2, batch 10850, loss[loss=0.2602, simple_loss=0.3187, pruned_loss=0.1009, over 22032.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3503, pruned_loss=0.112, over 4277115.64 frames. ], batch size: 103, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:25:18,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=248070.0, ans=0.125 2023-06-19 00:25:21,160 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.702e+02 3.119e+02 3.804e+02 6.070e+02, threshold=6.238e+02, percent-clipped=0.0 2023-06-19 00:25:33,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=248070.0, ans=0.0 2023-06-19 00:27:21,652 INFO [train.py:996] (2/4) Epoch 2, batch 10900, loss[loss=0.2504, simple_loss=0.3309, pruned_loss=0.08494, over 21568.00 frames. 
], tot_loss[loss=0.2811, simple_loss=0.3436, pruned_loss=0.1093, over 4270482.84 frames. ], batch size: 263, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:27:29,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.54 vs. limit=22.5 2023-06-19 00:28:05,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=248430.0, ans=0.125 2023-06-19 00:28:17,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=248490.0, ans=0.125 2023-06-19 00:28:28,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=248490.0, ans=0.125 2023-06-19 00:29:05,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.01 vs. limit=10.0 2023-06-19 00:29:17,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=248610.0, ans=0.0 2023-06-19 00:29:25,754 INFO [train.py:996] (2/4) Epoch 2, batch 10950, loss[loss=0.2942, simple_loss=0.3415, pruned_loss=0.1235, over 20051.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3377, pruned_loss=0.1069, over 4262171.32 frames. ], batch size: 703, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:29:28,627 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.875e+02 3.449e+02 4.246e+02 8.484e+02, threshold=6.899e+02, percent-clipped=4.0 2023-06-19 00:29:42,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=248670.0, ans=0.0 2023-06-19 00:30:19,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=248790.0, ans=0.2 2023-06-19 00:30:38,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=248790.0, ans=0.0 2023-06-19 00:31:43,321 INFO [train.py:996] (2/4) Epoch 2, batch 11000, loss[loss=0.2732, simple_loss=0.3147, pruned_loss=0.1158, over 21502.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3364, pruned_loss=0.1079, over 4262197.12 frames. ], batch size: 442, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:32:10,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-19 00:33:40,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=249210.0, ans=0.125 2023-06-19 00:33:47,310 INFO [train.py:996] (2/4) Epoch 2, batch 11050, loss[loss=0.2565, simple_loss=0.3387, pruned_loss=0.08711, over 20755.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3344, pruned_loss=0.1094, over 4267367.01 frames. 
], batch size: 607, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:33:50,071 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.804e+02 3.313e+02 4.113e+02 7.332e+02, threshold=6.626e+02, percent-clipped=1.0 2023-06-19 00:34:49,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=249390.0, ans=0.0 2023-06-19 00:35:07,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=249450.0, ans=0.125 2023-06-19 00:35:09,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-19 00:35:18,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=249450.0, ans=0.125 2023-06-19 00:35:28,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=249510.0, ans=0.2 2023-06-19 00:35:30,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=249510.0, ans=0.0 2023-06-19 00:35:41,938 INFO [train.py:996] (2/4) Epoch 2, batch 11100, loss[loss=0.2518, simple_loss=0.3016, pruned_loss=0.101, over 21253.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3324, pruned_loss=0.1097, over 4268968.30 frames. ], batch size: 144, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:37:08,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=249750.0, ans=0.0 2023-06-19 00:37:27,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-19 00:37:44,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=249870.0, ans=0.125 2023-06-19 00:37:45,291 INFO [train.py:996] (2/4) Epoch 2, batch 11150, loss[loss=0.2441, simple_loss=0.3159, pruned_loss=0.08611, over 21718.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3299, pruned_loss=0.1087, over 4269945.15 frames. ], batch size: 282, lr: 1.73e-02, grad_scale: 16.0 2023-06-19 00:37:49,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.743e+02 3.116e+02 3.616e+02 5.732e+02, threshold=6.232e+02, percent-clipped=1.0 2023-06-19 00:38:23,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=249930.0, ans=0.125 2023-06-19 00:38:26,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=249930.0, ans=0.1 2023-06-19 00:39:10,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=250050.0, ans=0.125 2023-06-19 00:39:29,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=250110.0, ans=0.2 2023-06-19 00:39:38,501 INFO [train.py:996] (2/4) Epoch 2, batch 11200, loss[loss=0.2674, simple_loss=0.3158, pruned_loss=0.1095, over 21794.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3283, pruned_loss=0.108, over 4263476.21 frames. 
], batch size: 118, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:39:43,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-19 00:40:32,954 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:40:37,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=250290.0, ans=0.125 2023-06-19 00:41:24,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250410.0, ans=0.1 2023-06-19 00:41:26,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=250410.0, ans=15.0 2023-06-19 00:41:29,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-19 00:41:52,953 INFO [train.py:996] (2/4) Epoch 2, batch 11250, loss[loss=0.2855, simple_loss=0.3325, pruned_loss=0.1193, over 21206.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3283, pruned_loss=0.1088, over 4267837.58 frames. ], batch size: 176, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:41:57,394 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.808e+02 3.237e+02 3.673e+02 7.595e+02, threshold=6.473e+02, percent-clipped=2.0 2023-06-19 00:42:21,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=250530.0, ans=0.125 2023-06-19 00:42:33,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=250530.0, ans=0.125 2023-06-19 00:42:42,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-19 00:42:46,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250590.0, ans=0.1 2023-06-19 00:43:09,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=250650.0, ans=0.125 2023-06-19 00:43:27,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.46 vs. limit=15.0 2023-06-19 00:43:47,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=250710.0, ans=0.0 2023-06-19 00:44:02,658 INFO [train.py:996] (2/4) Epoch 2, batch 11300, loss[loss=0.2693, simple_loss=0.3314, pruned_loss=0.1036, over 21865.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3296, pruned_loss=0.1082, over 4271586.32 frames. 
], batch size: 124, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:44:46,224 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:45:07,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=250890.0, ans=0.0 2023-06-19 00:45:14,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=250890.0, ans=0.125 2023-06-19 00:45:56,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=251010.0, ans=0.125 2023-06-19 00:45:59,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=251010.0, ans=0.2 2023-06-19 00:46:16,910 INFO [train.py:996] (2/4) Epoch 2, batch 11350, loss[loss=0.3518, simple_loss=0.4027, pruned_loss=0.1504, over 21569.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3306, pruned_loss=0.1082, over 4274365.70 frames. ], batch size: 389, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:46:17,423 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:46:23,546 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.745e+02 3.314e+02 3.869e+02 6.937e+02, threshold=6.629e+02, percent-clipped=3.0 2023-06-19 00:48:08,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=251250.0, ans=10.0 2023-06-19 00:48:31,005 INFO [train.py:996] (2/4) Epoch 2, batch 11400, loss[loss=0.2637, simple_loss=0.3234, pruned_loss=0.102, over 21108.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3388, pruned_loss=0.1119, over 4273090.03 frames. ], batch size: 143, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:48:40,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-19 00:49:01,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=251370.0, ans=0.125 2023-06-19 00:49:11,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=251430.0, ans=0.125 2023-06-19 00:49:17,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=251430.0, ans=0.05 2023-06-19 00:49:17,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=251430.0, ans=0.125 2023-06-19 00:49:35,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=251490.0, ans=0.125 2023-06-19 00:50:19,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=22.5 2023-06-19 00:50:59,194 INFO [train.py:996] (2/4) Epoch 2, batch 11450, loss[loss=0.288, simple_loss=0.3499, pruned_loss=0.1131, over 21777.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3425, pruned_loss=0.1126, over 4271838.17 frames. 
], batch size: 298, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:51:17,500 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.931e+02 3.430e+02 4.239e+02 8.118e+02, threshold=6.860e+02, percent-clipped=4.0 2023-06-19 00:52:04,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=251790.0, ans=0.0 2023-06-19 00:53:20,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251910.0, ans=0.1 2023-06-19 00:53:23,626 INFO [train.py:996] (2/4) Epoch 2, batch 11500, loss[loss=0.2454, simple_loss=0.3299, pruned_loss=0.08041, over 21869.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3457, pruned_loss=0.1135, over 4273098.17 frames. ], batch size: 316, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:53:24,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=251970.0, ans=0.125 2023-06-19 00:53:32,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=251970.0, ans=0.5 2023-06-19 00:55:15,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=252210.0, ans=0.125 2023-06-19 00:55:26,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=252210.0, ans=0.125 2023-06-19 00:55:26,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=252210.0, ans=0.125 2023-06-19 00:55:52,491 INFO [train.py:996] (2/4) Epoch 2, batch 11550, loss[loss=0.2448, simple_loss=0.3269, pruned_loss=0.08132, over 21646.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3517, pruned_loss=0.1139, over 4268327.74 frames. ], batch size: 263, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:56:10,468 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.930e+02 3.008e+02 3.655e+02 4.469e+02 7.997e+02, threshold=7.310e+02, percent-clipped=1.0 2023-06-19 00:56:14,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=252270.0, ans=0.0 2023-06-19 00:56:45,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=252390.0, ans=0.125 2023-06-19 00:57:17,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=252390.0, ans=0.125 2023-06-19 00:58:07,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=252510.0, ans=0.025 2023-06-19 00:58:11,605 INFO [train.py:996] (2/4) Epoch 2, batch 11600, loss[loss=0.3984, simple_loss=0.4665, pruned_loss=0.1651, over 21454.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3643, pruned_loss=0.1149, over 4268299.64 frames. ], batch size: 507, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 01:00:02,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=252810.0, ans=0.0 2023-06-19 01:00:26,315 INFO [train.py:996] (2/4) Epoch 2, batch 11650, loss[loss=0.3425, simple_loss=0.4091, pruned_loss=0.138, over 21587.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3679, pruned_loss=0.1133, over 4266381.94 frames. 
], batch size: 414, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 01:00:30,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.719e+02 3.568e+02 4.744e+02 9.384e+02, threshold=7.136e+02, percent-clipped=4.0 2023-06-19 01:00:45,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=252930.0, ans=0.0 2023-06-19 01:01:49,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=12.0 2023-06-19 01:02:01,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-19 01:02:25,802 INFO [train.py:996] (2/4) Epoch 2, batch 11700, loss[loss=0.2936, simple_loss=0.333, pruned_loss=0.1271, over 20064.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3602, pruned_loss=0.1126, over 4264474.40 frames. ], batch size: 702, lr: 1.72e-02, grad_scale: 16.0 2023-06-19 01:02:31,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=253170.0, ans=0.0 2023-06-19 01:02:33,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-19 01:02:49,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-19 01:03:56,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=253350.0, ans=0.0 2023-06-19 01:04:19,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=253410.0, ans=0.125 2023-06-19 01:04:22,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=253470.0, ans=0.125 2023-06-19 01:04:23,354 INFO [train.py:996] (2/4) Epoch 2, batch 11750, loss[loss=0.2755, simple_loss=0.3113, pruned_loss=0.1198, over 21644.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3507, pruned_loss=0.1126, over 4258294.04 frames. ], batch size: 445, lr: 1.72e-02, grad_scale: 16.0 2023-06-19 01:04:42,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 3.061e+02 3.639e+02 4.435e+02 7.294e+02, threshold=7.278e+02, percent-clipped=2.0 2023-06-19 01:05:13,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=253590.0, ans=0.125 2023-06-19 01:05:34,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=253590.0, ans=0.125 2023-06-19 01:06:07,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=253650.0, ans=0.125 2023-06-19 01:06:45,926 INFO [train.py:996] (2/4) Epoch 2, batch 11800, loss[loss=0.3233, simple_loss=0.4059, pruned_loss=0.1204, over 19742.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3536, pruned_loss=0.1158, over 4263165.11 frames. 
], batch size: 702, lr: 1.72e-02, grad_scale: 16.0 2023-06-19 01:07:41,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=253890.0, ans=0.125 2023-06-19 01:07:51,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=253890.0, ans=0.0 2023-06-19 01:08:03,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=253890.0, ans=0.0 2023-06-19 01:08:10,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=253890.0, ans=0.125 2023-06-19 01:08:28,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=22.5 2023-06-19 01:08:30,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=253950.0, ans=0.125 2023-06-19 01:08:45,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=254010.0, ans=15.0 2023-06-19 01:08:45,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.67 vs. limit=15.0 2023-06-19 01:08:48,976 INFO [train.py:996] (2/4) Epoch 2, batch 11850, loss[loss=0.2536, simple_loss=0.337, pruned_loss=0.08512, over 21799.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3545, pruned_loss=0.1137, over 4276661.53 frames. ], batch size: 332, lr: 1.71e-02, grad_scale: 16.0 2023-06-19 01:08:50,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=254070.0, ans=0.125 2023-06-19 01:09:00,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.827e+02 3.352e+02 4.207e+02 5.671e+02, threshold=6.705e+02, percent-clipped=0.0 2023-06-19 01:10:08,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=254190.0, ans=0.1 2023-06-19 01:10:21,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=254250.0, ans=0.125 2023-06-19 01:10:22,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.78 vs. limit=6.0 2023-06-19 01:10:31,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=254250.0, ans=0.0 2023-06-19 01:11:08,827 INFO [train.py:996] (2/4) Epoch 2, batch 11900, loss[loss=0.3388, simple_loss=0.3998, pruned_loss=0.139, over 21403.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.354, pruned_loss=0.1106, over 4274473.47 frames. 
], batch size: 507, lr: 1.71e-02, grad_scale: 16.0 2023-06-19 01:11:17,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=254370.0, ans=0.05 2023-06-19 01:12:53,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=254550.0, ans=0.125 2023-06-19 01:13:13,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=254610.0, ans=0.05 2023-06-19 01:13:30,195 INFO [train.py:996] (2/4) Epoch 2, batch 11950, loss[loss=0.2037, simple_loss=0.2821, pruned_loss=0.06265, over 21226.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3545, pruned_loss=0.107, over 4268426.52 frames. ], batch size: 176, lr: 1.71e-02, grad_scale: 16.0 2023-06-19 01:13:35,898 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.552e+02 2.960e+02 3.499e+02 5.476e+02, threshold=5.920e+02, percent-clipped=0.0 2023-06-19 01:14:29,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=254790.0, ans=0.0 2023-06-19 01:15:47,314 INFO [train.py:996] (2/4) Epoch 2, batch 12000, loss[loss=0.2543, simple_loss=0.2962, pruned_loss=0.1062, over 19958.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3524, pruned_loss=0.1058, over 4268482.35 frames. ], batch size: 702, lr: 1.71e-02, grad_scale: 32.0 2023-06-19 01:15:47,314 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 01:16:36,944 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.7091, 5.8669, 5.6079, 5.2716], device='cuda:2') 2023-06-19 01:16:40,671 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2909, simple_loss=0.3809, pruned_loss=0.1004, over 1796401.00 frames. 2023-06-19 01:16:40,672 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 01:16:45,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=254970.0, ans=0.125 2023-06-19 01:16:54,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=254970.0, ans=0.2 2023-06-19 01:17:05,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=255030.0, ans=10.0 2023-06-19 01:17:15,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=255090.0, ans=0.1 2023-06-19 01:17:45,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=22.5 2023-06-19 01:18:40,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=255210.0, ans=0.1 2023-06-19 01:18:42,665 INFO [train.py:996] (2/4) Epoch 2, batch 12050, loss[loss=0.3158, simple_loss=0.3745, pruned_loss=0.1285, over 21447.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3512, pruned_loss=0.1089, over 4277714.82 frames. 
], batch size: 131, lr: 1.71e-02, grad_scale: 32.0 2023-06-19 01:18:57,276 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.833e+02 3.629e+02 4.945e+02 8.634e+02, threshold=7.258e+02, percent-clipped=13.0 2023-06-19 01:19:05,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=255330.0, ans=0.0 2023-06-19 01:19:43,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=255390.0, ans=22.5 2023-06-19 01:19:45,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=255390.0, ans=0.0 2023-06-19 01:20:19,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-19 01:20:44,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=255510.0, ans=0.0 2023-06-19 01:20:48,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=255510.0, ans=0.0 2023-06-19 01:20:56,934 INFO [train.py:996] (2/4) Epoch 2, batch 12100, loss[loss=0.2998, simple_loss=0.35, pruned_loss=0.1248, over 21786.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.356, pruned_loss=0.1139, over 4277088.93 frames. ], batch size: 247, lr: 1.71e-02, grad_scale: 32.0 2023-06-19 01:20:57,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=255570.0, ans=0.05 2023-06-19 01:21:06,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=255570.0, ans=0.2 2023-06-19 01:21:33,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=255630.0, ans=0.05 2023-06-19 01:21:52,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=255630.0, ans=0.125 2023-06-19 01:22:16,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-06-19 01:22:34,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=255750.0, ans=0.0 2023-06-19 01:23:23,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.91 vs. limit=10.0 2023-06-19 01:23:41,032 INFO [train.py:996] (2/4) Epoch 2, batch 12150, loss[loss=0.308, simple_loss=0.3926, pruned_loss=0.1116, over 21264.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3575, pruned_loss=0.114, over 4278580.85 frames. 
], batch size: 548, lr: 1.71e-02, grad_scale: 32.0 2023-06-19 01:23:50,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.161e+02 4.078e+02 5.260e+02 8.280e+02, threshold=8.155e+02, percent-clipped=4.0 2023-06-19 01:23:59,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=255870.0, ans=0.125 2023-06-19 01:24:06,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=255930.0, ans=0.0 2023-06-19 01:24:18,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-19 01:24:49,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=255990.0, ans=0.0 2023-06-19 01:25:35,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-19 01:25:40,319 INFO [train.py:996] (2/4) Epoch 2, batch 12200, loss[loss=0.2793, simple_loss=0.3319, pruned_loss=0.1134, over 21629.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3541, pruned_loss=0.1127, over 4271129.81 frames. ], batch size: 332, lr: 1.71e-02, grad_scale: 32.0 2023-06-19 01:26:03,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=256170.0, ans=0.04949747468305833 2023-06-19 01:26:50,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-19 01:26:57,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.94 vs. limit=15.0 2023-06-19 01:27:37,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=256410.0, ans=0.07 2023-06-19 01:27:37,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256410.0, ans=0.1 2023-06-19 01:27:49,616 INFO [train.py:996] (2/4) Epoch 2, batch 12250, loss[loss=0.1897, simple_loss=0.2418, pruned_loss=0.06878, over 21787.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3455, pruned_loss=0.108, over 4268113.64 frames. ], batch size: 107, lr: 1.71e-02, grad_scale: 16.0 2023-06-19 01:28:02,134 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.815e+02 3.229e+02 3.820e+02 6.594e+02, threshold=6.459e+02, percent-clipped=0.0 2023-06-19 01:28:48,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=256590.0, ans=0.2 2023-06-19 01:29:02,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=256590.0, ans=0.125 2023-06-19 01:29:29,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.09 vs. 
limit=15.0 2023-06-19 01:29:38,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=256710.0, ans=0.125 2023-06-19 01:29:57,178 INFO [train.py:996] (2/4) Epoch 2, batch 12300, loss[loss=0.2454, simple_loss=0.3158, pruned_loss=0.08753, over 21497.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3336, pruned_loss=0.09897, over 4268247.76 frames. ], batch size: 471, lr: 1.71e-02, grad_scale: 16.0 2023-06-19 01:30:35,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=256830.0, ans=0.0 2023-06-19 01:31:29,540 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-19 01:32:13,373 INFO [train.py:996] (2/4) Epoch 2, batch 12350, loss[loss=0.3282, simple_loss=0.387, pruned_loss=0.1347, over 21852.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3374, pruned_loss=0.09994, over 4270083.48 frames. ], batch size: 351, lr: 1.70e-02, grad_scale: 16.0 2023-06-19 01:32:25,741 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.742e+02 3.231e+02 4.296e+02 8.197e+02, threshold=6.463e+02, percent-clipped=4.0 2023-06-19 01:33:32,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=257190.0, ans=0.125 2023-06-19 01:34:00,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-19 01:34:10,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-19 01:34:14,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-19 01:34:17,960 INFO [train.py:996] (2/4) Epoch 2, batch 12400, loss[loss=0.282, simple_loss=0.3418, pruned_loss=0.1111, over 21197.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3402, pruned_loss=0.1041, over 4282029.65 frames. ], batch size: 176, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:34:49,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=257370.0, ans=0.0 2023-06-19 01:35:15,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257430.0, ans=0.1 2023-06-19 01:35:34,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257490.0, ans=0.1 2023-06-19 01:36:09,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=257550.0, ans=0.0 2023-06-19 01:36:09,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=257550.0, ans=0.0 2023-06-19 01:37:04,444 INFO [train.py:996] (2/4) Epoch 2, batch 12450, loss[loss=0.3448, simple_loss=0.4107, pruned_loss=0.1394, over 21276.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3467, pruned_loss=0.1103, over 4286842.61 frames. 
], batch size: 159, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:37:11,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=257670.0, ans=0.0 2023-06-19 01:37:12,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.73 vs. limit=15.0 2023-06-19 01:37:17,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.991e+02 3.683e+02 4.445e+02 7.854e+02, threshold=7.366e+02, percent-clipped=4.0 2023-06-19 01:37:46,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257790.0, ans=0.1 2023-06-19 01:37:58,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=257790.0, ans=0.125 2023-06-19 01:38:50,221 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-19 01:39:14,094 INFO [train.py:996] (2/4) Epoch 2, batch 12500, loss[loss=0.3889, simple_loss=0.439, pruned_loss=0.1694, over 21502.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3585, pruned_loss=0.1164, over 4288876.38 frames. ], batch size: 471, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:39:49,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=258030.0, ans=0.125 2023-06-19 01:40:12,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.99 vs. limit=15.0 2023-06-19 01:40:13,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=258030.0, ans=0.2 2023-06-19 01:40:41,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=258150.0, ans=0.125 2023-06-19 01:41:20,540 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-19 01:41:23,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=258210.0, ans=0.015 2023-06-19 01:41:33,987 INFO [train.py:996] (2/4) Epoch 2, batch 12550, loss[loss=0.3013, simple_loss=0.3801, pruned_loss=0.1112, over 21322.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3643, pruned_loss=0.1187, over 4286166.76 frames. ], batch size: 549, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:41:47,027 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.352e+02 3.726e+02 4.299e+02 8.195e+02, threshold=7.451e+02, percent-clipped=1.0 2023-06-19 01:42:33,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2023-06-19 01:43:39,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=258510.0, ans=0.125 2023-06-19 01:44:05,953 INFO [train.py:996] (2/4) Epoch 2, batch 12600, loss[loss=0.2562, simple_loss=0.3204, pruned_loss=0.096, over 20762.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3618, pruned_loss=0.1151, over 4279594.39 frames. 
], batch size: 608, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:44:25,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=258630.0, ans=0.125 2023-06-19 01:45:33,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=258750.0, ans=0.125 2023-06-19 01:46:09,508 INFO [train.py:996] (2/4) Epoch 2, batch 12650, loss[loss=0.3139, simple_loss=0.3616, pruned_loss=0.1331, over 21671.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3532, pruned_loss=0.1105, over 4274152.43 frames. ], batch size: 473, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:46:28,022 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.520e+02 3.258e+02 4.283e+02 8.969e+02, threshold=6.516e+02, percent-clipped=1.0 2023-06-19 01:47:27,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=259050.0, ans=0.04949747468305833 2023-06-19 01:47:28,535 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-06-19 01:47:49,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=259050.0, ans=0.05 2023-06-19 01:48:21,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=259170.0, ans=0.125 2023-06-19 01:48:22,471 INFO [train.py:996] (2/4) Epoch 2, batch 12700, loss[loss=0.2892, simple_loss=0.3455, pruned_loss=0.1164, over 21641.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3534, pruned_loss=0.1129, over 4283076.38 frames. ], batch size: 230, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:49:01,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-19 01:49:47,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=259350.0, ans=0.125 2023-06-19 01:50:18,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=259410.0, ans=0.125 2023-06-19 01:50:33,914 INFO [train.py:996] (2/4) Epoch 2, batch 12750, loss[loss=0.277, simple_loss=0.3426, pruned_loss=0.1057, over 21789.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3578, pruned_loss=0.1142, over 4287338.05 frames. ], batch size: 351, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:50:59,770 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.841e+02 3.473e+02 4.427e+02 7.212e+02, threshold=6.945e+02, percent-clipped=3.0 2023-06-19 01:51:00,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=259470.0, ans=0.035 2023-06-19 01:51:07,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=259470.0, ans=0.0 2023-06-19 01:51:48,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.99 vs. 
limit=22.5 2023-06-19 01:51:51,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=259590.0, ans=0.125 2023-06-19 01:51:51,074 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:52:59,509 INFO [train.py:996] (2/4) Epoch 2, batch 12800, loss[loss=0.3299, simple_loss=0.3939, pruned_loss=0.1329, over 21855.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3549, pruned_loss=0.1136, over 4287904.32 frames. ], batch size: 118, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:52:59,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=259770.0, ans=0.0 2023-06-19 01:53:12,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.05 vs. limit=6.0 2023-06-19 01:53:53,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-19 01:54:08,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=259950.0, ans=0.125 2023-06-19 01:54:58,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260010.0, ans=0.1 2023-06-19 01:55:03,239 INFO [train.py:996] (2/4) Epoch 2, batch 12850, loss[loss=0.2815, simple_loss=0.3737, pruned_loss=0.09467, over 19913.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3576, pruned_loss=0.1169, over 4289215.48 frames. ], batch size: 703, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 01:55:12,226 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.824e+02 3.270e+02 4.199e+02 6.279e+02, threshold=6.541e+02, percent-clipped=0.0 2023-06-19 01:55:39,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=22.5 2023-06-19 01:57:11,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=260310.0, ans=0.125 2023-06-19 01:57:21,533 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:57:28,199 INFO [train.py:996] (2/4) Epoch 2, batch 12900, loss[loss=0.2665, simple_loss=0.3432, pruned_loss=0.09492, over 21747.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3534, pruned_loss=0.1114, over 4278416.28 frames. ], batch size: 351, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 01:59:39,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-19 01:59:54,192 INFO [train.py:996] (2/4) Epoch 2, batch 12950, loss[loss=0.3181, simple_loss=0.3753, pruned_loss=0.1304, over 21709.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3535, pruned_loss=0.109, over 4272407.26 frames. 
], batch size: 298, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 01:59:54,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=260670.0, ans=0.125 2023-06-19 02:00:01,777 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.844e+02 3.449e+02 4.156e+02 6.439e+02, threshold=6.898e+02, percent-clipped=0.0 2023-06-19 02:00:18,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=260730.0, ans=0.125 2023-06-19 02:01:31,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.57 vs. limit=15.0 2023-06-19 02:01:54,510 INFO [train.py:996] (2/4) Epoch 2, batch 13000, loss[loss=0.1809, simple_loss=0.2488, pruned_loss=0.05653, over 21098.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3503, pruned_loss=0.1077, over 4261722.10 frames. ], batch size: 143, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:01:57,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=22.5 2023-06-19 02:02:10,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=261030.0, ans=0.125 2023-06-19 02:02:11,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=261030.0, ans=0.0 2023-06-19 02:03:28,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=261210.0, ans=15.0 2023-06-19 02:03:43,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-19 02:04:04,221 INFO [train.py:996] (2/4) Epoch 2, batch 13050, loss[loss=0.2745, simple_loss=0.3314, pruned_loss=0.1088, over 21919.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3443, pruned_loss=0.105, over 4271133.49 frames. ], batch size: 316, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:04:08,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=22.5 2023-06-19 02:04:11,588 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.722e+02 3.295e+02 4.146e+02 8.681e+02, threshold=6.589e+02, percent-clipped=5.0 2023-06-19 02:04:15,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=261270.0, ans=0.125 2023-06-19 02:05:42,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=261450.0, ans=0.125 2023-06-19 02:05:42,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.31 vs. 
limit=22.5 2023-06-19 02:05:45,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=261510.0, ans=0.125 2023-06-19 02:06:08,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=261570.0, ans=0.125 2023-06-19 02:06:09,572 INFO [train.py:996] (2/4) Epoch 2, batch 13100, loss[loss=0.2887, simple_loss=0.3544, pruned_loss=0.1115, over 21788.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3442, pruned_loss=0.1048, over 4275835.39 frames. ], batch size: 247, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:06:52,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-19 02:07:20,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=261690.0, ans=0.125 2023-06-19 02:07:51,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=261750.0, ans=0.2 2023-06-19 02:08:22,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=261810.0, ans=0.0 2023-06-19 02:08:32,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=261810.0, ans=0.0 2023-06-19 02:08:39,311 INFO [train.py:996] (2/4) Epoch 2, batch 13150, loss[loss=0.3066, simple_loss=0.3648, pruned_loss=0.1242, over 21359.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3488, pruned_loss=0.1089, over 4273546.74 frames. ], batch size: 471, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:08:46,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.844e+02 3.521e+02 4.321e+02 7.421e+02, threshold=7.042e+02, percent-clipped=2.0 2023-06-19 02:08:47,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=261870.0, ans=0.125 2023-06-19 02:08:47,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=261870.0, ans=0.125 2023-06-19 02:08:48,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.31 vs. limit=22.5 2023-06-19 02:08:51,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=261870.0, ans=0.0 2023-06-19 02:09:11,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-19 02:10:12,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=262050.0, ans=0.04949747468305833 2023-06-19 02:10:35,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=262110.0, ans=0.125 2023-06-19 02:10:43,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-19 02:10:44,896 INFO [train.py:996] (2/4) Epoch 2, batch 13200, loss[loss=0.2808, simple_loss=0.3454, pruned_loss=0.1081, over 21268.00 frames. 
], tot_loss[loss=0.2837, simple_loss=0.3487, pruned_loss=0.1093, over 4274502.47 frames. ], batch size: 143, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:11:19,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-19 02:12:31,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=262350.0, ans=0.125 2023-06-19 02:12:59,890 INFO [train.py:996] (2/4) Epoch 2, batch 13250, loss[loss=0.2715, simple_loss=0.3324, pruned_loss=0.1053, over 21806.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3485, pruned_loss=0.111, over 4275215.90 frames. ], batch size: 112, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:13:04,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=262470.0, ans=0.2 2023-06-19 02:13:09,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.771e+02 3.301e+02 4.204e+02 7.419e+02, threshold=6.603e+02, percent-clipped=1.0 2023-06-19 02:14:19,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=262590.0, ans=0.2 2023-06-19 02:14:27,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=262590.0, ans=0.2 2023-06-19 02:15:30,031 INFO [train.py:996] (2/4) Epoch 2, batch 13300, loss[loss=0.2582, simple_loss=0.3291, pruned_loss=0.09365, over 21746.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3517, pruned_loss=0.1117, over 4272792.30 frames. ], batch size: 247, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:15:39,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=262770.0, ans=22.5 2023-06-19 02:16:03,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=262830.0, ans=0.2 2023-06-19 02:17:18,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=262950.0, ans=0.125 2023-06-19 02:17:22,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-19 02:17:43,423 INFO [train.py:996] (2/4) Epoch 2, batch 13350, loss[loss=0.2576, simple_loss=0.3185, pruned_loss=0.09834, over 16080.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3551, pruned_loss=0.1149, over 4269634.67 frames. ], batch size: 60, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:18:00,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.775e+02 3.449e+02 3.875e+02 6.020e+02, threshold=6.898e+02, percent-clipped=0.0 2023-06-19 02:18:18,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2023-06-19 02:18:21,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=263130.0, ans=0.125 2023-06-19 02:18:23,923 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.49 vs. 
limit=15.0 2023-06-19 02:18:36,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=263190.0, ans=0.125 2023-06-19 02:19:49,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=263310.0, ans=0.05 2023-06-19 02:20:06,312 INFO [train.py:996] (2/4) Epoch 2, batch 13400, loss[loss=0.3004, simple_loss=0.3538, pruned_loss=0.1235, over 21414.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3559, pruned_loss=0.1174, over 4271296.01 frames. ], batch size: 194, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:20:21,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=263370.0, ans=0.125 2023-06-19 02:20:25,422 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:20:40,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263430.0, ans=0.1 2023-06-19 02:21:11,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=263490.0, ans=0.0 2023-06-19 02:22:32,121 INFO [train.py:996] (2/4) Epoch 2, batch 13450, loss[loss=0.3204, simple_loss=0.3706, pruned_loss=0.1352, over 21530.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3584, pruned_loss=0.12, over 4275048.79 frames. ], batch size: 389, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:22:39,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.392e+02 3.379e+02 3.829e+02 4.469e+02 7.112e+02, threshold=7.658e+02, percent-clipped=1.0 2023-06-19 02:22:51,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=263730.0, ans=0.1 2023-06-19 02:23:00,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=263730.0, ans=0.125 2023-06-19 02:23:20,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=263790.0, ans=0.125 2023-06-19 02:24:40,494 INFO [train.py:996] (2/4) Epoch 2, batch 13500, loss[loss=0.2218, simple_loss=0.279, pruned_loss=0.08228, over 21484.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3475, pruned_loss=0.1148, over 4260782.63 frames. ], batch size: 195, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:24:44,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=263970.0, ans=0.125 2023-06-19 02:24:52,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=263970.0, ans=0.0 2023-06-19 02:25:04,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=264030.0, ans=0.2 2023-06-19 02:25:04,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.10 vs. 
limit=15.0 2023-06-19 02:25:11,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264030.0, ans=0.1 2023-06-19 02:25:17,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264090.0, ans=0.1 2023-06-19 02:26:14,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=264150.0, ans=0.125 2023-06-19 02:26:43,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=264210.0, ans=0.125 2023-06-19 02:26:43,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=264210.0, ans=0.0 2023-06-19 02:26:46,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264210.0, ans=0.1 2023-06-19 02:26:50,569 INFO [train.py:996] (2/4) Epoch 2, batch 13550, loss[loss=0.2784, simple_loss=0.3622, pruned_loss=0.09725, over 21238.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3512, pruned_loss=0.1133, over 4266566.12 frames. ], batch size: 176, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:26:51,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=22.5 2023-06-19 02:27:04,667 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.944e+02 3.381e+02 4.408e+02 7.046e+02, threshold=6.762e+02, percent-clipped=0.0 2023-06-19 02:27:06,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264270.0, ans=0.1 2023-06-19 02:27:48,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-19 02:28:14,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.57 vs. limit=22.5 2023-06-19 02:28:18,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=264450.0, ans=0.04949747468305833 2023-06-19 02:28:24,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264450.0, ans=0.1 2023-06-19 02:29:05,566 INFO [train.py:996] (2/4) Epoch 2, batch 13600, loss[loss=0.2946, simple_loss=0.3559, pruned_loss=0.1167, over 21493.00 frames. ], tot_loss[loss=0.2913, simple_loss=0.3527, pruned_loss=0.1149, over 4275145.41 frames. ], batch size: 548, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:29:06,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264570.0, ans=0.1 2023-06-19 02:29:43,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-19 02:31:29,058 INFO [train.py:996] (2/4) Epoch 2, batch 13650, loss[loss=0.2294, simple_loss=0.2838, pruned_loss=0.08754, over 21587.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3472, pruned_loss=0.1111, over 4266939.76 frames. 
], batch size: 247, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:31:29,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=264870.0, ans=0.125 2023-06-19 02:31:36,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=264870.0, ans=0.0 2023-06-19 02:31:43,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.727e+02 3.110e+02 3.674e+02 7.098e+02, threshold=6.220e+02, percent-clipped=1.0 2023-06-19 02:31:46,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=264870.0, ans=10.0 2023-06-19 02:32:10,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264930.0, ans=0.1 2023-06-19 02:32:38,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=264990.0, ans=0.125 2023-06-19 02:33:09,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.17 vs. limit=10.0 2023-06-19 02:33:15,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265050.0, ans=0.1 2023-06-19 02:33:48,005 INFO [train.py:996] (2/4) Epoch 2, batch 13700, loss[loss=0.2576, simple_loss=0.3139, pruned_loss=0.1006, over 21699.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.341, pruned_loss=0.1106, over 4260755.39 frames. ], batch size: 263, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:33:58,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265170.0, ans=0.1 2023-06-19 02:34:32,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=265230.0, ans=0.125 2023-06-19 02:34:34,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-19 02:35:30,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=265350.0, ans=0.125 2023-06-19 02:35:32,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=265350.0, ans=0.0 2023-06-19 02:35:53,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=265410.0, ans=0.125 2023-06-19 02:36:07,353 INFO [train.py:996] (2/4) Epoch 2, batch 13750, loss[loss=0.2331, simple_loss=0.2909, pruned_loss=0.08764, over 21545.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3385, pruned_loss=0.1089, over 4266448.95 frames. 
], batch size: 195, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:36:07,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=265470.0, ans=0.0 2023-06-19 02:36:26,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.182e+02 3.904e+02 5.015e+02 8.772e+02, threshold=7.809e+02, percent-clipped=9.0 2023-06-19 02:36:31,007 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-19 02:37:09,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=265530.0, ans=0.125 2023-06-19 02:38:07,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=265650.0, ans=0.125 2023-06-19 02:38:42,202 INFO [train.py:996] (2/4) Epoch 2, batch 13800, loss[loss=0.2922, simple_loss=0.3818, pruned_loss=0.1012, over 21768.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3436, pruned_loss=0.1075, over 4265882.90 frames. ], batch size: 282, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:39:57,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-19 02:40:00,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.59 vs. limit=22.5 2023-06-19 02:40:59,929 INFO [train.py:996] (2/4) Epoch 2, batch 13850, loss[loss=0.3451, simple_loss=0.4111, pruned_loss=0.1396, over 21303.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3505, pruned_loss=0.1096, over 4259731.27 frames. ], batch size: 548, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:41:38,741 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.862e+02 3.371e+02 4.001e+02 6.906e+02, threshold=6.742e+02, percent-clipped=0.0 2023-06-19 02:41:53,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=266130.0, ans=0.0 2023-06-19 02:43:28,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-19 02:43:32,051 INFO [train.py:996] (2/4) Epoch 2, batch 13900, loss[loss=0.291, simple_loss=0.3459, pruned_loss=0.118, over 21881.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.356, pruned_loss=0.1138, over 4266506.62 frames. 
], batch size: 351, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:43:33,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=266370.0, ans=0.125 2023-06-19 02:44:04,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=266430.0, ans=0.125 2023-06-19 02:44:10,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=266430.0, ans=0.0 2023-06-19 02:44:13,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=266430.0, ans=0.2 2023-06-19 02:44:13,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266430.0, ans=0.1 2023-06-19 02:44:35,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=266490.0, ans=0.125 2023-06-19 02:44:55,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266550.0, ans=0.1 2023-06-19 02:45:01,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=266550.0, ans=0.125 2023-06-19 02:45:37,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=266610.0, ans=0.025 2023-06-19 02:45:48,058 INFO [train.py:996] (2/4) Epoch 2, batch 13950, loss[loss=0.2849, simple_loss=0.3439, pruned_loss=0.113, over 21911.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3558, pruned_loss=0.1145, over 4268746.17 frames. ], batch size: 316, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:46:01,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 3.266e+02 3.802e+02 5.286e+02 1.041e+03, threshold=7.604e+02, percent-clipped=11.0 2023-06-19 02:46:14,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=266730.0, ans=0.0 2023-06-19 02:47:18,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=266850.0, ans=0.0 2023-06-19 02:47:18,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-19 02:47:18,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=266850.0, ans=15.0 2023-06-19 02:48:03,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=266910.0, ans=0.125 2023-06-19 02:48:21,250 INFO [train.py:996] (2/4) Epoch 2, batch 14000, loss[loss=0.2402, simple_loss=0.336, pruned_loss=0.07219, over 21418.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3511, pruned_loss=0.112, over 4265209.30 frames. 
], batch size: 211, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:48:30,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=266970.0, ans=0.0 2023-06-19 02:49:29,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=267150.0, ans=0.125 2023-06-19 02:49:32,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=267150.0, ans=0.2 2023-06-19 02:49:35,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-19 02:50:09,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=267210.0, ans=0.125 2023-06-19 02:50:17,481 INFO [train.py:996] (2/4) Epoch 2, batch 14050, loss[loss=0.2193, simple_loss=0.2864, pruned_loss=0.07607, over 21343.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3429, pruned_loss=0.1056, over 4261306.84 frames. ], batch size: 176, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:50:24,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 2.535e+02 3.106e+02 3.697e+02 6.124e+02, threshold=6.211e+02, percent-clipped=0.0 2023-06-19 02:51:14,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0 2023-06-19 02:52:12,810 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-19 02:52:24,971 INFO [train.py:996] (2/4) Epoch 2, batch 14100, loss[loss=0.2688, simple_loss=0.3306, pruned_loss=0.1035, over 15075.00 frames. ], tot_loss[loss=0.274, simple_loss=0.3366, pruned_loss=0.1058, over 4253167.65 frames. ], batch size: 61, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:52:26,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=267570.0, ans=0.125 2023-06-19 02:52:44,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=267630.0, ans=0.125 2023-06-19 02:52:46,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=267630.0, ans=0.125 2023-06-19 02:54:11,545 INFO [train.py:996] (2/4) Epoch 2, batch 14150, loss[loss=0.3374, simple_loss=0.3915, pruned_loss=0.1416, over 21447.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.341, pruned_loss=0.1073, over 4257234.29 frames. 
], batch size: 471, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:54:22,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=267870.0, ans=0.015 2023-06-19 02:54:23,830 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.842e+02 3.231e+02 3.958e+02 7.482e+02, threshold=6.462e+02, percent-clipped=1.0 2023-06-19 02:55:17,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=268050.0, ans=0.125 2023-06-19 02:55:34,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=268110.0, ans=0.0 2023-06-19 02:55:47,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=268110.0, ans=0.125 2023-06-19 02:55:48,868 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:56:02,057 INFO [train.py:996] (2/4) Epoch 2, batch 14200, loss[loss=0.3017, simple_loss=0.3885, pruned_loss=0.1074, over 20823.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3387, pruned_loss=0.105, over 4257703.68 frames. ], batch size: 608, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:56:20,977 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:56:57,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=268290.0, ans=0.0 2023-06-19 02:57:00,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2023-06-19 02:57:55,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=268470.0, ans=0.2 2023-06-19 02:58:11,568 INFO [train.py:996] (2/4) Epoch 2, batch 14250, loss[loss=0.2244, simple_loss=0.2822, pruned_loss=0.08333, over 21231.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3332, pruned_loss=0.1044, over 4263373.35 frames. ], batch size: 159, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:58:25,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 2.584e+02 3.251e+02 4.167e+02 7.132e+02, threshold=6.503e+02, percent-clipped=3.0 2023-06-19 02:59:25,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=268650.0, ans=0.125 2023-06-19 02:59:32,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=268710.0, ans=0.2 2023-06-19 03:00:06,460 INFO [train.py:996] (2/4) Epoch 2, batch 14300, loss[loss=0.3797, simple_loss=0.4516, pruned_loss=0.154, over 21671.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3381, pruned_loss=0.1044, over 4265383.54 frames. ], batch size: 414, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:00:08,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. 
limit=22.5 2023-06-19 03:00:54,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=268830.0, ans=0.0 2023-06-19 03:01:00,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268830.0, ans=0.1 2023-06-19 03:01:04,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=268890.0, ans=10.0 2023-06-19 03:01:29,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268890.0, ans=0.1 2023-06-19 03:01:51,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=268950.0, ans=0.125 2023-06-19 03:02:34,029 INFO [train.py:996] (2/4) Epoch 2, batch 14350, loss[loss=0.2973, simple_loss=0.3573, pruned_loss=0.1187, over 21458.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3453, pruned_loss=0.1057, over 4256185.94 frames. ], batch size: 548, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:02:54,121 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.713e+02 3.120e+02 4.327e+02 9.558e+02, threshold=6.239e+02, percent-clipped=7.0 2023-06-19 03:03:10,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=269130.0, ans=0.125 2023-06-19 03:03:18,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=269130.0, ans=0.125 2023-06-19 03:03:18,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=269130.0, ans=0.125 2023-06-19 03:04:36,234 INFO [train.py:996] (2/4) Epoch 2, batch 14400, loss[loss=0.291, simple_loss=0.3439, pruned_loss=0.119, over 21857.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3439, pruned_loss=0.1084, over 4265074.58 frames. ], batch size: 107, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:06:32,265 INFO [train.py:996] (2/4) Epoch 2, batch 14450, loss[loss=0.2762, simple_loss=0.3198, pruned_loss=0.1164, over 21251.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3373, pruned_loss=0.1087, over 4267759.18 frames. ], batch size: 176, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:06:39,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269670.0, ans=0.1 2023-06-19 03:06:46,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.837e+02 3.414e+02 4.190e+02 7.999e+02, threshold=6.829e+02, percent-clipped=4.0 2023-06-19 03:07:07,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=269730.0, ans=10.0 2023-06-19 03:07:55,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=269850.0, ans=0.125 2023-06-19 03:08:13,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=269910.0, ans=10.0 2023-06-19 03:08:35,907 INFO [train.py:996] (2/4) Epoch 2, batch 14500, loss[loss=0.2711, simple_loss=0.342, pruned_loss=0.1001, over 21246.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3335, pruned_loss=0.1077, over 4255078.20 frames. 
], batch size: 548, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:08:44,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=269970.0, ans=0.125 2023-06-19 03:08:45,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.34 vs. limit=22.5 2023-06-19 03:09:48,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=270150.0, ans=0.035 2023-06-19 03:09:48,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=270150.0, ans=0.2 2023-06-19 03:10:34,022 INFO [train.py:996] (2/4) Epoch 2, batch 14550, loss[loss=0.3127, simple_loss=0.3641, pruned_loss=0.1307, over 21361.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3396, pruned_loss=0.1102, over 4264557.04 frames. ], batch size: 549, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:10:55,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.690e+02 3.116e+02 3.745e+02 6.340e+02, threshold=6.231e+02, percent-clipped=0.0 2023-06-19 03:11:06,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270330.0, ans=0.1 2023-06-19 03:11:53,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=270450.0, ans=0.2 2023-06-19 03:12:14,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=270450.0, ans=0.2 2023-06-19 03:12:53,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=270510.0, ans=0.025 2023-06-19 03:12:58,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=270510.0, ans=0.05 2023-06-19 03:13:00,814 INFO [train.py:996] (2/4) Epoch 2, batch 14600, loss[loss=0.2726, simple_loss=0.3522, pruned_loss=0.09651, over 21560.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3482, pruned_loss=0.1151, over 4272173.66 frames. ], batch size: 230, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:15:08,360 INFO [train.py:996] (2/4) Epoch 2, batch 14650, loss[loss=0.2475, simple_loss=0.2912, pruned_loss=0.1019, over 20880.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3508, pruned_loss=0.1142, over 4271716.66 frames. 
], batch size: 608, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:15:27,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=270870.0, ans=0.0 2023-06-19 03:15:28,427 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.832e+02 3.402e+02 3.897e+02 6.395e+02, threshold=6.804e+02, percent-clipped=2.0 2023-06-19 03:16:17,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=270990.0, ans=0.1 2023-06-19 03:16:22,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=270990.0, ans=0.0 2023-06-19 03:16:27,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=271050.0, ans=0.0 2023-06-19 03:17:06,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=271110.0, ans=0.125 2023-06-19 03:17:32,066 INFO [train.py:996] (2/4) Epoch 2, batch 14700, loss[loss=0.207, simple_loss=0.3014, pruned_loss=0.05636, over 21682.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3405, pruned_loss=0.1051, over 4259837.26 frames. ], batch size: 247, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:17:35,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=271170.0, ans=0.125 2023-06-19 03:17:40,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=271170.0, ans=0.0 2023-06-19 03:18:45,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=271350.0, ans=0.2 2023-06-19 03:19:15,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=271410.0, ans=0.125 2023-06-19 03:19:41,507 INFO [train.py:996] (2/4) Epoch 2, batch 14750, loss[loss=0.5095, simple_loss=0.5262, pruned_loss=0.2465, over 21384.00 frames. ], tot_loss[loss=0.284, simple_loss=0.349, pruned_loss=0.1095, over 4266901.02 frames. ], batch size: 507, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:19:46,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=271470.0, ans=0.0 2023-06-19 03:20:21,831 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 2.730e+02 3.490e+02 4.322e+02 1.005e+03, threshold=6.981e+02, percent-clipped=5.0 2023-06-19 03:20:28,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=271530.0, ans=0.0 2023-06-19 03:20:53,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=271590.0, ans=0.125 2023-06-19 03:21:45,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=271710.0, ans=0.025 2023-06-19 03:22:05,211 INFO [train.py:996] (2/4) Epoch 2, batch 14800, loss[loss=0.3043, simple_loss=0.3524, pruned_loss=0.1281, over 19980.00 frames. ], tot_loss[loss=0.298, simple_loss=0.3621, pruned_loss=0.117, over 4267155.83 frames. 
], batch size: 702, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:23:20,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-19 03:23:32,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=271950.0, ans=0.0 2023-06-19 03:24:36,333 INFO [train.py:996] (2/4) Epoch 2, batch 14850, loss[loss=0.2587, simple_loss=0.3154, pruned_loss=0.101, over 21661.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3546, pruned_loss=0.1163, over 4268660.86 frames. ], batch size: 247, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:24:36,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=272070.0, ans=0.0 2023-06-19 03:24:45,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.062e+02 3.650e+02 4.413e+02 7.562e+02, threshold=7.301e+02, percent-clipped=4.0 2023-06-19 03:25:47,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=272190.0, ans=0.125 2023-06-19 03:25:54,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.26 vs. limit=22.5 2023-06-19 03:26:24,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-19 03:26:26,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=272250.0, ans=0.0 2023-06-19 03:26:49,814 INFO [train.py:996] (2/4) Epoch 2, batch 14900, loss[loss=0.3214, simple_loss=0.3775, pruned_loss=0.1326, over 21497.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3581, pruned_loss=0.1192, over 4267233.15 frames. ], batch size: 194, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:27:03,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=272370.0, ans=0.0 2023-06-19 03:27:06,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=272370.0, ans=0.2 2023-06-19 03:27:31,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=15.0 2023-06-19 03:27:55,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=272490.0, ans=0.0 2023-06-19 03:28:03,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=272490.0, ans=0.0 2023-06-19 03:28:18,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-19 03:29:10,899 INFO [train.py:996] (2/4) Epoch 2, batch 14950, loss[loss=0.2862, simple_loss=0.3491, pruned_loss=0.1117, over 21286.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.358, pruned_loss=0.1183, over 4272209.27 frames. 
], batch size: 176, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:29:22,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=272670.0, ans=0.125 2023-06-19 03:29:25,172 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.792e+02 3.355e+02 4.143e+02 6.575e+02, threshold=6.711e+02, percent-clipped=0.0 2023-06-19 03:30:57,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=15.0 2023-06-19 03:31:02,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=272910.0, ans=0.95 2023-06-19 03:31:17,961 INFO [train.py:996] (2/4) Epoch 2, batch 15000, loss[loss=0.2846, simple_loss=0.3393, pruned_loss=0.115, over 21513.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3599, pruned_loss=0.1194, over 4263961.40 frames. ], batch size: 194, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:31:17,961 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 03:32:09,032 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.272, simple_loss=0.3679, pruned_loss=0.08803, over 1796401.00 frames. 2023-06-19 03:32:09,040 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 03:32:20,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=272970.0, ans=0.0 2023-06-19 03:32:27,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=272970.0, ans=0.0 2023-06-19 03:32:29,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=273030.0, ans=0.0 2023-06-19 03:32:41,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=273030.0, ans=0.125 2023-06-19 03:33:17,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273150.0, ans=0.1 2023-06-19 03:33:26,480 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=12.0 2023-06-19 03:33:57,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.33 vs. limit=22.5 2023-06-19 03:34:11,073 INFO [train.py:996] (2/4) Epoch 2, batch 15050, loss[loss=0.3203, simple_loss=0.412, pruned_loss=0.1143, over 21233.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3622, pruned_loss=0.1202, over 4263954.52 frames. ], batch size: 548, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:34:28,427 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.105e+02 3.767e+02 5.029e+02 7.583e+02, threshold=7.535e+02, percent-clipped=4.0 2023-06-19 03:35:18,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=273390.0, ans=0.125 2023-06-19 03:36:22,093 INFO [train.py:996] (2/4) Epoch 2, batch 15100, loss[loss=0.3263, simple_loss=0.3847, pruned_loss=0.1339, over 21856.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3657, pruned_loss=0.1203, over 4265945.35 frames. 
], batch size: 371, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:37:16,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=273630.0, ans=0.05 2023-06-19 03:37:22,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=273690.0, ans=0.0 2023-06-19 03:37:27,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=273690.0, ans=0.125 2023-06-19 03:37:44,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=273750.0, ans=0.0 2023-06-19 03:38:07,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=273810.0, ans=0.0 2023-06-19 03:38:07,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=273810.0, ans=0.125 2023-06-19 03:38:07,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.15 vs. limit=10.0 2023-06-19 03:38:41,651 INFO [train.py:996] (2/4) Epoch 2, batch 15150, loss[loss=0.255, simple_loss=0.3078, pruned_loss=0.1011, over 21543.00 frames. ], tot_loss[loss=0.3016, simple_loss=0.3615, pruned_loss=0.1208, over 4267658.41 frames. ], batch size: 132, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:38:46,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=273870.0, ans=0.0 2023-06-19 03:38:56,088 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.943e+02 3.299e+02 3.908e+02 6.468e+02, threshold=6.598e+02, percent-clipped=0.0 2023-06-19 03:38:56,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=273870.0, ans=0.0 2023-06-19 03:39:04,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=273930.0, ans=0.125 2023-06-19 03:39:15,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=273930.0, ans=0.0 2023-06-19 03:39:47,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=273990.0, ans=0.125 2023-06-19 03:39:50,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=273990.0, ans=0.0 2023-06-19 03:39:52,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-19 03:40:09,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-19 03:40:31,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=274110.0, ans=0.125 2023-06-19 03:40:37,432 INFO [train.py:996] (2/4) Epoch 2, batch 15200, loss[loss=0.2934, simple_loss=0.3607, pruned_loss=0.113, over 21418.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3514, pruned_loss=0.1149, over 4265239.94 frames. 
], batch size: 507, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:42:01,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=274350.0, ans=10.0 2023-06-19 03:42:14,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=274410.0, ans=0.125 2023-06-19 03:42:21,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=274410.0, ans=0.02 2023-06-19 03:42:28,463 INFO [train.py:996] (2/4) Epoch 2, batch 15250, loss[loss=0.2655, simple_loss=0.3182, pruned_loss=0.1064, over 21568.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3444, pruned_loss=0.1129, over 4262045.24 frames. ], batch size: 263, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:42:29,561 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.79 vs. limit=15.0 2023-06-19 03:42:42,820 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.617e+02 3.072e+02 3.337e+02 6.038e+02, threshold=6.144e+02, percent-clipped=0.0 2023-06-19 03:43:02,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-19 03:43:03,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=274530.0, ans=0.0 2023-06-19 03:44:37,229 INFO [train.py:996] (2/4) Epoch 2, batch 15300, loss[loss=0.311, simple_loss=0.3732, pruned_loss=0.1244, over 21319.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3485, pruned_loss=0.1165, over 4260629.34 frames. ], batch size: 176, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:45:36,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=274890.0, ans=0.1 2023-06-19 03:46:38,761 INFO [train.py:996] (2/4) Epoch 2, batch 15350, loss[loss=0.2782, simple_loss=0.3632, pruned_loss=0.09665, over 21769.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3557, pruned_loss=0.1192, over 4267999.55 frames. ], batch size: 247, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:47:11,414 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 3.136e+02 3.689e+02 4.822e+02 8.057e+02, threshold=7.379e+02, percent-clipped=7.0 2023-06-19 03:47:11,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=275070.0, ans=0.0 2023-06-19 03:47:20,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-19 03:47:47,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=275190.0, ans=0.0 2023-06-19 03:48:03,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. 
limit=15.0 2023-06-19 03:48:37,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=275310.0, ans=0.2 2023-06-19 03:48:51,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=275370.0, ans=0.0 2023-06-19 03:48:52,822 INFO [train.py:996] (2/4) Epoch 2, batch 15400, loss[loss=0.2521, simple_loss=0.3176, pruned_loss=0.0933, over 21852.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3554, pruned_loss=0.1166, over 4265079.51 frames. ], batch size: 282, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:50:21,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=275550.0, ans=0.125 2023-06-19 03:50:42,046 INFO [train.py:996] (2/4) Epoch 2, batch 15450, loss[loss=0.2742, simple_loss=0.3347, pruned_loss=0.1068, over 21900.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3514, pruned_loss=0.115, over 4270705.65 frames. ], batch size: 107, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:51:14,675 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.540e+02 3.062e+02 3.855e+02 5.645e+02, threshold=6.124e+02, percent-clipped=0.0 2023-06-19 03:51:54,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-19 03:53:13,389 INFO [train.py:996] (2/4) Epoch 2, batch 15500, loss[loss=0.2902, simple_loss=0.3759, pruned_loss=0.1023, over 21321.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3514, pruned_loss=0.1144, over 4264913.14 frames. ], batch size: 548, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:53:33,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=275970.0, ans=0.0 2023-06-19 03:54:46,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=276150.0, ans=0.1 2023-06-19 03:55:02,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=276210.0, ans=0.125 2023-06-19 03:55:28,550 INFO [train.py:996] (2/4) Epoch 2, batch 15550, loss[loss=0.2453, simple_loss=0.3118, pruned_loss=0.08934, over 21207.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3495, pruned_loss=0.1118, over 4268096.97 frames. ], batch size: 159, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:55:49,738 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.861e+02 3.609e+02 4.664e+02 1.166e+03, threshold=7.218e+02, percent-clipped=12.0 2023-06-19 03:56:55,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=276450.0, ans=0.0 2023-06-19 03:57:15,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-19 03:57:36,357 INFO [train.py:996] (2/4) Epoch 2, batch 15600, loss[loss=0.2694, simple_loss=0.3469, pruned_loss=0.09599, over 21550.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3436, pruned_loss=0.1108, over 4270791.94 frames. ], batch size: 230, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 03:58:10,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.48 vs. 
limit=15.0 2023-06-19 03:58:12,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-06-19 03:58:44,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=276690.0, ans=0.0 2023-06-19 03:58:47,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=276690.0, ans=0.125 2023-06-19 03:59:59,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=276870.0, ans=0.015 2023-06-19 03:59:59,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=276870.0, ans=0.2 2023-06-19 04:00:05,933 INFO [train.py:996] (2/4) Epoch 2, batch 15650, loss[loss=0.313, simple_loss=0.3556, pruned_loss=0.1352, over 20777.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3421, pruned_loss=0.1093, over 4269972.69 frames. ], batch size: 611, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:00:14,621 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.975e+02 3.392e+02 4.478e+02 7.977e+02, threshold=6.785e+02, percent-clipped=4.0 2023-06-19 04:01:53,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=277110.0, ans=0.04949747468305833 2023-06-19 04:02:03,675 INFO [train.py:996] (2/4) Epoch 2, batch 15700, loss[loss=0.2269, simple_loss=0.308, pruned_loss=0.07294, over 21420.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3387, pruned_loss=0.1083, over 4265444.28 frames. ], batch size: 211, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:02:15,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=277170.0, ans=0.125 2023-06-19 04:02:33,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=277230.0, ans=0.125 2023-06-19 04:02:34,257 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-19 04:03:22,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=277350.0, ans=0.0 2023-06-19 04:03:26,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277350.0, ans=0.1 2023-06-19 04:03:53,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-19 04:04:04,400 INFO [train.py:996] (2/4) Epoch 2, batch 15750, loss[loss=0.2494, simple_loss=0.3109, pruned_loss=0.09399, over 21756.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3333, pruned_loss=0.1073, over 4274143.75 frames. 
], batch size: 112, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:04:05,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=277470.0, ans=0.125 2023-06-19 04:04:22,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.529e+02 2.886e+02 3.398e+02 5.891e+02, threshold=5.773e+02, percent-clipped=0.0 2023-06-19 04:04:33,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=277530.0, ans=0.5 2023-06-19 04:06:19,593 INFO [train.py:996] (2/4) Epoch 2, batch 15800, loss[loss=0.2498, simple_loss=0.3028, pruned_loss=0.09842, over 21696.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3292, pruned_loss=0.1066, over 4278598.84 frames. ], batch size: 282, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:06:36,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=277770.0, ans=0.0 2023-06-19 04:06:46,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=277830.0, ans=0.07 2023-06-19 04:07:41,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=277950.0, ans=0.2 2023-06-19 04:08:40,429 INFO [train.py:996] (2/4) Epoch 2, batch 15850, loss[loss=0.2728, simple_loss=0.3315, pruned_loss=0.1071, over 21928.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3345, pruned_loss=0.1095, over 4264968.40 frames. ], batch size: 317, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:08:49,298 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.859e+02 3.359e+02 4.336e+02 6.556e+02, threshold=6.719e+02, percent-clipped=7.0 2023-06-19 04:08:50,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0 2023-06-19 04:09:26,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=278130.0, ans=0.0 2023-06-19 04:09:33,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=278190.0, ans=0.0 2023-06-19 04:10:38,067 INFO [train.py:996] (2/4) Epoch 2, batch 15900, loss[loss=0.269, simple_loss=0.3342, pruned_loss=0.1019, over 21865.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3317, pruned_loss=0.1085, over 4267001.10 frames. ], batch size: 98, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:12:36,282 INFO [train.py:996] (2/4) Epoch 2, batch 15950, loss[loss=0.2842, simple_loss=0.3494, pruned_loss=0.1095, over 21775.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3314, pruned_loss=0.105, over 4266662.22 frames. ], batch size: 112, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:12:58,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. 
limit=15.0 2023-06-19 04:13:02,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.565e+02 3.122e+02 3.960e+02 8.698e+02, threshold=6.245e+02, percent-clipped=1.0 2023-06-19 04:13:16,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=278730.0, ans=0.1 2023-06-19 04:14:03,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=278790.0, ans=0.125 2023-06-19 04:14:10,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=278850.0, ans=0.0 2023-06-19 04:15:07,378 INFO [train.py:996] (2/4) Epoch 2, batch 16000, loss[loss=0.2471, simple_loss=0.3273, pruned_loss=0.08343, over 21886.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3315, pruned_loss=0.1031, over 4262925.19 frames. ], batch size: 316, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:16:32,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=279150.0, ans=0.0 2023-06-19 04:17:13,455 INFO [train.py:996] (2/4) Epoch 2, batch 16050, loss[loss=0.2799, simple_loss=0.3576, pruned_loss=0.1011, over 21717.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3322, pruned_loss=0.1007, over 4267726.22 frames. ], batch size: 298, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:17:30,615 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 2.605e+02 3.348e+02 4.219e+02 7.021e+02, threshold=6.696e+02, percent-clipped=3.0 2023-06-19 04:18:04,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=279390.0, ans=0.125 2023-06-19 04:18:07,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=279390.0, ans=0.125 2023-06-19 04:19:19,761 INFO [train.py:996] (2/4) Epoch 2, batch 16100, loss[loss=0.2637, simple_loss=0.3262, pruned_loss=0.1006, over 21831.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3365, pruned_loss=0.104, over 4276642.51 frames. ], batch size: 282, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:19:37,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-19 04:19:43,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=279630.0, ans=0.2 2023-06-19 04:19:45,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=279630.0, ans=0.125 2023-06-19 04:20:06,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. limit=6.0 2023-06-19 04:21:00,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-19 04:21:09,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=279810.0, ans=0.07 2023-06-19 04:21:27,811 INFO [train.py:996] (2/4) Epoch 2, batch 16150, loss[loss=0.298, simple_loss=0.3597, pruned_loss=0.1181, over 21892.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3378, pruned_loss=0.1076, over 4288950.94 frames. 
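The optim.py lines periodically summarise recent gradient norms as five values that look like min / 25% / median / 75% / max, together with a clipping threshold and the fraction of recently clipped batches. In the numbers above the threshold tracks Clipping_scale times the median, e.g. 2.0 * 3.348e+02 = 6.696e+02. A minimal sketch of that style of adaptive clipping is below, assuming the threshold really is just twice a running median over a fixed window; the window size and the exact bookkeeping are assumptions.

```python
# Illustrative sketch of median-based adaptive gradient clipping, in the style
# of the "Clipping_scale=2.0, grad-norm quartiles ... threshold=..." log lines.
# Window size and reporting details are assumptions, not taken from optim.py.
from collections import deque
from statistics import median, quantiles

import torch


class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.num_clipped = 0
        self.num_seen = 0

    def clip_(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(
            torch.stack([p.grad.detach().norm(2) for p in params])
        ).item()
        self.norms.append(norm)
        threshold = self.clipping_scale * median(self.norms)
        self.num_seen += 1
        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)  # rescale grads down to the threshold
        return threshold

    def summary(self) -> str:
        q = quantiles(self.norms, n=4) if len(self.norms) >= 2 else [0.0] * 3
        pct = 100.0 * self.num_clipped / max(1, self.num_seen)
        return (
            f"grad-norm quartiles {min(self.norms):.3e} {q[0]:.3e} "
            f"{q[1]:.3e} {q[2]:.3e} {max(self.norms):.3e}, "
            f"threshold={self.clipping_scale * q[1]:.3e}, percent-clipped={pct:.1f}"
        )
```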
], batch size: 124, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:21:57,623 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.251e+02 4.179e+02 4.916e+02 8.388e+02, threshold=8.358e+02, percent-clipped=5.0 2023-06-19 04:21:59,002 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.67 vs. limit=22.5 2023-06-19 04:22:02,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=279930.0, ans=0.0 2023-06-19 04:22:21,252 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-19 04:23:31,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-06-19 04:23:35,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280170.0, ans=0.1 2023-06-19 04:23:36,111 INFO [train.py:996] (2/4) Epoch 2, batch 16200, loss[loss=0.331, simple_loss=0.3855, pruned_loss=0.1383, over 21882.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3445, pruned_loss=0.1105, over 4291965.81 frames. ], batch size: 371, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:24:35,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=280290.0, ans=0.125 2023-06-19 04:24:35,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=280290.0, ans=0.125 2023-06-19 04:25:51,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=280410.0, ans=0.05 2023-06-19 04:25:56,552 INFO [train.py:996] (2/4) Epoch 2, batch 16250, loss[loss=0.1949, simple_loss=0.262, pruned_loss=0.06384, over 16225.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3442, pruned_loss=0.1103, over 4278883.75 frames. ], batch size: 60, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:26:09,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=280470.0, ans=0.0 2023-06-19 04:26:14,109 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.660e+02 3.048e+02 3.476e+02 5.105e+02, threshold=6.097e+02, percent-clipped=0.0 2023-06-19 04:27:01,173 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-19 04:27:10,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=280650.0, ans=0.125 2023-06-19 04:27:46,219 INFO [train.py:996] (2/4) Epoch 2, batch 16300, loss[loss=0.2922, simple_loss=0.3647, pruned_loss=0.1098, over 20988.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3376, pruned_loss=0.1047, over 4265714.16 frames. 
], batch size: 607, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:28:23,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=280830.0, ans=0.2 2023-06-19 04:28:40,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=280890.0, ans=0.0 2023-06-19 04:28:48,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.26 vs. limit=10.0 2023-06-19 04:29:47,571 INFO [train.py:996] (2/4) Epoch 2, batch 16350, loss[loss=0.2909, simple_loss=0.3554, pruned_loss=0.1132, over 21714.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3382, pruned_loss=0.1058, over 4262879.85 frames. ], batch size: 298, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:30:23,873 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.593e+02 3.131e+02 3.921e+02 6.968e+02, threshold=6.263e+02, percent-clipped=2.0 2023-06-19 04:31:57,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=281310.0, ans=0.1 2023-06-19 04:32:19,369 INFO [train.py:996] (2/4) Epoch 2, batch 16400, loss[loss=0.3669, simple_loss=0.3904, pruned_loss=0.1717, over 21673.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3432, pruned_loss=0.1085, over 4272571.49 frames. ], batch size: 507, lr: 1.63e-02, grad_scale: 32.0 2023-06-19 04:33:01,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281430.0, ans=0.1 2023-06-19 04:33:06,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=281430.0, ans=0.125 2023-06-19 04:33:07,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=281430.0, ans=0.025 2023-06-19 04:33:24,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=281550.0, ans=0.0 2023-06-19 04:34:12,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=281610.0, ans=0.125 2023-06-19 04:34:38,584 INFO [train.py:996] (2/4) Epoch 2, batch 16450, loss[loss=0.2456, simple_loss=0.301, pruned_loss=0.09513, over 21264.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3444, pruned_loss=0.1111, over 4281701.09 frames. ], batch size: 608, lr: 1.63e-02, grad_scale: 32.0 2023-06-19 04:34:56,416 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.650e+02 3.270e+02 4.334e+02 6.545e+02, threshold=6.541e+02, percent-clipped=2.0 2023-06-19 04:34:57,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-19 04:35:00,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.72 vs. limit=6.0 2023-06-19 04:35:26,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=281790.0, ans=0.0 2023-06-19 04:36:01,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.84 vs. 
limit=15.0 2023-06-19 04:36:49,342 INFO [train.py:996] (2/4) Epoch 2, batch 16500, loss[loss=0.2093, simple_loss=0.2645, pruned_loss=0.07708, over 21352.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3456, pruned_loss=0.1124, over 4278308.73 frames. ], batch size: 194, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:37:29,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.30 vs. limit=15.0 2023-06-19 04:37:33,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.68 vs. limit=12.0 2023-06-19 04:39:04,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=282210.0, ans=0.125 2023-06-19 04:39:25,972 INFO [train.py:996] (2/4) Epoch 2, batch 16550, loss[loss=0.2968, simple_loss=0.3637, pruned_loss=0.115, over 21275.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.343, pruned_loss=0.1087, over 4277183.53 frames. ], batch size: 548, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:39:30,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=282270.0, ans=0.125 2023-06-19 04:39:38,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.903e+02 3.447e+02 4.239e+02 9.534e+02, threshold=6.894e+02, percent-clipped=2.0 2023-06-19 04:39:40,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=282330.0, ans=0.5 2023-06-19 04:39:50,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=282330.0, ans=0.0 2023-06-19 04:41:07,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=282450.0, ans=0.0 2023-06-19 04:41:09,380 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:41:09,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282450.0, ans=0.1 2023-06-19 04:41:46,296 INFO [train.py:996] (2/4) Epoch 2, batch 16600, loss[loss=0.3562, simple_loss=0.4352, pruned_loss=0.1386, over 21636.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3532, pruned_loss=0.1134, over 4283257.32 frames. ], batch size: 389, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:42:05,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=282630.0, ans=0.2 2023-06-19 04:43:33,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=282810.0, ans=0.125 2023-06-19 04:43:36,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=282810.0, ans=0.125 2023-06-19 04:44:14,846 INFO [train.py:996] (2/4) Epoch 2, batch 16650, loss[loss=0.3966, simple_loss=0.4287, pruned_loss=0.1823, over 21310.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3626, pruned_loss=0.1169, over 4283659.08 frames. 
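The Whitening lines compare a metric computed from a module's activations against a limit (e.g. whiten_keys at metric=4.72 vs. limit=6.0 above); the metric grows when the channel covariance is dominated by a few directions instead of being close to white. The sketch below shows one natural way to write such a measure, the mean squared eigenvalue of the covariance divided by the squared mean eigenvalue, which equals 1.0 for perfectly white features. It illustrates the idea, not necessarily the exact expression used in scaling.py.

```python
# Hedged sketch of a whiteness metric for activations of shape (frames, channels).
# metric == 1.0 when the channel covariance is a multiple of the identity and it
# grows as variance concentrates in a few directions. An illustration of the
# "Whitening: ... metric=X vs. limit=Y" idea, not the exact scaling.py formula.
import torch


def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    x = x - x.mean(dim=0, keepdim=True)          # zero-mean per channel
    cov = (x.t() @ x) / x.shape[0]               # (channels, channels) covariance
    num_channels = cov.shape[0]
    # mean(eig^2) / mean(eig)^2, written with traces so no eigendecomposition
    # is needed: trace(cov @ cov) = sum(eig^2) and trace(cov) = sum(eig).
    mean_eig_sq = torch.trace(cov @ cov) / num_channels
    mean_eig = torch.trace(cov) / num_channels
    return mean_eig_sq / (mean_eig ** 2 + 1e-20)


white = torch.randn(1000, 256)
print(whitening_metric(white))                                   # close to 1
print(whitening_metric(white * torch.linspace(0.1, 3.0, 256)))   # noticeably larger
```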
], batch size: 507, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:44:28,261 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.764e+02 3.302e+02 3.864e+02 7.360e+02, threshold=6.604e+02, percent-clipped=1.0 2023-06-19 04:44:31,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282930.0, ans=0.1 2023-06-19 04:44:44,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=282930.0, ans=0.2 2023-06-19 04:45:06,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=282990.0, ans=0.125 2023-06-19 04:45:16,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=282990.0, ans=0.0 2023-06-19 04:45:28,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=283050.0, ans=0.125 2023-06-19 04:45:47,150 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-19 04:46:32,964 INFO [train.py:996] (2/4) Epoch 2, batch 16700, loss[loss=0.2479, simple_loss=0.2966, pruned_loss=0.09964, over 21120.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3601, pruned_loss=0.1163, over 4277660.24 frames. ], batch size: 143, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:47:34,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-19 04:47:57,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=283290.0, ans=0.07 2023-06-19 04:47:57,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=283290.0, ans=0.0 2023-06-19 04:47:59,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=22.5 2023-06-19 04:48:46,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-19 04:49:06,298 INFO [train.py:996] (2/4) Epoch 2, batch 16750, loss[loss=0.299, simple_loss=0.3623, pruned_loss=0.1178, over 21257.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3634, pruned_loss=0.1178, over 4272410.71 frames. 
], batch size: 143, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:49:32,336 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.141e+02 3.831e+02 4.600e+02 7.842e+02, threshold=7.663e+02, percent-clipped=1.0 2023-06-19 04:50:06,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=283530.0, ans=10.0 2023-06-19 04:50:28,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=283590.0, ans=0.125 2023-06-19 04:50:56,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=283650.0, ans=0.125 2023-06-19 04:51:33,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=283710.0, ans=0.125 2023-06-19 04:51:37,206 INFO [train.py:996] (2/4) Epoch 2, batch 16800, loss[loss=0.3236, simple_loss=0.382, pruned_loss=0.1325, over 20618.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3671, pruned_loss=0.1186, over 4272667.35 frames. ], batch size: 607, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:51:40,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=283770.0, ans=0.125 2023-06-19 04:52:08,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=283830.0, ans=0.025 2023-06-19 04:52:19,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=283830.0, ans=0.125 2023-06-19 04:52:28,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=283830.0, ans=10.0 2023-06-19 04:53:54,892 INFO [train.py:996] (2/4) Epoch 2, batch 16850, loss[loss=0.3047, simple_loss=0.3526, pruned_loss=0.1284, over 21920.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3625, pruned_loss=0.1184, over 4276693.61 frames. ], batch size: 414, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:54:16,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=22.5 2023-06-19 04:54:16,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 3.232e+02 3.798e+02 4.468e+02 9.826e+02, threshold=7.596e+02, percent-clipped=4.0 2023-06-19 04:54:17,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=284130.0, ans=0.1 2023-06-19 04:54:43,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=284130.0, ans=0.125 2023-06-19 04:54:45,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=284130.0, ans=0.0 2023-06-19 04:54:48,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=284190.0, ans=0.1 2023-06-19 04:55:26,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=284250.0, ans=0.125 2023-06-19 04:56:12,102 INFO [train.py:996] (2/4) Epoch 2, batch 16900, loss[loss=0.2188, simple_loss=0.2732, pruned_loss=0.08223, over 21201.00 frames. 
], tot_loss[loss=0.2948, simple_loss=0.3562, pruned_loss=0.1167, over 4285523.22 frames. ], batch size: 176, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:56:16,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-19 04:56:33,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.85 vs. limit=15.0 2023-06-19 04:56:51,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=284430.0, ans=0.0 2023-06-19 04:57:06,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=284490.0, ans=0.125 2023-06-19 04:57:43,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=284550.0, ans=0.2 2023-06-19 04:58:01,230 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-19 04:58:33,306 INFO [train.py:996] (2/4) Epoch 2, batch 16950, loss[loss=0.2567, simple_loss=0.3175, pruned_loss=0.09792, over 20083.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3518, pruned_loss=0.1152, over 4280417.74 frames. ], batch size: 703, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:58:46,763 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.811e+02 3.359e+02 4.146e+02 6.829e+02, threshold=6.718e+02, percent-clipped=0.0 2023-06-19 04:58:52,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.29 vs. limit=10.0 2023-06-19 04:59:52,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=284850.0, ans=0.125 2023-06-19 05:00:11,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=284910.0, ans=0.1 2023-06-19 05:00:18,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-19 05:00:36,263 INFO [train.py:996] (2/4) Epoch 2, batch 17000, loss[loss=0.3022, simple_loss=0.3547, pruned_loss=0.1249, over 21860.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3486, pruned_loss=0.1153, over 4282977.20 frames. ], batch size: 118, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:00:55,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-19 05:01:13,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=285030.0, ans=0.0 2023-06-19 05:02:19,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=285150.0, ans=0.015 2023-06-19 05:02:23,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=285150.0, ans=0.0 2023-06-19 05:03:09,669 INFO [train.py:996] (2/4) Epoch 2, batch 17050, loss[loss=0.306, simple_loss=0.3728, pruned_loss=0.1196, over 21244.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3571, pruned_loss=0.1186, over 4288248.92 frames. 
], batch size: 159, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:03:28,013 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.955e+02 3.306e+02 4.122e+02 6.323e+02, threshold=6.612e+02, percent-clipped=0.0 2023-06-19 05:04:09,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=285390.0, ans=0.125 2023-06-19 05:04:10,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=12.0 2023-06-19 05:05:09,702 INFO [train.py:996] (2/4) Epoch 2, batch 17100, loss[loss=0.3072, simple_loss=0.3576, pruned_loss=0.1284, over 21850.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3547, pruned_loss=0.1187, over 4282187.58 frames. ], batch size: 414, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:06:55,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=285810.0, ans=0.125 2023-06-19 05:07:14,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=285810.0, ans=0.2 2023-06-19 05:07:26,621 INFO [train.py:996] (2/4) Epoch 2, batch 17150, loss[loss=0.2308, simple_loss=0.3065, pruned_loss=0.07751, over 21707.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3499, pruned_loss=0.1167, over 4289611.29 frames. ], batch size: 389, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:08:02,661 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.782e+02 3.187e+02 4.144e+02 6.035e+02, threshold=6.374e+02, percent-clipped=0.0 2023-06-19 05:09:28,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=286110.0, ans=0.125 2023-06-19 05:09:31,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-19 05:09:35,270 INFO [train.py:996] (2/4) Epoch 2, batch 17200, loss[loss=0.3213, simple_loss=0.4165, pruned_loss=0.113, over 19730.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3483, pruned_loss=0.1155, over 4283930.04 frames. ], batch size: 703, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:09:37,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=286170.0, ans=0.0 2023-06-19 05:10:18,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=286230.0, ans=0.035 2023-06-19 05:11:21,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=286350.0, ans=0.125 2023-06-19 05:11:59,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-19 05:12:03,965 INFO [train.py:996] (2/4) Epoch 2, batch 17250, loss[loss=0.4187, simple_loss=0.4408, pruned_loss=0.1983, over 21337.00 frames. ], tot_loss[loss=0.2938, simple_loss=0.3518, pruned_loss=0.1179, over 4286517.00 frames. 
], batch size: 507, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:12:46,967 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.005e+02 3.499e+02 4.330e+02 7.906e+02, threshold=6.999e+02, percent-clipped=5.0 2023-06-19 05:12:47,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=286530.0, ans=0.125 2023-06-19 05:13:24,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=286650.0, ans=0.0 2023-06-19 05:14:30,721 INFO [train.py:996] (2/4) Epoch 2, batch 17300, loss[loss=0.3102, simple_loss=0.3789, pruned_loss=0.1208, over 21783.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3594, pruned_loss=0.1217, over 4287906.29 frames. ], batch size: 124, lr: 1.62e-02, grad_scale: 16.0 2023-06-19 05:15:11,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=286830.0, ans=0.0 2023-06-19 05:15:24,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-19 05:15:45,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-19 05:16:06,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286950.0, ans=0.1 2023-06-19 05:16:19,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=22.5 2023-06-19 05:16:28,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287010.0, ans=0.1 2023-06-19 05:17:10,024 INFO [train.py:996] (2/4) Epoch 2, batch 17350, loss[loss=0.2674, simple_loss=0.3659, pruned_loss=0.08449, over 21261.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3603, pruned_loss=0.1211, over 4283854.21 frames. ], batch size: 548, lr: 1.62e-02, grad_scale: 16.0 2023-06-19 05:17:31,001 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 3.184e+02 3.862e+02 4.907e+02 9.344e+02, threshold=7.725e+02, percent-clipped=8.0 2023-06-19 05:17:53,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=287190.0, ans=0.0 2023-06-19 05:18:07,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-19 05:18:24,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=287250.0, ans=0.125 2023-06-19 05:18:25,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-19 05:19:06,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=287310.0, ans=0.125 2023-06-19 05:19:11,902 INFO [train.py:996] (2/4) Epoch 2, batch 17400, loss[loss=0.2427, simple_loss=0.3053, pruned_loss=0.0901, over 21715.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3562, pruned_loss=0.1172, over 4265808.77 frames. 
], batch size: 247, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:19:24,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2023-06-19 05:19:26,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=287370.0, ans=0.125 2023-06-19 05:20:09,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=287430.0, ans=0.0 2023-06-19 05:21:00,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287550.0, ans=0.1 2023-06-19 05:21:39,829 INFO [train.py:996] (2/4) Epoch 2, batch 17450, loss[loss=0.215, simple_loss=0.2794, pruned_loss=0.07532, over 21105.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3512, pruned_loss=0.1131, over 4259622.22 frames. ], batch size: 143, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:21:58,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=287670.0, ans=0.125 2023-06-19 05:22:16,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.812e+02 3.482e+02 4.322e+02 6.614e+02, threshold=6.964e+02, percent-clipped=0.0 2023-06-19 05:23:39,980 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.03 vs. limit=15.0 2023-06-19 05:23:46,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=287910.0, ans=0.0 2023-06-19 05:23:54,464 INFO [train.py:996] (2/4) Epoch 2, batch 17500, loss[loss=0.2809, simple_loss=0.3377, pruned_loss=0.112, over 21893.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3467, pruned_loss=0.1097, over 4258932.46 frames. ], batch size: 351, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:25:56,524 INFO [train.py:996] (2/4) Epoch 2, batch 17550, loss[loss=0.2724, simple_loss=0.3502, pruned_loss=0.09736, over 21804.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3471, pruned_loss=0.109, over 4258357.86 frames. ], batch size: 371, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:26:08,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=288270.0, ans=0.0 2023-06-19 05:26:10,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 2.669e+02 3.304e+02 4.133e+02 8.142e+02, threshold=6.607e+02, percent-clipped=6.0 2023-06-19 05:27:48,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=288510.0, ans=0.125 2023-06-19 05:27:51,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=288510.0, ans=0.0 2023-06-19 05:27:57,422 INFO [train.py:996] (2/4) Epoch 2, batch 17600, loss[loss=0.2874, simple_loss=0.3459, pruned_loss=0.1145, over 21810.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3481, pruned_loss=0.1084, over 4261461.63 frames. ], batch size: 247, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:27:57,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=288570.0, ans=0.0 2023-06-19 05:28:09,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.05 vs. 
limit=15.0 2023-06-19 05:29:53,846 INFO [train.py:996] (2/4) Epoch 2, batch 17650, loss[loss=0.1964, simple_loss=0.2588, pruned_loss=0.06702, over 21593.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3462, pruned_loss=0.1091, over 4260785.32 frames. ], batch size: 230, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:30:34,314 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.529e+02 3.160e+02 3.730e+02 7.099e+02, threshold=6.320e+02, percent-clipped=1.0 2023-06-19 05:30:37,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=288930.0, ans=0.125 2023-06-19 05:31:14,277 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:31:34,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289050.0, ans=0.1 2023-06-19 05:32:11,540 INFO [train.py:996] (2/4) Epoch 2, batch 17700, loss[loss=0.3178, simple_loss=0.3815, pruned_loss=0.1271, over 21337.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3383, pruned_loss=0.1047, over 4266727.39 frames. ], batch size: 549, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:32:35,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=289170.0, ans=0.07 2023-06-19 05:32:40,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289230.0, ans=0.1 2023-06-19 05:32:44,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=289230.0, ans=0.125 2023-06-19 05:33:03,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=289230.0, ans=0.125 2023-06-19 05:33:19,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-19 05:33:29,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=289290.0, ans=0.125 2023-06-19 05:34:45,047 INFO [train.py:996] (2/4) Epoch 2, batch 17750, loss[loss=0.2977, simple_loss=0.3411, pruned_loss=0.1272, over 20017.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3472, pruned_loss=0.1101, over 4258594.60 frames. ], batch size: 702, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:34:46,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=289470.0, ans=0.0 2023-06-19 05:34:48,465 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:35:15,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.691e+02 3.269e+02 3.962e+02 7.961e+02, threshold=6.538e+02, percent-clipped=2.0 2023-06-19 05:37:04,585 INFO [train.py:996] (2/4) Epoch 2, batch 17800, loss[loss=0.2257, simple_loss=0.2964, pruned_loss=0.07749, over 21415.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3482, pruned_loss=0.1094, over 4263805.87 frames. 
], batch size: 211, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:37:15,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=289770.0, ans=0.2 2023-06-19 05:37:24,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=289830.0, ans=0.0 2023-06-19 05:38:17,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=289890.0, ans=0.0 2023-06-19 05:38:40,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=289950.0, ans=0.1 2023-06-19 05:39:08,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=290010.0, ans=0.0 2023-06-19 05:39:24,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=290010.0, ans=0.0 2023-06-19 05:39:34,467 INFO [train.py:996] (2/4) Epoch 2, batch 17850, loss[loss=0.31, simple_loss=0.3674, pruned_loss=0.1263, over 20707.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3485, pruned_loss=0.1099, over 4257563.88 frames. ], batch size: 607, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:39:46,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=290070.0, ans=0.0 2023-06-19 05:40:01,053 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.782e+02 3.357e+02 4.257e+02 9.205e+02, threshold=6.714e+02, percent-clipped=6.0 2023-06-19 05:40:21,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=290190.0, ans=0.0 2023-06-19 05:40:40,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=290250.0, ans=0.2 2023-06-19 05:40:53,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=290250.0, ans=0.2 2023-06-19 05:40:54,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-19 05:41:19,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=290310.0, ans=0.125 2023-06-19 05:41:50,107 INFO [train.py:996] (2/4) Epoch 2, batch 17900, loss[loss=0.2928, simple_loss=0.3786, pruned_loss=0.1035, over 21743.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3555, pruned_loss=0.1128, over 4263941.10 frames. ], batch size: 332, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:42:08,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.13 vs. limit=15.0 2023-06-19 05:42:58,944 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.90 vs. limit=22.5 2023-06-19 05:43:59,601 INFO [train.py:996] (2/4) Epoch 2, batch 17950, loss[loss=0.3118, simple_loss=0.3815, pruned_loss=0.1211, over 21467.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3534, pruned_loss=0.1083, over 4258418.87 frames. 
], batch size: 507, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:44:34,923 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.675e+02 3.159e+02 3.778e+02 7.100e+02, threshold=6.318e+02, percent-clipped=1.0 2023-06-19 05:44:37,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0 2023-06-19 05:45:27,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=290850.0, ans=0.2 2023-06-19 05:45:28,219 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-19 05:45:32,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-19 05:45:56,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=290910.0, ans=0.0 2023-06-19 05:46:07,266 INFO [train.py:996] (2/4) Epoch 2, batch 18000, loss[loss=0.2605, simple_loss=0.3091, pruned_loss=0.1059, over 21782.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3466, pruned_loss=0.1071, over 4266621.26 frames. ], batch size: 317, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:46:07,266 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 05:47:07,989 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2814, simple_loss=0.3799, pruned_loss=0.0915, over 1796401.00 frames. 2023-06-19 05:47:07,991 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 05:47:23,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290970.0, ans=0.1 2023-06-19 05:47:24,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=290970.0, ans=0.125 2023-06-19 05:47:36,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=291030.0, ans=0.125 2023-06-19 05:47:49,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=291090.0, ans=0.0 2023-06-19 05:48:15,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=291150.0, ans=0.125 2023-06-19 05:48:25,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=291150.0, ans=0.125 2023-06-19 05:48:40,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=291210.0, ans=0.0 2023-06-19 05:48:44,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=291210.0, ans=0.125 2023-06-19 05:48:54,871 INFO [train.py:996] (2/4) Epoch 2, batch 18050, loss[loss=0.2703, simple_loss=0.3258, pruned_loss=0.1074, over 21369.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3408, pruned_loss=0.1062, over 4268333.60 frames. 
], batch size: 211, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:48:58,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=291270.0, ans=0.125 2023-06-19 05:49:01,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=291270.0, ans=0.125 2023-06-19 05:49:26,879 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.691e+02 3.245e+02 3.925e+02 6.947e+02, threshold=6.490e+02, percent-clipped=2.0 2023-06-19 05:49:32,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=291330.0, ans=0.1 2023-06-19 05:50:52,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=291510.0, ans=0.125 2023-06-19 05:51:12,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=291570.0, ans=0.125 2023-06-19 05:51:13,353 INFO [train.py:996] (2/4) Epoch 2, batch 18100, loss[loss=0.3452, simple_loss=0.4098, pruned_loss=0.1403, over 21496.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3471, pruned_loss=0.1107, over 4268614.43 frames. ], batch size: 471, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:51:13,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=291570.0, ans=0.07 2023-06-19 05:51:17,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.33 vs. limit=22.5 2023-06-19 05:52:52,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=291810.0, ans=0.015 2023-06-19 05:53:12,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=291810.0, ans=0.0 2023-06-19 05:53:14,866 INFO [train.py:996] (2/4) Epoch 2, batch 18150, loss[loss=0.2882, simple_loss=0.3482, pruned_loss=0.1141, over 21662.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3469, pruned_loss=0.1095, over 4270511.41 frames. ], batch size: 332, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:53:40,263 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.848e+02 3.405e+02 4.028e+02 7.103e+02, threshold=6.809e+02, percent-clipped=3.0 2023-06-19 05:54:30,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-19 05:54:59,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=292110.0, ans=0.125 2023-06-19 05:55:02,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-19 05:55:05,667 INFO [train.py:996] (2/4) Epoch 2, batch 18200, loss[loss=0.2327, simple_loss=0.302, pruned_loss=0.08166, over 21853.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3408, pruned_loss=0.1084, over 4278598.90 frames. 
], batch size: 102, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:55:29,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=292170.0, ans=0.0 2023-06-19 05:55:45,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=292230.0, ans=0.0 2023-06-19 05:56:13,419 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:57:11,337 INFO [train.py:996] (2/4) Epoch 2, batch 18250, loss[loss=0.2113, simple_loss=0.2725, pruned_loss=0.07507, over 21857.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.333, pruned_loss=0.1051, over 4276238.01 frames. ], batch size: 107, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:57:30,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.559e+02 2.946e+02 3.738e+02 5.936e+02, threshold=5.891e+02, percent-clipped=0.0 2023-06-19 05:59:04,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=292710.0, ans=0.125 2023-06-19 05:59:09,760 INFO [train.py:996] (2/4) Epoch 2, batch 18300, loss[loss=0.2595, simple_loss=0.3212, pruned_loss=0.09894, over 21920.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3318, pruned_loss=0.1034, over 4271096.67 frames. ], batch size: 124, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:59:19,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-19 05:59:46,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=292830.0, ans=0.125 2023-06-19 05:59:48,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=292830.0, ans=0.125 2023-06-19 06:01:22,878 INFO [train.py:996] (2/4) Epoch 2, batch 18350, loss[loss=0.268, simple_loss=0.324, pruned_loss=0.106, over 21297.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.338, pruned_loss=0.1047, over 4272782.68 frames. ], batch size: 144, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:01:24,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=293070.0, ans=0.125 2023-06-19 06:01:43,030 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 3.069e+02 3.880e+02 5.006e+02 9.959e+02, threshold=7.760e+02, percent-clipped=14.0 2023-06-19 06:01:47,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=293130.0, ans=0.0 2023-06-19 06:02:52,968 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:03:19,017 INFO [train.py:996] (2/4) Epoch 2, batch 18400, loss[loss=0.2544, simple_loss=0.3349, pruned_loss=0.087, over 21497.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3343, pruned_loss=0.1021, over 4267930.79 frames. ], batch size: 473, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:04:07,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.20 vs. limit=6.0 2023-06-19 06:04:30,048 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.50 vs. 
limit=15.0 2023-06-19 06:04:40,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=293490.0, ans=0.125 2023-06-19 06:04:53,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=293550.0, ans=0.09899494936611666 2023-06-19 06:05:01,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=293550.0, ans=0.1 2023-06-19 06:05:02,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-19 06:05:12,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293610.0, ans=0.1 2023-06-19 06:05:21,634 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:05:28,472 INFO [train.py:996] (2/4) Epoch 2, batch 18450, loss[loss=0.2842, simple_loss=0.3325, pruned_loss=0.1179, over 21603.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3305, pruned_loss=0.09762, over 4247662.86 frames. ], batch size: 415, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:05:43,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=293670.0, ans=0.1 2023-06-19 06:05:48,547 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.533e+02 3.129e+02 3.878e+02 6.267e+02, threshold=6.259e+02, percent-clipped=0.0 2023-06-19 06:06:30,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=293790.0, ans=0.125 2023-06-19 06:07:28,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=293910.0, ans=0.125 2023-06-19 06:07:36,735 INFO [train.py:996] (2/4) Epoch 2, batch 18500, loss[loss=0.2715, simple_loss=0.3108, pruned_loss=0.1161, over 21375.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3238, pruned_loss=0.09663, over 4257325.15 frames. ], batch size: 473, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:07:46,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=293970.0, ans=0.0 2023-06-19 06:07:48,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293970.0, ans=0.1 2023-06-19 06:08:28,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=294030.0, ans=0.125 2023-06-19 06:09:36,610 INFO [train.py:996] (2/4) Epoch 2, batch 18550, loss[loss=0.221, simple_loss=0.2882, pruned_loss=0.07691, over 21507.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3219, pruned_loss=0.09564, over 4252813.01 frames. 
], batch size: 230, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:09:49,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=294270.0, ans=0.2 2023-06-19 06:10:04,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.529e+02 3.013e+02 3.541e+02 7.378e+02, threshold=6.027e+02, percent-clipped=2.0 2023-06-19 06:10:29,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=294390.0, ans=0.0 2023-06-19 06:10:44,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=294390.0, ans=0.0 2023-06-19 06:10:58,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=294450.0, ans=0.125 2023-06-19 06:11:07,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=294450.0, ans=0.125 2023-06-19 06:11:23,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-19 06:11:43,210 INFO [train.py:996] (2/4) Epoch 2, batch 18600, loss[loss=0.2238, simple_loss=0.2882, pruned_loss=0.07974, over 21255.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3197, pruned_loss=0.09568, over 4254470.51 frames. ], batch size: 176, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:11:52,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=294570.0, ans=0.0 2023-06-19 06:11:52,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-19 06:12:59,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=294690.0, ans=0.125 2023-06-19 06:13:44,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-19 06:13:49,187 INFO [train.py:996] (2/4) Epoch 2, batch 18650, loss[loss=0.2657, simple_loss=0.318, pruned_loss=0.1067, over 21976.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3196, pruned_loss=0.09643, over 4260999.31 frames. ], batch size: 113, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:14:20,857 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.816e+02 3.268e+02 3.940e+02 5.301e+02, threshold=6.535e+02, percent-clipped=0.0 2023-06-19 06:14:30,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=294930.0, ans=0.05 2023-06-19 06:15:11,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=294990.0, ans=0.125 2023-06-19 06:15:54,973 INFO [train.py:996] (2/4) Epoch 2, batch 18700, loss[loss=0.291, simple_loss=0.3371, pruned_loss=0.1224, over 21748.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3182, pruned_loss=0.09844, over 4267073.37 frames. ], batch size: 441, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:16:57,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.83 vs. 
limit=15.0 2023-06-19 06:17:09,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=295290.0, ans=0.125 2023-06-19 06:17:37,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0 2023-06-19 06:18:10,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=295410.0, ans=0.2 2023-06-19 06:18:13,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=295470.0, ans=0.0 2023-06-19 06:18:15,069 INFO [train.py:996] (2/4) Epoch 2, batch 18750, loss[loss=0.2913, simple_loss=0.3589, pruned_loss=0.1118, over 21632.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3198, pruned_loss=0.1008, over 4265193.68 frames. ], batch size: 230, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:18:16,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=295470.0, ans=0.125 2023-06-19 06:18:25,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=295470.0, ans=0.125 2023-06-19 06:18:34,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.770e+02 3.195e+02 3.990e+02 6.392e+02, threshold=6.389e+02, percent-clipped=0.0 2023-06-19 06:18:35,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.04 vs. limit=10.0 2023-06-19 06:19:57,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=295710.0, ans=0.2 2023-06-19 06:20:07,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=295770.0, ans=0.0 2023-06-19 06:20:08,755 INFO [train.py:996] (2/4) Epoch 2, batch 18800, loss[loss=0.2047, simple_loss=0.2828, pruned_loss=0.06331, over 21445.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3253, pruned_loss=0.1012, over 4257420.45 frames. ], batch size: 211, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:20:10,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=295770.0, ans=0.0 2023-06-19 06:22:26,188 INFO [train.py:996] (2/4) Epoch 2, batch 18850, loss[loss=0.1912, simple_loss=0.2666, pruned_loss=0.05795, over 21179.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3195, pruned_loss=0.0951, over 4262285.69 frames. 
], batch size: 143, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:22:52,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 2.651e+02 3.232e+02 4.439e+02 7.009e+02, threshold=6.464e+02, percent-clipped=3.0 2023-06-19 06:23:01,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=296130.0, ans=0.125 2023-06-19 06:23:31,206 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:23:50,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296250.0, ans=0.1 2023-06-19 06:24:03,769 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:24:28,385 INFO [train.py:996] (2/4) Epoch 2, batch 18900, loss[loss=0.2593, simple_loss=0.3045, pruned_loss=0.107, over 21738.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3183, pruned_loss=0.09656, over 4249266.58 frames. ], batch size: 282, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:24:31,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=296370.0, ans=0.125 2023-06-19 06:25:07,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=296430.0, ans=0.0 2023-06-19 06:25:20,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=296490.0, ans=0.0 2023-06-19 06:25:22,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-19 06:26:08,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=296550.0, ans=0.0 2023-06-19 06:26:09,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=296550.0, ans=0.035 2023-06-19 06:26:31,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=296610.0, ans=0.125 2023-06-19 06:26:37,115 INFO [train.py:996] (2/4) Epoch 2, batch 18950, loss[loss=0.2691, simple_loss=0.3029, pruned_loss=0.1177, over 20285.00 frames. ], tot_loss[loss=0.261, simple_loss=0.321, pruned_loss=0.1005, over 4253284.87 frames. ], batch size: 703, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:27:13,338 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.916e+02 3.423e+02 4.145e+02 6.065e+02, threshold=6.846e+02, percent-clipped=0.0 2023-06-19 06:27:29,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296730.0, ans=0.1 2023-06-19 06:28:57,539 INFO [train.py:996] (2/4) Epoch 2, batch 19000, loss[loss=0.3814, simple_loss=0.4163, pruned_loss=0.1732, over 21407.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3326, pruned_loss=0.1038, over 4252616.95 frames. ], batch size: 471, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:29:08,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.55 vs. 
limit=22.5 2023-06-19 06:30:09,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=297090.0, ans=0.07 2023-06-19 06:31:09,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=297210.0, ans=0.125 2023-06-19 06:31:14,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=22.5 2023-06-19 06:31:16,445 INFO [train.py:996] (2/4) Epoch 2, batch 19050, loss[loss=0.2777, simple_loss=0.3303, pruned_loss=0.1126, over 21798.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3386, pruned_loss=0.1091, over 4263599.43 frames. ], batch size: 282, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:31:56,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=297330.0, ans=0.2 2023-06-19 06:31:58,933 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 3.316e+02 4.197e+02 5.124e+02 7.398e+02, threshold=8.394e+02, percent-clipped=4.0 2023-06-19 06:32:42,787 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-19 06:33:17,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297510.0, ans=0.1 2023-06-19 06:33:19,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=297510.0, ans=0.09899494936611666 2023-06-19 06:33:40,903 INFO [train.py:996] (2/4) Epoch 2, batch 19100, loss[loss=0.2149, simple_loss=0.2832, pruned_loss=0.07333, over 21393.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3366, pruned_loss=0.1097, over 4267824.20 frames. ], batch size: 131, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:34:04,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=297630.0, ans=0.05 2023-06-19 06:34:06,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-19 06:34:57,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=297690.0, ans=0.125 2023-06-19 06:35:01,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=297750.0, ans=0.07 2023-06-19 06:35:07,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297750.0, ans=0.1 2023-06-19 06:35:16,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.73 vs. limit=6.0 2023-06-19 06:35:26,692 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:35:48,531 INFO [train.py:996] (2/4) Epoch 2, batch 19150, loss[loss=0.3103, simple_loss=0.3935, pruned_loss=0.1135, over 21700.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.337, pruned_loss=0.1098, over 4273520.50 frames. 
], batch size: 351, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:36:23,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 3.259e+02 3.778e+02 5.445e+02 1.039e+03, threshold=7.556e+02, percent-clipped=5.0 2023-06-19 06:36:24,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=297930.0, ans=0.2 2023-06-19 06:36:44,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.59 vs. limit=12.0 2023-06-19 06:37:46,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=298110.0, ans=0.125 2023-06-19 06:37:46,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=298110.0, ans=0.125 2023-06-19 06:38:21,147 INFO [train.py:996] (2/4) Epoch 2, batch 19200, loss[loss=0.3131, simple_loss=0.3984, pruned_loss=0.1139, over 21769.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3471, pruned_loss=0.1098, over 4279856.63 frames. ], batch size: 351, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:38:45,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-19 06:38:53,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=298230.0, ans=0.125 2023-06-19 06:38:54,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=298230.0, ans=6.0 2023-06-19 06:39:15,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=298290.0, ans=0.125 2023-06-19 06:39:43,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=298350.0, ans=0.125 2023-06-19 06:40:06,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=298410.0, ans=0.125 2023-06-19 06:40:23,951 INFO [train.py:996] (2/4) Epoch 2, batch 19250, loss[loss=0.2053, simple_loss=0.2976, pruned_loss=0.05648, over 21785.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3447, pruned_loss=0.1033, over 4283034.04 frames. ], batch size: 332, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:40:47,033 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 2.384e+02 2.937e+02 3.389e+02 6.470e+02, threshold=5.874e+02, percent-clipped=0.0 2023-06-19 06:40:48,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=22.5 2023-06-19 06:41:08,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.84 vs. limit=15.0 2023-06-19 06:41:16,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=298590.0, ans=0.1 2023-06-19 06:41:46,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=298650.0, ans=0.0 2023-06-19 06:42:30,659 INFO [train.py:996] (2/4) Epoch 2, batch 19300, loss[loss=0.279, simple_loss=0.3486, pruned_loss=0.1047, over 21531.00 frames. 
], tot_loss[loss=0.2739, simple_loss=0.3417, pruned_loss=0.103, over 4289614.35 frames. ], batch size: 471, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:42:32,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=298770.0, ans=0.125 2023-06-19 06:43:16,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=298830.0, ans=0.2 2023-06-19 06:43:50,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=298950.0, ans=0.125 2023-06-19 06:44:29,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=299010.0, ans=0.2 2023-06-19 06:44:31,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=299010.0, ans=0.125 2023-06-19 06:44:37,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-19 06:44:44,480 INFO [train.py:996] (2/4) Epoch 2, batch 19350, loss[loss=0.2565, simple_loss=0.3479, pruned_loss=0.08253, over 21205.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3359, pruned_loss=0.09816, over 4291259.56 frames. ], batch size: 548, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:45:06,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=299070.0, ans=0.5 2023-06-19 06:45:22,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=299130.0, ans=0.0 2023-06-19 06:45:28,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.566e+02 3.106e+02 3.775e+02 8.572e+02, threshold=6.211e+02, percent-clipped=2.0 2023-06-19 06:46:42,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-19 06:47:03,754 INFO [train.py:996] (2/4) Epoch 2, batch 19400, loss[loss=0.3517, simple_loss=0.3824, pruned_loss=0.1605, over 21720.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3359, pruned_loss=0.09878, over 4292880.20 frames. ], batch size: 508, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:48:00,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=299490.0, ans=0.125 2023-06-19 06:48:29,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=299550.0, ans=0.125 2023-06-19 06:49:09,820 INFO [train.py:996] (2/4) Epoch 2, batch 19450, loss[loss=0.2596, simple_loss=0.3102, pruned_loss=0.1045, over 20158.00 frames. ], tot_loss[loss=0.269, simple_loss=0.334, pruned_loss=0.102, over 4297407.08 frames. 
], batch size: 703, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:49:34,670 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.192e+02 3.820e+02 4.643e+02 7.190e+02, threshold=7.640e+02, percent-clipped=4.0 2023-06-19 06:49:48,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=299730.0, ans=0.0 2023-06-19 06:50:06,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299790.0, ans=0.1 2023-06-19 06:50:31,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=299850.0, ans=0.0 2023-06-19 06:51:12,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=299910.0, ans=0.125 2023-06-19 06:51:16,589 INFO [train.py:996] (2/4) Epoch 2, batch 19500, loss[loss=0.238, simple_loss=0.2801, pruned_loss=0.09796, over 21990.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3281, pruned_loss=0.1024, over 4295229.53 frames. ], batch size: 103, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:52:16,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-19 06:52:18,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=300030.0, ans=0.125 2023-06-19 06:52:48,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=300150.0, ans=0.05 2023-06-19 06:53:27,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.72 vs. limit=22.5 2023-06-19 06:53:38,792 INFO [train.py:996] (2/4) Epoch 2, batch 19550, loss[loss=0.187, simple_loss=0.2585, pruned_loss=0.05778, over 21335.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3228, pruned_loss=0.1, over 4282026.55 frames. ], batch size: 176, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:53:59,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=300270.0, ans=0.0 2023-06-19 06:54:24,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300330.0, ans=0.1 2023-06-19 06:54:27,317 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.773e+02 3.270e+02 3.884e+02 6.073e+02, threshold=6.540e+02, percent-clipped=0.0 2023-06-19 06:54:42,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=300390.0, ans=0.2 2023-06-19 06:55:27,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-19 06:55:56,406 INFO [train.py:996] (2/4) Epoch 2, batch 19600, loss[loss=0.2743, simple_loss=0.3305, pruned_loss=0.109, over 21641.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3261, pruned_loss=0.1012, over 4287027.65 frames. 
], batch size: 263, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:55:58,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=300570.0, ans=0.5 2023-06-19 06:56:59,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=300690.0, ans=0.2 2023-06-19 06:58:20,665 INFO [train.py:996] (2/4) Epoch 2, batch 19650, loss[loss=0.2564, simple_loss=0.3397, pruned_loss=0.08649, over 19949.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.333, pruned_loss=0.1066, over 4283282.94 frames. ], batch size: 704, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:58:21,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=300870.0, ans=10.0 2023-06-19 06:58:54,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.810e+02 3.179e+02 3.741e+02 7.713e+02, threshold=6.358e+02, percent-clipped=2.0 2023-06-19 06:59:33,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-19 06:59:55,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.66 vs. limit=15.0 2023-06-19 07:00:55,589 INFO [train.py:996] (2/4) Epoch 2, batch 19700, loss[loss=0.2766, simple_loss=0.3585, pruned_loss=0.09738, over 21649.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3377, pruned_loss=0.1077, over 4288253.72 frames. ], batch size: 414, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:01:56,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301290.0, ans=0.1 2023-06-19 07:01:56,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=301290.0, ans=0.125 2023-06-19 07:02:23,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301350.0, ans=0.1 2023-06-19 07:02:52,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-19 07:03:08,468 INFO [train.py:996] (2/4) Epoch 2, batch 19750, loss[loss=0.2834, simple_loss=0.3546, pruned_loss=0.1061, over 21433.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3461, pruned_loss=0.1096, over 4276624.45 frames. ], batch size: 194, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:03:44,312 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.569e+02 3.192e+02 3.926e+02 7.719e+02, threshold=6.384e+02, percent-clipped=3.0 2023-06-19 07:04:21,476 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-19 07:05:27,407 INFO [train.py:996] (2/4) Epoch 2, batch 19800, loss[loss=0.2212, simple_loss=0.2761, pruned_loss=0.08319, over 21487.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3463, pruned_loss=0.111, over 4279858.73 frames. 
], batch size: 131, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:06:43,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=301890.0, ans=0.125 2023-06-19 07:06:43,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=301890.0, ans=0.0 2023-06-19 07:06:58,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=301950.0, ans=0.125 2023-06-19 07:07:24,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=301950.0, ans=0.125 2023-06-19 07:07:35,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-19 07:07:36,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=302010.0, ans=0.125 2023-06-19 07:07:55,597 INFO [train.py:996] (2/4) Epoch 2, batch 19850, loss[loss=0.3043, simple_loss=0.3534, pruned_loss=0.1276, over 19967.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3373, pruned_loss=0.1042, over 4275133.38 frames. ], batch size: 702, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:08:29,053 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.580e+02 3.192e+02 4.086e+02 8.227e+02, threshold=6.384e+02, percent-clipped=3.0 2023-06-19 07:09:29,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=302250.0, ans=0.0 2023-06-19 07:09:44,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=302310.0, ans=22.5 2023-06-19 07:10:15,734 INFO [train.py:996] (2/4) Epoch 2, batch 19900, loss[loss=0.2262, simple_loss=0.2852, pruned_loss=0.08357, over 21195.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3362, pruned_loss=0.1003, over 4272690.97 frames. ], batch size: 176, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:11:15,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=302490.0, ans=10.0 2023-06-19 07:12:13,339 INFO [train.py:996] (2/4) Epoch 2, batch 19950, loss[loss=0.2388, simple_loss=0.282, pruned_loss=0.09773, over 21557.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3304, pruned_loss=0.1002, over 4268332.32 frames. ], batch size: 247, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:12:34,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.96 vs. limit=22.5 2023-06-19 07:12:46,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.592e+02 3.325e+02 4.122e+02 6.437e+02, threshold=6.651e+02, percent-clipped=1.0 2023-06-19 07:13:02,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.43 vs. limit=10.0 2023-06-19 07:13:02,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.64 vs. 
limit=22.5 2023-06-19 07:13:10,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302730.0, ans=0.1 2023-06-19 07:13:46,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=302850.0, ans=0.1 2023-06-19 07:14:28,310 INFO [train.py:996] (2/4) Epoch 2, batch 20000, loss[loss=0.279, simple_loss=0.3366, pruned_loss=0.1107, over 21522.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3315, pruned_loss=0.1007, over 4267463.04 frames. ], batch size: 194, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:14:51,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=302970.0, ans=0.125 2023-06-19 07:14:55,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=303030.0, ans=0.125 2023-06-19 07:15:09,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=303030.0, ans=0.125 2023-06-19 07:15:09,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=303030.0, ans=0.125 2023-06-19 07:15:17,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-19 07:16:02,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=303150.0, ans=0.0 2023-06-19 07:16:05,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=303150.0, ans=0.125 2023-06-19 07:16:41,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=303210.0, ans=0.0 2023-06-19 07:16:47,201 INFO [train.py:996] (2/4) Epoch 2, batch 20050, loss[loss=0.2675, simple_loss=0.3222, pruned_loss=0.1064, over 21477.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3343, pruned_loss=0.1036, over 4279564.13 frames. ], batch size: 194, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:16:47,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=303270.0, ans=0.04949747468305833 2023-06-19 07:16:50,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=303270.0, ans=0.0 2023-06-19 07:17:17,714 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 3.033e+02 3.423e+02 3.985e+02 8.117e+02, threshold=6.846e+02, percent-clipped=3.0 2023-06-19 07:17:36,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303330.0, ans=0.1 2023-06-19 07:19:06,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=303570.0, ans=0.2 2023-06-19 07:19:11,606 INFO [train.py:996] (2/4) Epoch 2, batch 20100, loss[loss=0.3522, simple_loss=0.4423, pruned_loss=0.131, over 21301.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3379, pruned_loss=0.1064, over 4281556.76 frames. 
], batch size: 548, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:19:36,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-19 07:20:24,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=303690.0, ans=0.0 2023-06-19 07:20:42,971 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:20:43,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303750.0, ans=0.1 2023-06-19 07:21:56,485 INFO [train.py:996] (2/4) Epoch 2, batch 20150, loss[loss=0.3067, simple_loss=0.3656, pruned_loss=0.1239, over 21384.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3488, pruned_loss=0.1115, over 4277463.51 frames. ], batch size: 131, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:22:10,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.73 vs. limit=10.0 2023-06-19 07:22:26,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.980e+02 4.036e+02 4.841e+02 1.073e+03, threshold=8.072e+02, percent-clipped=4.0 2023-06-19 07:23:22,785 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:23:48,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.28 vs. limit=10.0 2023-06-19 07:23:59,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=304110.0, ans=0.125 2023-06-19 07:24:04,560 INFO [train.py:996] (2/4) Epoch 2, batch 20200, loss[loss=0.3203, simple_loss=0.3825, pruned_loss=0.1291, over 21369.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3565, pruned_loss=0.1149, over 4278993.27 frames. ], batch size: 131, lr: 1.57e-02, grad_scale: 16.0 2023-06-19 07:24:45,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=304230.0, ans=0.1 2023-06-19 07:25:37,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=304350.0, ans=0.1 2023-06-19 07:26:28,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=304410.0, ans=0.0 2023-06-19 07:26:31,276 INFO [train.py:996] (2/4) Epoch 2, batch 20250, loss[loss=0.2739, simple_loss=0.3265, pruned_loss=0.1107, over 21305.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.3563, pruned_loss=0.1127, over 4284284.23 frames. ], batch size: 159, lr: 1.57e-02, grad_scale: 16.0 2023-06-19 07:26:55,244 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.816e+02 3.333e+02 3.978e+02 6.194e+02, threshold=6.665e+02, percent-clipped=0.0 2023-06-19 07:27:37,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=304590.0, ans=0.125 2023-06-19 07:28:31,361 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. 
limit=15.0 2023-06-19 07:28:31,497 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.92 vs. limit=15.0 2023-06-19 07:28:36,322 INFO [train.py:996] (2/4) Epoch 2, batch 20300, loss[loss=0.2783, simple_loss=0.3594, pruned_loss=0.0986, over 21745.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3512, pruned_loss=0.108, over 4274546.03 frames. ], batch size: 414, lr: 1.57e-02, grad_scale: 16.0 2023-06-19 07:28:39,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=304770.0, ans=0.125 2023-06-19 07:28:45,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2023-06-19 07:29:09,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=304830.0, ans=0.0 2023-06-19 07:29:14,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-19 07:29:41,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=304950.0, ans=0.0 2023-06-19 07:29:48,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=304950.0, ans=0.125 2023-06-19 07:29:48,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=304950.0, ans=0.0 2023-06-19 07:30:34,142 INFO [train.py:996] (2/4) Epoch 2, batch 20350, loss[loss=0.2993, simple_loss=0.3674, pruned_loss=0.1156, over 19830.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3519, pruned_loss=0.1088, over 4271696.36 frames. ], batch size: 704, lr: 1.57e-02, grad_scale: 16.0 2023-06-19 07:30:40,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=305070.0, ans=0.2 2023-06-19 07:31:00,711 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.745e+02 3.216e+02 3.932e+02 7.808e+02, threshold=6.432e+02, percent-clipped=2.0 2023-06-19 07:31:34,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=305190.0, ans=0.0 2023-06-19 07:32:55,104 INFO [train.py:996] (2/4) Epoch 2, batch 20400, loss[loss=0.3481, simple_loss=0.4015, pruned_loss=0.1474, over 21628.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3562, pruned_loss=0.1122, over 4267259.98 frames. ], batch size: 389, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:33:04,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=305370.0, ans=10.0 2023-06-19 07:33:19,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2023-06-19 07:34:20,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.16 vs. limit=10.0 2023-06-19 07:34:50,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305610.0, ans=0.1 2023-06-19 07:35:00,799 INFO [train.py:996] (2/4) Epoch 2, batch 20450, loss[loss=0.2992, simple_loss=0.3479, pruned_loss=0.1252, over 21479.00 frames. 
], tot_loss[loss=0.2949, simple_loss=0.3571, pruned_loss=0.1163, over 4257934.50 frames. ], batch size: 194, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:35:33,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=305730.0, ans=0.125 2023-06-19 07:35:34,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.893e+02 2.841e+02 3.405e+02 4.322e+02 6.691e+02, threshold=6.810e+02, percent-clipped=2.0 2023-06-19 07:36:43,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.04 vs. limit=10.0 2023-06-19 07:36:58,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=305910.0, ans=0.0 2023-06-19 07:37:12,725 INFO [train.py:996] (2/4) Epoch 2, batch 20500, loss[loss=0.2773, simple_loss=0.3262, pruned_loss=0.1142, over 21795.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3522, pruned_loss=0.1159, over 4257128.59 frames. ], batch size: 371, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:37:30,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.34 vs. limit=15.0 2023-06-19 07:38:01,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306030.0, ans=0.1 2023-06-19 07:38:16,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.11 vs. limit=10.0 2023-06-19 07:39:27,695 INFO [train.py:996] (2/4) Epoch 2, batch 20550, loss[loss=0.2095, simple_loss=0.2874, pruned_loss=0.06576, over 15945.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3447, pruned_loss=0.1138, over 4252418.57 frames. ], batch size: 60, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:39:37,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=306270.0, ans=0.125 2023-06-19 07:40:03,743 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.682e+02 3.144e+02 3.592e+02 6.172e+02, threshold=6.288e+02, percent-clipped=0.0 2023-06-19 07:40:52,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=306390.0, ans=0.125 2023-06-19 07:40:53,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=306450.0, ans=0.125 2023-06-19 07:41:02,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=306450.0, ans=0.2 2023-06-19 07:41:13,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306450.0, ans=0.1 2023-06-19 07:41:39,993 INFO [train.py:996] (2/4) Epoch 2, batch 20600, loss[loss=0.2921, simple_loss=0.3436, pruned_loss=0.1203, over 21811.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3455, pruned_loss=0.1115, over 4248970.14 frames. 
], batch size: 414, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:41:50,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=306570.0, ans=0.0 2023-06-19 07:43:35,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=306810.0, ans=0.0 2023-06-19 07:43:48,061 INFO [train.py:996] (2/4) Epoch 2, batch 20650, loss[loss=0.249, simple_loss=0.3014, pruned_loss=0.0983, over 21844.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3415, pruned_loss=0.1121, over 4254331.31 frames. ], batch size: 247, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:43:51,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=306870.0, ans=10.0 2023-06-19 07:43:57,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306870.0, ans=0.1 2023-06-19 07:44:14,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=306930.0, ans=0.125 2023-06-19 07:44:14,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306930.0, ans=0.1 2023-06-19 07:44:18,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.691e+02 3.207e+02 4.278e+02 6.062e+02, threshold=6.414e+02, percent-clipped=0.0 2023-06-19 07:44:34,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=306990.0, ans=0.125 2023-06-19 07:45:13,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=307050.0, ans=0.125 2023-06-19 07:45:42,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307110.0, ans=0.1 2023-06-19 07:45:46,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=307110.0, ans=0.2 2023-06-19 07:45:53,884 INFO [train.py:996] (2/4) Epoch 2, batch 20700, loss[loss=0.2293, simple_loss=0.301, pruned_loss=0.07881, over 21677.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3341, pruned_loss=0.1069, over 4251987.45 frames. ], batch size: 298, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:45:54,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=307170.0, ans=0.125 2023-06-19 07:45:54,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=307170.0, ans=0.125 2023-06-19 07:46:26,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=307230.0, ans=0.125 2023-06-19 07:48:09,364 INFO [train.py:996] (2/4) Epoch 2, batch 20750, loss[loss=0.3053, simple_loss=0.3781, pruned_loss=0.1162, over 21449.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3353, pruned_loss=0.1052, over 4253094.70 frames. 
], batch size: 194, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:48:44,916 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.969e+02 3.611e+02 4.710e+02 7.755e+02, threshold=7.221e+02, percent-clipped=2.0 2023-06-19 07:49:22,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307590.0, ans=0.1 2023-06-19 07:50:08,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=307710.0, ans=0.125 2023-06-19 07:50:11,242 INFO [train.py:996] (2/4) Epoch 2, batch 20800, loss[loss=0.1697, simple_loss=0.2108, pruned_loss=0.06428, over 17982.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3393, pruned_loss=0.1072, over 4241591.71 frames. ], batch size: 67, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:51:51,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=307950.0, ans=0.0 2023-06-19 07:51:54,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=307950.0, ans=0.07 2023-06-19 07:52:13,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=308010.0, ans=0.125 2023-06-19 07:52:34,128 INFO [train.py:996] (2/4) Epoch 2, batch 20850, loss[loss=0.2399, simple_loss=0.3048, pruned_loss=0.08748, over 21792.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3317, pruned_loss=0.1045, over 4247393.65 frames. ], batch size: 282, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:53:03,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=308130.0, ans=15.0 2023-06-19 07:53:03,496 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.789e+02 3.231e+02 4.140e+02 1.099e+03, threshold=6.461e+02, percent-clipped=5.0 2023-06-19 07:54:21,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=308310.0, ans=0.2 2023-06-19 07:54:37,012 INFO [train.py:996] (2/4) Epoch 2, batch 20900, loss[loss=0.2549, simple_loss=0.3242, pruned_loss=0.09276, over 21460.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3327, pruned_loss=0.1052, over 4256954.54 frames. ], batch size: 195, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:54:47,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=308370.0, ans=0.2 2023-06-19 07:56:00,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-19 07:56:04,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. 
limit=12.0 2023-06-19 07:56:15,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=308610.0, ans=0.125 2023-06-19 07:56:17,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=308610.0, ans=0.125 2023-06-19 07:56:18,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=308610.0, ans=0.125 2023-06-19 07:56:33,637 INFO [train.py:996] (2/4) Epoch 2, batch 20950, loss[loss=0.1995, simple_loss=0.2652, pruned_loss=0.06695, over 21404.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3283, pruned_loss=0.1007, over 4248544.20 frames. ], batch size: 131, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:56:56,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 2.542e+02 3.174e+02 4.032e+02 6.054e+02, threshold=6.348e+02, percent-clipped=0.0 2023-06-19 07:57:47,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=308790.0, ans=0.1 2023-06-19 07:58:13,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=308850.0, ans=0.125 2023-06-19 07:58:38,194 INFO [train.py:996] (2/4) Epoch 2, batch 21000, loss[loss=0.1934, simple_loss=0.2555, pruned_loss=0.0657, over 17690.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3268, pruned_loss=0.1006, over 4249665.90 frames. ], batch size: 68, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:58:38,195 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 07:59:32,905 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2892, simple_loss=0.3858, pruned_loss=0.09632, over 1796401.00 frames. 2023-06-19 07:59:32,906 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 07:59:48,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=309030.0, ans=0.125 2023-06-19 08:00:04,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=309090.0, ans=0.1 2023-06-19 08:00:28,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=309150.0, ans=0.125 2023-06-19 08:00:44,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=309150.0, ans=0.0 2023-06-19 08:00:57,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=309210.0, ans=0.0 2023-06-19 08:01:01,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=309210.0, ans=0.2 2023-06-19 08:01:14,076 INFO [train.py:996] (2/4) Epoch 2, batch 21050, loss[loss=0.2834, simple_loss=0.3235, pruned_loss=0.1216, over 21462.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3254, pruned_loss=0.1012, over 4246848.43 frames. ], batch size: 441, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 08:01:45,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.810e+02 3.247e+02 3.992e+02 5.990e+02, threshold=6.494e+02, percent-clipped=0.0 2023-06-19 08:03:08,456 INFO [train.py:996] (2/4) Epoch 2, batch 21100, loss[loss=0.2764, simple_loss=0.3321, pruned_loss=0.1103, over 21988.00 frames. 
], tot_loss[loss=0.2623, simple_loss=0.3231, pruned_loss=0.1008, over 4228999.12 frames. ], batch size: 103, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 08:04:38,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.91 vs. limit=10.0 2023-06-19 08:04:58,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=309750.0, ans=0.125 2023-06-19 08:05:00,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309750.0, ans=0.1 2023-06-19 08:05:17,514 INFO [train.py:996] (2/4) Epoch 2, batch 21150, loss[loss=0.2622, simple_loss=0.3105, pruned_loss=0.107, over 21851.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3195, pruned_loss=0.1017, over 4241075.15 frames. ], batch size: 373, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 08:05:25,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.89 vs. limit=6.0 2023-06-19 08:05:28,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=309870.0, ans=0.125 2023-06-19 08:05:38,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=309930.0, ans=0.07 2023-06-19 08:05:40,951 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.684e+02 3.280e+02 4.252e+02 7.142e+02, threshold=6.560e+02, percent-clipped=1.0 2023-06-19 08:05:42,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=309930.0, ans=0.1 2023-06-19 08:06:10,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=309990.0, ans=0.125 2023-06-19 08:07:12,908 INFO [train.py:996] (2/4) Epoch 2, batch 21200, loss[loss=0.2237, simple_loss=0.2845, pruned_loss=0.08141, over 21316.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3145, pruned_loss=0.1006, over 4239971.99 frames. ], batch size: 131, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 08:07:13,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2023-06-19 08:07:23,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=310170.0, ans=0.0 2023-06-19 08:08:02,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-19 08:08:35,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=310350.0, ans=0.09899494936611666 2023-06-19 08:08:38,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=310350.0, ans=0.0 2023-06-19 08:09:05,316 INFO [train.py:996] (2/4) Epoch 2, batch 21250, loss[loss=0.2336, simple_loss=0.2899, pruned_loss=0.08862, over 21579.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3119, pruned_loss=0.09946, over 4250284.63 frames. 
], batch size: 263, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:09:08,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=310470.0, ans=0.2 2023-06-19 08:09:28,386 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.442e+02 2.877e+02 3.403e+02 5.738e+02, threshold=5.754e+02, percent-clipped=0.0 2023-06-19 08:09:31,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=310530.0, ans=0.1 2023-06-19 08:09:41,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=310530.0, ans=0.1 2023-06-19 08:10:33,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=310650.0, ans=10.0 2023-06-19 08:10:35,870 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:11:05,116 INFO [train.py:996] (2/4) Epoch 2, batch 21300, loss[loss=0.2663, simple_loss=0.3308, pruned_loss=0.1009, over 21915.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3167, pruned_loss=0.1014, over 4256583.24 frames. ], batch size: 316, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:13:27,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=311010.0, ans=10.0 2023-06-19 08:13:31,858 INFO [train.py:996] (2/4) Epoch 2, batch 21350, loss[loss=0.2202, simple_loss=0.3039, pruned_loss=0.0682, over 21591.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3229, pruned_loss=0.1031, over 4264303.51 frames. ], batch size: 230, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:13:36,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=311070.0, ans=0.125 2023-06-19 08:14:12,353 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.747e+02 3.368e+02 4.316e+02 7.083e+02, threshold=6.735e+02, percent-clipped=3.0 2023-06-19 08:14:16,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=311130.0, ans=0.2 2023-06-19 08:15:39,931 INFO [train.py:996] (2/4) Epoch 2, batch 21400, loss[loss=0.3249, simple_loss=0.3803, pruned_loss=0.1348, over 21332.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3259, pruned_loss=0.1019, over 4265734.54 frames. ], batch size: 549, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:16:50,398 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=15.0 2023-06-19 08:17:19,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2023-06-19 08:18:01,522 INFO [train.py:996] (2/4) Epoch 2, batch 21450, loss[loss=0.2799, simple_loss=0.3222, pruned_loss=0.1188, over 20086.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3334, pruned_loss=0.1066, over 4274786.90 frames. 
], batch size: 703, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:18:11,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311670.0, ans=0.1 2023-06-19 08:18:15,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=311670.0, ans=0.125 2023-06-19 08:18:35,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.772e+02 3.350e+02 3.887e+02 8.399e+02, threshold=6.699e+02, percent-clipped=2.0 2023-06-19 08:18:36,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=311730.0, ans=0.0 2023-06-19 08:19:30,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-19 08:19:36,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=311910.0, ans=0.125 2023-06-19 08:19:47,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=311910.0, ans=0.0 2023-06-19 08:19:48,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-19 08:20:14,674 INFO [train.py:996] (2/4) Epoch 2, batch 21500, loss[loss=0.2383, simple_loss=0.292, pruned_loss=0.0923, over 21223.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3321, pruned_loss=0.1083, over 4271488.12 frames. ], batch size: 176, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:20:32,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=311970.0, ans=0.0 2023-06-19 08:21:03,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-19 08:21:05,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=312030.0, ans=0.2 2023-06-19 08:22:17,611 INFO [train.py:996] (2/4) Epoch 2, batch 21550, loss[loss=0.2462, simple_loss=0.3133, pruned_loss=0.08957, over 21735.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3242, pruned_loss=0.1046, over 4276494.12 frames. ], batch size: 112, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:22:36,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=312270.0, ans=0.5 2023-06-19 08:22:45,299 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.454e+02 2.906e+02 3.459e+02 5.516e+02, threshold=5.812e+02, percent-clipped=0.0 2023-06-19 08:23:00,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=312330.0, ans=0.2 2023-06-19 08:23:56,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=312510.0, ans=0.2 2023-06-19 08:24:29,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=312510.0, ans=0.1 2023-06-19 08:24:33,359 INFO [train.py:996] (2/4) Epoch 2, batch 21600, loss[loss=0.2319, simple_loss=0.3034, pruned_loss=0.08014, over 21179.00 frames. 
], tot_loss[loss=0.2613, simple_loss=0.3185, pruned_loss=0.1021, over 4273751.42 frames. ], batch size: 176, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:24:42,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=312570.0, ans=0.125 2023-06-19 08:24:45,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=312570.0, ans=0.2 2023-06-19 08:25:29,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=312690.0, ans=0.125 2023-06-19 08:26:23,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=312810.0, ans=0.0 2023-06-19 08:26:33,275 INFO [train.py:996] (2/4) Epoch 2, batch 21650, loss[loss=0.3224, simple_loss=0.4012, pruned_loss=0.1218, over 21544.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.325, pruned_loss=0.1007, over 4267920.81 frames. ], batch size: 471, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:26:36,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=312870.0, ans=0.0 2023-06-19 08:26:39,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.23 vs. limit=12.0 2023-06-19 08:26:47,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.40 vs. limit=15.0 2023-06-19 08:26:50,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.980e+02 3.690e+02 4.543e+02 8.571e+02, threshold=7.379e+02, percent-clipped=9.0 2023-06-19 08:27:16,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=312990.0, ans=0.0 2023-06-19 08:27:17,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=312990.0, ans=10.0 2023-06-19 08:27:39,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=313050.0, ans=0.125 2023-06-19 08:28:16,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313110.0, ans=0.1 2023-06-19 08:28:19,140 INFO [train.py:996] (2/4) Epoch 2, batch 21700, loss[loss=0.2377, simple_loss=0.292, pruned_loss=0.0917, over 21811.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3231, pruned_loss=0.09773, over 4262347.78 frames. ], batch size: 102, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:28:19,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=313170.0, ans=0.125 2023-06-19 08:28:39,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=313170.0, ans=0.2 2023-06-19 08:29:28,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=313350.0, ans=0.125 2023-06-19 08:30:23,471 INFO [train.py:996] (2/4) Epoch 2, batch 21750, loss[loss=0.2335, simple_loss=0.2918, pruned_loss=0.08766, over 21491.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3184, pruned_loss=0.09856, over 4268055.15 frames. 
], batch size: 212, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:30:34,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=12.0 2023-06-19 08:30:47,364 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.599e+02 3.130e+02 4.451e+02 8.277e+02, threshold=6.259e+02, percent-clipped=1.0 2023-06-19 08:31:00,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313530.0, ans=0.1 2023-06-19 08:31:34,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=313650.0, ans=15.0 2023-06-19 08:31:58,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=313710.0, ans=0.0 2023-06-19 08:32:06,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=313710.0, ans=0.0 2023-06-19 08:32:06,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=313710.0, ans=0.125 2023-06-19 08:32:34,375 INFO [train.py:996] (2/4) Epoch 2, batch 21800, loss[loss=0.2376, simple_loss=0.2886, pruned_loss=0.09325, over 21446.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3175, pruned_loss=0.09937, over 4255391.92 frames. ], batch size: 212, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:32:36,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=313770.0, ans=0.125 2023-06-19 08:33:12,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=313830.0, ans=0.125 2023-06-19 08:33:40,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=313950.0, ans=0.125 2023-06-19 08:34:19,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=314010.0, ans=0.2 2023-06-19 08:34:22,215 INFO [train.py:996] (2/4) Epoch 2, batch 21850, loss[loss=0.2431, simple_loss=0.3102, pruned_loss=0.08807, over 21748.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3232, pruned_loss=0.1004, over 4248200.71 frames. ], batch size: 247, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:34:44,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.21 vs. 
limit=15.0 2023-06-19 08:34:56,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.670e+02 3.094e+02 3.698e+02 5.413e+02, threshold=6.187e+02, percent-clipped=0.0 2023-06-19 08:34:58,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=314130.0, ans=0.125 2023-06-19 08:35:03,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314130.0, ans=0.125 2023-06-19 08:35:37,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=314190.0, ans=0.125 2023-06-19 08:35:38,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=314190.0, ans=10.0 2023-06-19 08:36:13,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=314310.0, ans=0.125 2023-06-19 08:36:37,151 INFO [train.py:996] (2/4) Epoch 2, batch 21900, loss[loss=0.237, simple_loss=0.2943, pruned_loss=0.08983, over 21547.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3241, pruned_loss=0.1019, over 4260518.26 frames. ], batch size: 212, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:36:38,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=314370.0, ans=0.125 2023-06-19 08:36:43,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=314370.0, ans=0.125 2023-06-19 08:38:33,242 INFO [train.py:996] (2/4) Epoch 2, batch 21950, loss[loss=0.1592, simple_loss=0.24, pruned_loss=0.03919, over 21645.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3185, pruned_loss=0.1006, over 4263914.26 frames. ], batch size: 263, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 08:38:44,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=314670.0, ans=0.5 2023-06-19 08:38:57,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=314730.0, ans=0.2 2023-06-19 08:39:07,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.758e+02 3.297e+02 3.878e+02 5.596e+02, threshold=6.593e+02, percent-clipped=0.0 2023-06-19 08:39:15,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=314730.0, ans=0.125 2023-06-19 08:40:29,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0 2023-06-19 08:40:39,711 INFO [train.py:996] (2/4) Epoch 2, batch 22000, loss[loss=0.2735, simple_loss=0.3229, pruned_loss=0.112, over 21457.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3114, pruned_loss=0.09656, over 4264579.51 frames. 
], batch size: 389, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 08:40:41,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=314970.0, ans=0.0 2023-06-19 08:40:50,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=314970.0, ans=0.1 2023-06-19 08:41:28,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=315090.0, ans=0.125 2023-06-19 08:42:51,923 INFO [train.py:996] (2/4) Epoch 2, batch 22050, loss[loss=0.3201, simple_loss=0.3771, pruned_loss=0.1315, over 21248.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3155, pruned_loss=0.09803, over 4248190.46 frames. ], batch size: 159, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 08:42:55,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=315270.0, ans=0.0 2023-06-19 08:43:21,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.619e+02 3.176e+02 4.335e+02 6.749e+02, threshold=6.352e+02, percent-clipped=1.0 2023-06-19 08:43:25,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.54 vs. limit=22.5 2023-06-19 08:43:59,993 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:45:06,813 INFO [train.py:996] (2/4) Epoch 2, batch 22100, loss[loss=0.3009, simple_loss=0.3582, pruned_loss=0.1218, over 21694.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.33, pruned_loss=0.1057, over 4249361.74 frames. ], batch size: 389, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 08:45:33,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=315630.0, ans=0.0 2023-06-19 08:45:54,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=315690.0, ans=0.07 2023-06-19 08:47:01,478 INFO [train.py:996] (2/4) Epoch 2, batch 22150, loss[loss=0.2645, simple_loss=0.3316, pruned_loss=0.09871, over 21577.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3328, pruned_loss=0.1074, over 4268244.66 frames. ], batch size: 195, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 08:47:01,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=315870.0, ans=0.125 2023-06-19 08:47:09,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=315870.0, ans=0.0 2023-06-19 08:47:12,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-19 08:47:18,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.78 vs. 
limit=15.0 2023-06-19 08:47:23,872 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 3.220e+02 3.673e+02 4.139e+02 7.886e+02, threshold=7.346e+02, percent-clipped=1.0 2023-06-19 08:48:56,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=316110.0, ans=0.2 2023-06-19 08:49:06,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=316110.0, ans=0.0 2023-06-19 08:49:16,316 INFO [train.py:996] (2/4) Epoch 2, batch 22200, loss[loss=0.276, simple_loss=0.3532, pruned_loss=0.09938, over 21821.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3344, pruned_loss=0.1084, over 4277812.66 frames. ], batch size: 282, lr: 1.54e-02, grad_scale: 64.0 2023-06-19 08:49:17,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.03 vs. limit=22.5 2023-06-19 08:49:18,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-19 08:49:27,219 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:49:38,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=316230.0, ans=0.2 2023-06-19 08:49:38,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=316230.0, ans=0.2 2023-06-19 08:50:45,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=22.5 2023-06-19 08:50:46,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=316350.0, ans=0.0 2023-06-19 08:51:21,178 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:51:24,969 INFO [train.py:996] (2/4) Epoch 2, batch 22250, loss[loss=0.3395, simple_loss=0.4069, pruned_loss=0.136, over 21806.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3426, pruned_loss=0.1105, over 4285498.14 frames. ], batch size: 124, lr: 1.54e-02, grad_scale: 64.0 2023-06-19 08:51:45,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=316470.0, ans=0.2 2023-06-19 08:51:58,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=316530.0, ans=15.0 2023-06-19 08:51:58,717 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.811e+02 3.425e+02 4.060e+02 7.172e+02, threshold=6.851e+02, percent-clipped=0.0 2023-06-19 08:53:07,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=22.5 2023-06-19 08:53:34,327 INFO [train.py:996] (2/4) Epoch 2, batch 22300, loss[loss=0.2831, simple_loss=0.3305, pruned_loss=0.1178, over 21873.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3442, pruned_loss=0.1128, over 4291018.91 frames. 
], batch size: 298, lr: 1.54e-02, grad_scale: 64.0 2023-06-19 08:54:00,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=316770.0, ans=0.2 2023-06-19 08:54:10,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=316830.0, ans=0.5 2023-06-19 08:54:10,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-19 08:55:39,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-19 08:55:46,281 INFO [train.py:996] (2/4) Epoch 2, batch 22350, loss[loss=0.2409, simple_loss=0.3128, pruned_loss=0.08449, over 21640.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3418, pruned_loss=0.1129, over 4298110.64 frames. ], batch size: 263, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 08:56:04,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=317070.0, ans=0.125 2023-06-19 08:56:32,044 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.798e+02 3.468e+02 4.054e+02 6.110e+02, threshold=6.936e+02, percent-clipped=0.0 2023-06-19 08:56:45,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=317190.0, ans=0.2 2023-06-19 08:57:44,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=317250.0, ans=0.125 2023-06-19 08:57:48,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=317310.0, ans=0.125 2023-06-19 08:58:00,508 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:58:02,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=317310.0, ans=0.125 2023-06-19 08:58:18,576 INFO [train.py:996] (2/4) Epoch 2, batch 22400, loss[loss=0.2461, simple_loss=0.313, pruned_loss=0.08959, over 21400.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.337, pruned_loss=0.1081, over 4290388.24 frames. ], batch size: 131, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 08:59:45,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=317550.0, ans=0.125 2023-06-19 08:59:55,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0 2023-06-19 08:59:59,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=317610.0, ans=0.125 2023-06-19 09:00:00,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.20 vs. limit=10.0 2023-06-19 09:00:19,685 INFO [train.py:996] (2/4) Epoch 2, batch 22450, loss[loss=0.2357, simple_loss=0.2919, pruned_loss=0.08972, over 21654.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3303, pruned_loss=0.1065, over 4277339.79 frames. 
], batch size: 282, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 09:00:50,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.620e+02 3.213e+02 3.522e+02 5.684e+02, threshold=6.426e+02, percent-clipped=0.0 2023-06-19 09:01:31,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-19 09:01:52,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=317850.0, ans=0.125 2023-06-19 09:02:00,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=317910.0, ans=0.0 2023-06-19 09:02:27,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=317910.0, ans=0.2 2023-06-19 09:02:29,737 INFO [train.py:996] (2/4) Epoch 2, batch 22500, loss[loss=0.3481, simple_loss=0.4205, pruned_loss=0.1378, over 21600.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3274, pruned_loss=0.1065, over 4274919.74 frames. ], batch size: 414, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 09:03:17,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=318030.0, ans=0.2 2023-06-19 09:03:40,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=318090.0, ans=0.0 2023-06-19 09:03:40,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=318090.0, ans=0.05 2023-06-19 09:03:59,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=318150.0, ans=0.125 2023-06-19 09:04:15,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-19 09:04:50,918 INFO [train.py:996] (2/4) Epoch 2, batch 22550, loss[loss=0.3236, simple_loss=0.3691, pruned_loss=0.1391, over 21834.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3334, pruned_loss=0.1076, over 4276103.56 frames. ], batch size: 441, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 09:05:17,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=318270.0, ans=0.2 2023-06-19 09:05:26,001 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.743e+02 3.376e+02 4.420e+02 1.013e+03, threshold=6.752e+02, percent-clipped=6.0 2023-06-19 09:05:26,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=318330.0, ans=0.2 2023-06-19 09:07:03,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=318510.0, ans=0.125 2023-06-19 09:07:08,867 INFO [train.py:996] (2/4) Epoch 2, batch 22600, loss[loss=0.2672, simple_loss=0.3352, pruned_loss=0.09963, over 21791.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3362, pruned_loss=0.1085, over 4282092.07 frames. 
], batch size: 332, lr: 1.54e-02, grad_scale: 32.0 2023-06-19 09:07:41,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=318570.0, ans=0.0 2023-06-19 09:08:18,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=318690.0, ans=0.07 2023-06-19 09:08:37,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=318750.0, ans=0.1 2023-06-19 09:09:22,030 INFO [train.py:996] (2/4) Epoch 2, batch 22650, loss[loss=0.2331, simple_loss=0.2884, pruned_loss=0.08894, over 21547.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3335, pruned_loss=0.1082, over 4270097.52 frames. ], batch size: 263, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:09:36,912 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-19 09:09:57,712 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.722e+02 3.113e+02 3.879e+02 6.383e+02, threshold=6.225e+02, percent-clipped=0.0 2023-06-19 09:10:11,573 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:11:29,474 INFO [train.py:996] (2/4) Epoch 2, batch 22700, loss[loss=0.2581, simple_loss=0.3065, pruned_loss=0.1048, over 20075.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3275, pruned_loss=0.1072, over 4270752.77 frames. ], batch size: 703, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:11:41,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=319170.0, ans=0.0 2023-06-19 09:12:06,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=319230.0, ans=0.125 2023-06-19 09:12:19,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=22.5 2023-06-19 09:12:21,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=319290.0, ans=0.0 2023-06-19 09:13:14,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=319410.0, ans=0.2 2023-06-19 09:13:36,603 INFO [train.py:996] (2/4) Epoch 2, batch 22750, loss[loss=0.3164, simple_loss=0.3621, pruned_loss=0.1354, over 21804.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3289, pruned_loss=0.1091, over 4268896.59 frames. ], batch size: 282, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:14:26,229 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.878e+02 3.333e+02 3.990e+02 8.279e+02, threshold=6.666e+02, percent-clipped=1.0 2023-06-19 09:14:51,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=319590.0, ans=0.2 2023-06-19 09:15:23,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.17 vs. 
limit=15.0 2023-06-19 09:15:33,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=319710.0, ans=0.125 2023-06-19 09:15:59,088 INFO [train.py:996] (2/4) Epoch 2, batch 22800, loss[loss=0.2884, simple_loss=0.3379, pruned_loss=0.1195, over 20835.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3335, pruned_loss=0.1116, over 4272136.33 frames. ], batch size: 607, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:16:44,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=319830.0, ans=0.125 2023-06-19 09:17:59,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=320010.0, ans=0.125 2023-06-19 09:18:03,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-19 09:18:06,540 INFO [train.py:996] (2/4) Epoch 2, batch 22850, loss[loss=0.2456, simple_loss=0.3016, pruned_loss=0.09482, over 21299.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3299, pruned_loss=0.1106, over 4274466.09 frames. ], batch size: 131, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:18:14,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2023-06-19 09:18:15,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=320070.0, ans=0.95 2023-06-19 09:18:43,490 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 3.116e+02 3.903e+02 5.067e+02 7.447e+02, threshold=7.805e+02, percent-clipped=3.0 2023-06-19 09:20:33,843 INFO [train.py:996] (2/4) Epoch 2, batch 22900, loss[loss=0.2886, simple_loss=0.386, pruned_loss=0.09562, over 21893.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.335, pruned_loss=0.1108, over 4272885.41 frames. ], batch size: 317, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:20:37,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=320370.0, ans=0.125 2023-06-19 09:20:44,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=320370.0, ans=0.0 2023-06-19 09:20:44,685 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-19 09:21:42,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=320490.0, ans=0.125 2023-06-19 09:21:58,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=320490.0, ans=0.05 2023-06-19 09:22:03,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=320550.0, ans=0.2 2023-06-19 09:22:37,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. 
limit=15.0 2023-06-19 09:22:43,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=320610.0, ans=0.125 2023-06-19 09:23:04,008 INFO [train.py:996] (2/4) Epoch 2, batch 22950, loss[loss=0.2812, simple_loss=0.3547, pruned_loss=0.1038, over 21284.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3454, pruned_loss=0.1071, over 4268328.26 frames. ], batch size: 159, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:23:17,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=320670.0, ans=0.0 2023-06-19 09:23:23,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=320730.0, ans=0.125 2023-06-19 09:23:28,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=320730.0, ans=0.0 2023-06-19 09:23:29,081 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.028e+02 3.874e+02 4.799e+02 7.839e+02, threshold=7.748e+02, percent-clipped=1.0 2023-06-19 09:23:46,181 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-19 09:23:47,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.59 vs. limit=10.0 2023-06-19 09:24:32,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.31 vs. limit=10.0 2023-06-19 09:25:08,585 INFO [train.py:996] (2/4) Epoch 2, batch 23000, loss[loss=0.2687, simple_loss=0.3344, pruned_loss=0.1015, over 21434.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3444, pruned_loss=0.1043, over 4264276.98 frames. ], batch size: 548, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:26:37,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=321090.0, ans=0.0 2023-06-19 09:26:55,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=321150.0, ans=0.125 2023-06-19 09:27:36,532 INFO [train.py:996] (2/4) Epoch 2, batch 23050, loss[loss=0.3257, simple_loss=0.3812, pruned_loss=0.1351, over 21580.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3456, pruned_loss=0.1069, over 4262353.24 frames. ], batch size: 389, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:27:55,682 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.939e+02 3.561e+02 4.565e+02 8.100e+02, threshold=7.122e+02, percent-clipped=1.0 2023-06-19 09:28:13,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321330.0, ans=0.1 2023-06-19 09:28:30,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. 
limit=15.0 2023-06-19 09:29:12,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=321450.0, ans=0.025 2023-06-19 09:29:14,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=321510.0, ans=0.125 2023-06-19 09:29:35,909 INFO [train.py:996] (2/4) Epoch 2, batch 23100, loss[loss=0.2648, simple_loss=0.3073, pruned_loss=0.1112, over 21805.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3404, pruned_loss=0.1072, over 4256049.60 frames. ], batch size: 98, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:30:00,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=321630.0, ans=0.0 2023-06-19 09:30:16,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-19 09:30:41,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=321690.0, ans=0.0 2023-06-19 09:30:42,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-19 09:31:06,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-19 09:32:02,685 INFO [train.py:996] (2/4) Epoch 2, batch 23150, loss[loss=0.276, simple_loss=0.3337, pruned_loss=0.1091, over 21490.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3341, pruned_loss=0.1068, over 4258169.15 frames. ], batch size: 131, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:32:21,385 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.749e+02 3.350e+02 4.019e+02 8.048e+02, threshold=6.700e+02, percent-clipped=1.0 2023-06-19 09:32:24,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=321930.0, ans=0.125 2023-06-19 09:33:24,245 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2023-06-19 09:33:32,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=322050.0, ans=0.125 2023-06-19 09:33:34,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=322050.0, ans=0.0 2023-06-19 09:33:37,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=22.5 2023-06-19 09:34:08,025 INFO [train.py:996] (2/4) Epoch 2, batch 23200, loss[loss=0.287, simple_loss=0.3389, pruned_loss=0.1175, over 21761.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3339, pruned_loss=0.1076, over 4269990.27 frames. ], batch size: 441, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:34:14,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-19 09:34:25,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.58 vs. 
limit=15.0 2023-06-19 09:34:38,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322230.0, ans=0.1 2023-06-19 09:35:16,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-19 09:35:36,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-06-19 09:36:02,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=322410.0, ans=0.125 2023-06-19 09:36:20,041 INFO [train.py:996] (2/4) Epoch 2, batch 23250, loss[loss=0.2712, simple_loss=0.3615, pruned_loss=0.09045, over 19867.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3332, pruned_loss=0.1085, over 4273066.24 frames. ], batch size: 702, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:37:00,561 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.833e+02 3.294e+02 4.025e+02 6.311e+02, threshold=6.588e+02, percent-clipped=0.0 2023-06-19 09:37:39,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.62 vs. limit=12.0 2023-06-19 09:38:15,512 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:38:28,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. limit=6.0 2023-06-19 09:38:34,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=322710.0, ans=0.0 2023-06-19 09:38:55,493 INFO [train.py:996] (2/4) Epoch 2, batch 23300, loss[loss=0.2892, simple_loss=0.3859, pruned_loss=0.09626, over 21783.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3403, pruned_loss=0.1105, over 4280014.61 frames. ], batch size: 282, lr: 1.53e-02, grad_scale: 32.0 2023-06-19 09:40:43,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=323010.0, ans=0.0 2023-06-19 09:40:54,650 INFO [train.py:996] (2/4) Epoch 2, batch 23350, loss[loss=0.21, simple_loss=0.2763, pruned_loss=0.07182, over 21289.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3438, pruned_loss=0.1094, over 4272121.73 frames. 
], batch size: 176, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:41:03,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=323070.0, ans=0.125 2023-06-19 09:41:28,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=323070.0, ans=0.125 2023-06-19 09:41:38,031 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.727e+02 3.213e+02 3.768e+02 7.532e+02, threshold=6.426e+02, percent-clipped=1.0 2023-06-19 09:42:26,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=323190.0, ans=0.0 2023-06-19 09:42:39,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=323250.0, ans=0.125 2023-06-19 09:42:40,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=323250.0, ans=0.2 2023-06-19 09:43:21,907 INFO [train.py:996] (2/4) Epoch 2, batch 23400, loss[loss=0.2491, simple_loss=0.312, pruned_loss=0.09308, over 21829.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3376, pruned_loss=0.1053, over 4274812.97 frames. ], batch size: 247, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:44:41,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=323550.0, ans=0.0 2023-06-19 09:44:43,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=323550.0, ans=0.125 2023-06-19 09:45:37,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.28 vs. limit=6.0 2023-06-19 09:45:40,847 INFO [train.py:996] (2/4) Epoch 2, batch 23450, loss[loss=0.2937, simple_loss=0.3545, pruned_loss=0.1164, over 21927.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3391, pruned_loss=0.1082, over 4280288.25 frames. ], batch size: 316, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:46:09,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.52 vs. limit=22.5 2023-06-19 09:46:22,390 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.895e+02 3.507e+02 4.209e+02 6.661e+02, threshold=7.013e+02, percent-clipped=1.0 2023-06-19 09:46:29,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.80 vs. limit=10.0 2023-06-19 09:46:52,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=323790.0, ans=0.0 2023-06-19 09:47:18,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=323850.0, ans=0.1 2023-06-19 09:47:44,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-19 09:48:05,215 INFO [train.py:996] (2/4) Epoch 2, batch 23500, loss[loss=0.2646, simple_loss=0.326, pruned_loss=0.1016, over 21413.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3409, pruned_loss=0.1111, over 4283377.47 frames. 
], batch size: 211, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:48:56,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=324030.0, ans=0.0 2023-06-19 09:49:03,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=324090.0, ans=0.2 2023-06-19 09:49:19,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=324090.0, ans=0.125 2023-06-19 09:49:35,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.55 vs. limit=6.0 2023-06-19 09:50:09,934 INFO [train.py:996] (2/4) Epoch 2, batch 23550, loss[loss=0.2628, simple_loss=0.3108, pruned_loss=0.1074, over 21793.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3386, pruned_loss=0.1112, over 4277400.45 frames. ], batch size: 351, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:50:24,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=324330.0, ans=0.0 2023-06-19 09:50:32,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.771e+02 3.237e+02 3.882e+02 7.021e+02, threshold=6.473e+02, percent-clipped=1.0 2023-06-19 09:50:33,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.15 vs. limit=15.0 2023-06-19 09:50:41,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=324330.0, ans=0.125 2023-06-19 09:51:07,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=324390.0, ans=0.125 2023-06-19 09:51:20,000 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:51:56,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=324510.0, ans=0.1 2023-06-19 09:52:08,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=324510.0, ans=0.1 2023-06-19 09:52:12,580 INFO [train.py:996] (2/4) Epoch 2, batch 23600, loss[loss=0.3145, simple_loss=0.3716, pruned_loss=0.1287, over 21574.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.337, pruned_loss=0.1106, over 4269975.18 frames. ], batch size: 414, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:52:30,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=324570.0, ans=0.025 2023-06-19 09:52:38,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=324570.0, ans=0.125 2023-06-19 09:52:56,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=324630.0, ans=0.04949747468305833 2023-06-19 09:53:01,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-19 09:54:36,390 INFO [train.py:996] (2/4) Epoch 2, batch 23650, loss[loss=0.2595, simple_loss=0.3299, pruned_loss=0.09462, over 21782.00 frames. 
], tot_loss[loss=0.2753, simple_loss=0.3351, pruned_loss=0.1078, over 4261331.30 frames. ], batch size: 247, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:55:20,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=324930.0, ans=0.2 2023-06-19 09:55:30,399 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.894e+02 3.663e+02 5.031e+02 1.050e+03, threshold=7.326e+02, percent-clipped=9.0 2023-06-19 09:55:44,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-19 09:55:47,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=324990.0, ans=0.0 2023-06-19 09:56:13,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=325050.0, ans=0.125 2023-06-19 09:57:18,212 INFO [train.py:996] (2/4) Epoch 2, batch 23700, loss[loss=0.3197, simple_loss=0.372, pruned_loss=0.1337, over 21430.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3359, pruned_loss=0.1049, over 4262054.76 frames. ], batch size: 471, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 09:57:22,425 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-19 09:58:13,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=325230.0, ans=0.1 2023-06-19 09:58:36,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=325290.0, ans=0.0 2023-06-19 09:58:49,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=325350.0, ans=0.0 2023-06-19 09:58:52,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=325350.0, ans=0.125 2023-06-19 09:58:54,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=325410.0, ans=0.07 2023-06-19 09:59:30,270 INFO [train.py:996] (2/4) Epoch 2, batch 23750, loss[loss=0.2187, simple_loss=0.3161, pruned_loss=0.06067, over 21649.00 frames. ], tot_loss[loss=0.277, simple_loss=0.34, pruned_loss=0.107, over 4266582.82 frames. 
], batch size: 263, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:00:09,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=325530.0, ans=0.125 2023-06-19 10:00:12,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=325530.0, ans=0.125 2023-06-19 10:00:13,666 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.910e+02 3.446e+02 4.097e+02 6.175e+02, threshold=6.892e+02, percent-clipped=0.0 2023-06-19 10:00:15,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=325530.0, ans=0.125 2023-06-19 10:00:21,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=325530.0, ans=0.125 2023-06-19 10:01:33,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=325710.0, ans=0.125 2023-06-19 10:01:52,708 INFO [train.py:996] (2/4) Epoch 2, batch 23800, loss[loss=0.3257, simple_loss=0.4003, pruned_loss=0.1255, over 21694.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.338, pruned_loss=0.1045, over 4268864.79 frames. ], batch size: 332, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:01:57,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=325770.0, ans=0.125 2023-06-19 10:03:51,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=326010.0, ans=0.0 2023-06-19 10:03:58,554 INFO [train.py:996] (2/4) Epoch 2, batch 23850, loss[loss=0.306, simple_loss=0.3678, pruned_loss=0.1221, over 21729.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3493, pruned_loss=0.1081, over 4269566.31 frames. ], batch size: 298, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:04:41,598 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 2.956e+02 3.514e+02 4.313e+02 7.177e+02, threshold=7.028e+02, percent-clipped=1.0 2023-06-19 10:04:57,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=326130.0, ans=0.125 2023-06-19 10:06:01,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=326310.0, ans=0.125 2023-06-19 10:06:03,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=326310.0, ans=0.0 2023-06-19 10:06:20,017 INFO [train.py:996] (2/4) Epoch 2, batch 23900, loss[loss=0.3645, simple_loss=0.4266, pruned_loss=0.1512, over 21464.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3566, pruned_loss=0.1107, over 4274799.06 frames. ], batch size: 471, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:08:27,711 INFO [train.py:996] (2/4) Epoch 2, batch 23950, loss[loss=0.2913, simple_loss=0.3608, pruned_loss=0.1109, over 21329.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3491, pruned_loss=0.1098, over 4269179.97 frames. ], batch size: 131, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:08:42,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. 
limit=6.0 2023-06-19 10:09:19,308 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.788e+02 3.140e+02 3.660e+02 5.549e+02, threshold=6.280e+02, percent-clipped=0.0 2023-06-19 10:09:19,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=326730.0, ans=0.2 2023-06-19 10:09:21,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=326730.0, ans=0.0 2023-06-19 10:09:40,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=326790.0, ans=0.05 2023-06-19 10:10:45,251 INFO [train.py:996] (2/4) Epoch 2, batch 24000, loss[loss=0.3314, simple_loss=0.4001, pruned_loss=0.1314, over 21803.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3508, pruned_loss=0.1137, over 4267227.96 frames. ], batch size: 118, lr: 1.52e-02, grad_scale: 32.0 2023-06-19 10:10:45,251 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 10:11:27,285 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.8095, 2.0164, 4.1745, 2.0503], device='cuda:2') 2023-06-19 10:11:36,138 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2838, simple_loss=0.3817, pruned_loss=0.09297, over 1796401.00 frames. 2023-06-19 10:11:36,139 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 10:12:15,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=327030.0, ans=0.125 2023-06-19 10:12:16,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=327090.0, ans=0.2 2023-06-19 10:12:29,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=327090.0, ans=0.2 2023-06-19 10:12:45,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=22.5 2023-06-19 10:13:02,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.77 vs. limit=22.5 2023-06-19 10:13:39,997 INFO [train.py:996] (2/4) Epoch 2, batch 24050, loss[loss=0.2136, simple_loss=0.2939, pruned_loss=0.0666, over 21360.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3523, pruned_loss=0.1134, over 4271442.45 frames. ], batch size: 131, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:13:54,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.04 vs. 
limit=15.0 2023-06-19 10:14:14,499 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 2.893e+02 3.313e+02 4.221e+02 6.916e+02, threshold=6.626e+02, percent-clipped=2.0 2023-06-19 10:15:01,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=327450.0, ans=0.2 2023-06-19 10:15:20,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=327450.0, ans=0.125 2023-06-19 10:15:21,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=327510.0, ans=0.0 2023-06-19 10:15:58,854 INFO [train.py:996] (2/4) Epoch 2, batch 24100, loss[loss=0.289, simple_loss=0.353, pruned_loss=0.1126, over 21405.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3512, pruned_loss=0.1101, over 4277723.34 frames. ], batch size: 211, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:16:34,348 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:16:43,705 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-19 10:18:09,448 INFO [train.py:996] (2/4) Epoch 2, batch 24150, loss[loss=0.2892, simple_loss=0.3429, pruned_loss=0.1178, over 21887.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3516, pruned_loss=0.1126, over 4282940.30 frames. ], batch size: 371, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:18:48,472 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.793e+02 3.020e+02 3.796e+02 7.586e+02, threshold=6.040e+02, percent-clipped=1.0 2023-06-19 10:18:55,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.82 vs. limit=15.0 2023-06-19 10:19:34,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=327990.0, ans=0.0 2023-06-19 10:19:49,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=328050.0, ans=0.0 2023-06-19 10:20:31,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=328110.0, ans=0.125 2023-06-19 10:20:39,210 INFO [train.py:996] (2/4) Epoch 2, batch 24200, loss[loss=0.2568, simple_loss=0.3383, pruned_loss=0.08769, over 19930.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.3532, pruned_loss=0.1143, over 4284567.63 frames. ], batch size: 703, lr: 1.51e-02, grad_scale: 16.0 2023-06-19 10:21:10,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=328230.0, ans=0.125 2023-06-19 10:21:30,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=328290.0, ans=0.0 2023-06-19 10:21:34,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=328290.0, ans=0.1 2023-06-19 10:22:59,857 INFO [train.py:996] (2/4) Epoch 2, batch 24250, loss[loss=0.2315, simple_loss=0.3122, pruned_loss=0.07546, over 21320.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3474, pruned_loss=0.1064, over 4276916.75 frames. 
], batch size: 176, lr: 1.51e-02, grad_scale: 16.0 2023-06-19 10:23:00,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=328470.0, ans=0.2 2023-06-19 10:23:44,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.606e+02 2.962e+02 3.428e+02 4.906e+02, threshold=5.923e+02, percent-clipped=0.0 2023-06-19 10:23:59,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=328590.0, ans=0.125 2023-06-19 10:24:26,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=1.99 vs. limit=12.0 2023-06-19 10:25:15,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-19 10:25:23,882 INFO [train.py:996] (2/4) Epoch 2, batch 24300, loss[loss=0.1597, simple_loss=0.2381, pruned_loss=0.04067, over 21534.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3373, pruned_loss=0.09859, over 4273560.75 frames. ], batch size: 212, lr: 1.51e-02, grad_scale: 16.0 2023-06-19 10:25:33,262 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:26:00,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=328830.0, ans=0.5 2023-06-19 10:26:29,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=328890.0, ans=0.0 2023-06-19 10:26:36,716 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:27:14,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=329010.0, ans=0.125 2023-06-19 10:27:28,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.04 vs. limit=15.0 2023-06-19 10:27:36,328 INFO [train.py:996] (2/4) Epoch 2, batch 24350, loss[loss=0.4013, simple_loss=0.427, pruned_loss=0.1878, over 21515.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3347, pruned_loss=0.09973, over 4282036.18 frames. 
], batch size: 508, lr: 1.51e-02, grad_scale: 16.0 2023-06-19 10:27:54,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=329070.0, ans=0.125 2023-06-19 10:28:15,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.020e+02 2.750e+02 3.153e+02 3.822e+02 5.866e+02, threshold=6.306e+02, percent-clipped=0.0 2023-06-19 10:28:16,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=329130.0, ans=0.5 2023-06-19 10:28:18,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=329130.0, ans=0.125 2023-06-19 10:28:37,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=329190.0, ans=0.125 2023-06-19 10:28:47,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329190.0, ans=0.1 2023-06-19 10:29:45,930 INFO [train.py:996] (2/4) Epoch 2, batch 24400, loss[loss=0.29, simple_loss=0.3588, pruned_loss=0.1106, over 21747.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3423, pruned_loss=0.1048, over 4278382.38 frames. ], batch size: 298, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:30:00,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=329370.0, ans=0.125 2023-06-19 10:30:12,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.60 vs. limit=22.5 2023-06-19 10:30:16,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329430.0, ans=0.1 2023-06-19 10:30:44,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329430.0, ans=0.1 2023-06-19 10:31:35,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=329610.0, ans=0.1 2023-06-19 10:31:37,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=15.0 2023-06-19 10:31:39,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-06-19 10:31:55,270 INFO [train.py:996] (2/4) Epoch 2, batch 24450, loss[loss=0.2324, simple_loss=0.3029, pruned_loss=0.08099, over 21155.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3415, pruned_loss=0.1056, over 4269999.61 frames. ], batch size: 143, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:32:19,485 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:32:36,225 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.674e+02 3.123e+02 3.615e+02 7.708e+02, threshold=6.247e+02, percent-clipped=3.0 2023-06-19 10:33:51,705 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:34:09,713 INFO [train.py:996] (2/4) Epoch 2, batch 24500, loss[loss=0.2636, simple_loss=0.3279, pruned_loss=0.09968, over 21670.00 frames. 
], tot_loss[loss=0.2764, simple_loss=0.3424, pruned_loss=0.1052, over 4275165.93 frames. ], batch size: 263, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:34:26,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329970.0, ans=0.1 2023-06-19 10:34:56,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=330030.0, ans=0.125 2023-06-19 10:35:35,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=330090.0, ans=0.125 2023-06-19 10:36:32,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=330210.0, ans=0.0 2023-06-19 10:36:35,175 INFO [train.py:996] (2/4) Epoch 2, batch 24550, loss[loss=0.3121, simple_loss=0.3723, pruned_loss=0.1259, over 21610.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3467, pruned_loss=0.1089, over 4278731.73 frames. ], batch size: 389, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:36:55,709 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:36:57,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=330330.0, ans=0.5 2023-06-19 10:37:13,330 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.779e+02 3.400e+02 4.076e+02 5.829e+02, threshold=6.799e+02, percent-clipped=0.0 2023-06-19 10:38:43,063 INFO [train.py:996] (2/4) Epoch 2, batch 24600, loss[loss=0.2347, simple_loss=0.2991, pruned_loss=0.08514, over 21694.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3421, pruned_loss=0.1093, over 4282139.29 frames. ], batch size: 298, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:38:57,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=330570.0, ans=0.2 2023-06-19 10:38:59,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=330570.0, ans=0.125 2023-06-19 10:39:44,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=330690.0, ans=0.2 2023-06-19 10:40:46,682 INFO [train.py:996] (2/4) Epoch 2, batch 24650, loss[loss=0.2798, simple_loss=0.3188, pruned_loss=0.1205, over 21475.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3349, pruned_loss=0.1083, over 4276501.60 frames. ], batch size: 441, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:41:26,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.15 vs. limit=6.0 2023-06-19 10:41:27,290 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.833e+02 3.466e+02 4.292e+02 5.874e+02, threshold=6.932e+02, percent-clipped=0.0 2023-06-19 10:41:27,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=330930.0, ans=0.2 2023-06-19 10:41:51,969 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. 
limit=15.0 2023-06-19 10:42:58,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2023-06-19 10:43:00,292 INFO [train.py:996] (2/4) Epoch 2, batch 24700, loss[loss=0.2926, simple_loss=0.3348, pruned_loss=0.1252, over 21258.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3338, pruned_loss=0.1069, over 4266656.32 frames. ], batch size: 471, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:43:03,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=331170.0, ans=0.2 2023-06-19 10:44:45,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=331410.0, ans=0.09899494936611666 2023-06-19 10:45:02,939 INFO [train.py:996] (2/4) Epoch 2, batch 24750, loss[loss=0.2348, simple_loss=0.2931, pruned_loss=0.08825, over 21683.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3251, pruned_loss=0.1026, over 4269266.15 frames. ], batch size: 333, lr: 1.51e-02, grad_scale: 32.0 2023-06-19 10:45:40,270 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.432e+02 2.874e+02 3.533e+02 8.125e+02, threshold=5.749e+02, percent-clipped=3.0 2023-06-19 10:46:00,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-06-19 10:47:08,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=331770.0, ans=0.2 2023-06-19 10:47:10,080 INFO [train.py:996] (2/4) Epoch 2, batch 24800, loss[loss=0.2666, simple_loss=0.3202, pruned_loss=0.1066, over 21805.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3226, pruned_loss=0.1028, over 4276473.39 frames. ], batch size: 298, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:48:01,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.08 vs. limit=6.0 2023-06-19 10:48:27,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=331890.0, ans=0.125 2023-06-19 10:49:33,406 INFO [train.py:996] (2/4) Epoch 2, batch 24850, loss[loss=0.2944, simple_loss=0.3628, pruned_loss=0.1129, over 21328.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3242, pruned_loss=0.1039, over 4276609.00 frames. ], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:50:02,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=332130.0, ans=0.125 2023-06-19 10:50:13,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.964e+02 3.527e+02 4.123e+02 6.284e+02, threshold=7.055e+02, percent-clipped=5.0 2023-06-19 10:51:55,285 INFO [train.py:996] (2/4) Epoch 2, batch 24900, loss[loss=0.3069, simple_loss=0.3664, pruned_loss=0.1236, over 21831.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3262, pruned_loss=0.1048, over 4273267.44 frames. ], batch size: 282, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:52:53,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.04 vs. 
limit=12.0 2023-06-19 10:53:04,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=332490.0, ans=0.125 2023-06-19 10:53:07,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=332490.0, ans=0.125 2023-06-19 10:53:33,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=332550.0, ans=0.125 2023-06-19 10:53:33,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=332550.0, ans=0.02 2023-06-19 10:54:13,019 INFO [train.py:996] (2/4) Epoch 2, batch 24950, loss[loss=0.2962, simple_loss=0.3466, pruned_loss=0.1229, over 21626.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.335, pruned_loss=0.1093, over 4273648.43 frames. ], batch size: 263, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:54:54,099 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.083e+02 3.929e+02 5.227e+02 8.953e+02, threshold=7.858e+02, percent-clipped=7.0 2023-06-19 10:55:34,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332850.0, ans=0.1 2023-06-19 10:56:37,135 INFO [train.py:996] (2/4) Epoch 2, batch 25000, loss[loss=0.2723, simple_loss=0.3374, pruned_loss=0.1036, over 21742.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3395, pruned_loss=0.1105, over 4270955.58 frames. ], batch size: 351, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:57:14,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=333030.0, ans=0.125 2023-06-19 10:57:17,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.15 vs. limit=15.0 2023-06-19 10:57:24,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=333090.0, ans=0.125 2023-06-19 10:57:25,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333090.0, ans=0.1 2023-06-19 10:57:35,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=333090.0, ans=0.0 2023-06-19 10:58:31,335 INFO [train.py:996] (2/4) Epoch 2, batch 25050, loss[loss=0.2444, simple_loss=0.2961, pruned_loss=0.0963, over 21353.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3326, pruned_loss=0.1087, over 4266362.55 frames. 
], batch size: 160, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 10:58:43,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=333270.0, ans=0.0 2023-06-19 10:59:09,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.713e+02 2.983e+02 3.396e+02 5.843e+02, threshold=5.966e+02, percent-clipped=0.0 2023-06-19 10:59:11,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333330.0, ans=0.1 2023-06-19 11:00:01,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=333510.0, ans=0.125 2023-06-19 11:00:30,859 INFO [train.py:996] (2/4) Epoch 2, batch 25100, loss[loss=0.2606, simple_loss=0.3349, pruned_loss=0.0931, over 21638.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3266, pruned_loss=0.107, over 4266541.69 frames. ], batch size: 332, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:00:31,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=333570.0, ans=0.125 2023-06-19 11:00:42,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=333570.0, ans=0.0 2023-06-19 11:02:31,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.55 vs. limit=10.0 2023-06-19 11:02:36,048 INFO [train.py:996] (2/4) Epoch 2, batch 25150, loss[loss=0.2714, simple_loss=0.3396, pruned_loss=0.1016, over 21901.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.328, pruned_loss=0.1035, over 4256878.07 frames. ], batch size: 316, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:02:48,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=333870.0, ans=0.125 2023-06-19 11:03:12,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 2.528e+02 3.068e+02 3.908e+02 5.226e+02, threshold=6.135e+02, percent-clipped=0.0 2023-06-19 11:03:16,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=333930.0, ans=0.125 2023-06-19 11:03:34,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=333990.0, ans=0.2 2023-06-19 11:04:06,333 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-19 11:04:47,125 INFO [train.py:996] (2/4) Epoch 2, batch 25200, loss[loss=0.2252, simple_loss=0.3086, pruned_loss=0.07085, over 21469.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3271, pruned_loss=0.1008, over 4254027.97 frames. ], batch size: 211, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:06:47,742 INFO [train.py:996] (2/4) Epoch 2, batch 25250, loss[loss=0.2373, simple_loss=0.282, pruned_loss=0.09633, over 21188.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3252, pruned_loss=0.09936, over 4252643.49 frames. 
], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:07:20,669 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.794e+02 3.264e+02 4.171e+02 6.058e+02, threshold=6.527e+02, percent-clipped=0.0 2023-06-19 11:08:20,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.58 vs. limit=6.0 2023-06-19 11:09:01,287 INFO [train.py:996] (2/4) Epoch 2, batch 25300, loss[loss=0.3288, simple_loss=0.385, pruned_loss=0.1363, over 21622.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3235, pruned_loss=0.09971, over 4244686.01 frames. ], batch size: 389, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:10:10,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=334890.0, ans=0.125 2023-06-19 11:10:11,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=334890.0, ans=0.1 2023-06-19 11:10:22,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=334950.0, ans=0.125 2023-06-19 11:10:37,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=334950.0, ans=0.2 2023-06-19 11:11:14,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=335010.0, ans=0.025 2023-06-19 11:11:19,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=335070.0, ans=0.0 2023-06-19 11:11:20,620 INFO [train.py:996] (2/4) Epoch 2, batch 25350, loss[loss=0.2565, simple_loss=0.3215, pruned_loss=0.0958, over 21528.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3264, pruned_loss=0.09974, over 4253929.28 frames. ], batch size: 441, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:11:21,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=335070.0, ans=0.125 2023-06-19 11:11:34,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=335070.0, ans=0.125 2023-06-19 11:11:52,963 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.646e+02 3.240e+02 3.926e+02 6.087e+02, threshold=6.480e+02, percent-clipped=0.0 2023-06-19 11:12:12,742 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.27 vs. limit=6.0 2023-06-19 11:13:03,158 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:13:11,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335310.0, ans=0.1 2023-06-19 11:13:19,495 INFO [train.py:996] (2/4) Epoch 2, batch 25400, loss[loss=0.2641, simple_loss=0.3208, pruned_loss=0.1037, over 21699.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3216, pruned_loss=0.09865, over 4248105.64 frames. 
], batch size: 282, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:13:22,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=335370.0, ans=0.5 2023-06-19 11:14:21,068 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:15:31,544 INFO [train.py:996] (2/4) Epoch 2, batch 25450, loss[loss=0.2911, simple_loss=0.3723, pruned_loss=0.105, over 21695.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3238, pruned_loss=0.1007, over 4245257.51 frames. ], batch size: 441, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:15:43,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=335670.0, ans=0.04949747468305833 2023-06-19 11:16:04,829 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.784e+02 3.356e+02 3.897e+02 6.428e+02, threshold=6.713e+02, percent-clipped=0.0 2023-06-19 11:16:49,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.27 vs. limit=15.0 2023-06-19 11:17:52,956 INFO [train.py:996] (2/4) Epoch 2, batch 25500, loss[loss=0.2626, simple_loss=0.3503, pruned_loss=0.08747, over 21499.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3227, pruned_loss=0.09625, over 4237806.85 frames. ], batch size: 471, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:17:58,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=335970.0, ans=0.125 2023-06-19 11:18:17,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=336030.0, ans=0.04949747468305833 2023-06-19 11:18:19,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-19 11:20:10,338 INFO [train.py:996] (2/4) Epoch 2, batch 25550, loss[loss=0.2801, simple_loss=0.3819, pruned_loss=0.0891, over 21631.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.331, pruned_loss=0.09774, over 4237128.59 frames. ], batch size: 414, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:20:12,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=336270.0, ans=0.0 2023-06-19 11:20:12,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.65 vs. 
limit=15.0 2023-06-19 11:20:36,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=336330.0, ans=0.0 2023-06-19 11:20:46,163 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.372e+02 2.803e+02 3.273e+02 5.075e+02, threshold=5.607e+02, percent-clipped=0.0 2023-06-19 11:21:12,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=336330.0, ans=0.07 2023-06-19 11:21:37,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=336390.0, ans=0.0 2023-06-19 11:21:37,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=336390.0, ans=0.05 2023-06-19 11:22:12,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=336510.0, ans=0.2 2023-06-19 11:22:43,077 INFO [train.py:996] (2/4) Epoch 2, batch 25600, loss[loss=0.2772, simple_loss=0.3439, pruned_loss=0.1053, over 21838.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3354, pruned_loss=0.09889, over 4253349.47 frames. ], batch size: 282, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:22:55,073 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:23:00,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=336570.0, ans=0.0 2023-06-19 11:23:13,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=336630.0, ans=0.125 2023-06-19 11:23:43,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=336690.0, ans=0.1 2023-06-19 11:23:50,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-19 11:23:50,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=336750.0, ans=15.0 2023-06-19 11:24:31,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=336810.0, ans=0.125 2023-06-19 11:24:46,824 INFO [train.py:996] (2/4) Epoch 2, batch 25650, loss[loss=0.2611, simple_loss=0.3192, pruned_loss=0.1015, over 21433.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.339, pruned_loss=0.1032, over 4250809.31 frames. 
], batch size: 131, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:25:03,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=336870.0, ans=0.025 2023-06-19 11:25:13,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.891e+02 3.339e+02 4.051e+02 5.669e+02, threshold=6.678e+02, percent-clipped=1.0 2023-06-19 11:25:27,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=336990.0, ans=0.0 2023-06-19 11:26:04,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=337050.0, ans=0.025 2023-06-19 11:26:09,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-06-19 11:26:45,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=337110.0, ans=0.0 2023-06-19 11:26:48,353 INFO [train.py:996] (2/4) Epoch 2, batch 25700, loss[loss=0.247, simple_loss=0.3272, pruned_loss=0.08344, over 21387.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3349, pruned_loss=0.1044, over 4238329.15 frames. ], batch size: 194, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:28:44,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=337350.0, ans=0.2 2023-06-19 11:29:14,403 INFO [train.py:996] (2/4) Epoch 2, batch 25750, loss[loss=0.3386, simple_loss=0.3807, pruned_loss=0.1482, over 21786.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3405, pruned_loss=0.1077, over 4246103.77 frames. ], batch size: 441, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:29:29,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=337470.0, ans=0.125 2023-06-19 11:30:06,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=337530.0, ans=0.0 2023-06-19 11:30:07,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.884e+02 3.379e+02 4.155e+02 6.943e+02, threshold=6.757e+02, percent-clipped=1.0 2023-06-19 11:30:31,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337590.0, ans=0.1 2023-06-19 11:31:21,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=337650.0, ans=0.2 2023-06-19 11:31:33,482 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:31:43,401 INFO [train.py:996] (2/4) Epoch 2, batch 25800, loss[loss=0.3357, simple_loss=0.4103, pruned_loss=0.1305, over 21419.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3537, pruned_loss=0.1125, over 4253070.77 frames. 
], batch size: 131, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:32:21,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=337770.0, ans=0.2 2023-06-19 11:32:26,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=337830.0, ans=0.125 2023-06-19 11:34:10,794 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:34:12,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=338010.0, ans=0.125 2023-06-19 11:34:19,984 INFO [train.py:996] (2/4) Epoch 2, batch 25850, loss[loss=0.2687, simple_loss=0.3296, pruned_loss=0.1039, over 21686.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3544, pruned_loss=0.1114, over 4258856.37 frames. ], batch size: 230, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:34:53,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=338070.0, ans=0.0 2023-06-19 11:35:07,691 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.249e+02 2.822e+02 3.088e+02 3.805e+02 7.433e+02, threshold=6.176e+02, percent-clipped=1.0 2023-06-19 11:35:26,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=338190.0, ans=0.2 2023-06-19 11:35:34,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=338190.0, ans=0.125 2023-06-19 11:35:42,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-19 11:36:46,146 INFO [train.py:996] (2/4) Epoch 2, batch 25900, loss[loss=0.2911, simple_loss=0.3674, pruned_loss=0.1074, over 21424.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3556, pruned_loss=0.1126, over 4263278.86 frames. ], batch size: 211, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:37:00,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=338370.0, ans=0.125 2023-06-19 11:37:14,383 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:37:24,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=338430.0, ans=0.125 2023-06-19 11:37:26,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=338430.0, ans=0.0 2023-06-19 11:37:32,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=338430.0, ans=0.2 2023-06-19 11:37:52,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=338490.0, ans=0.0 2023-06-19 11:37:55,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=338490.0, ans=0.125 2023-06-19 11:39:03,311 INFO [train.py:996] (2/4) Epoch 2, batch 25950, loss[loss=0.3272, simple_loss=0.3851, pruned_loss=0.1347, over 21612.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3626, pruned_loss=0.1158, over 4271487.06 frames. 
], batch size: 415, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:39:16,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=338670.0, ans=0.2 2023-06-19 11:39:36,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.799e+02 3.323e+02 3.883e+02 7.298e+02, threshold=6.645e+02, percent-clipped=1.0 2023-06-19 11:39:43,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=338790.0, ans=0.0 2023-06-19 11:40:05,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=338790.0, ans=0.125 2023-06-19 11:40:21,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=338850.0, ans=0.2 2023-06-19 11:40:24,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=338850.0, ans=0.07 2023-06-19 11:40:24,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=338850.0, ans=0.125 2023-06-19 11:41:14,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=338910.0, ans=0.0 2023-06-19 11:41:30,341 INFO [train.py:996] (2/4) Epoch 2, batch 26000, loss[loss=0.2754, simple_loss=0.3601, pruned_loss=0.09536, over 21414.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3598, pruned_loss=0.1127, over 4268179.40 frames. ], batch size: 211, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:41:34,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-19 11:42:33,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-06-19 11:42:57,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=339150.0, ans=0.2 2023-06-19 11:43:47,048 INFO [train.py:996] (2/4) Epoch 2, batch 26050, loss[loss=0.2445, simple_loss=0.2975, pruned_loss=0.09577, over 21061.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3604, pruned_loss=0.1155, over 4271755.83 frames. ], batch size: 608, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:44:17,720 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.721e+02 3.311e+02 3.972e+02 6.601e+02, threshold=6.622e+02, percent-clipped=0.0 2023-06-19 11:44:50,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.02 vs. limit=15.0 2023-06-19 11:45:55,648 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:46:04,717 INFO [train.py:996] (2/4) Epoch 2, batch 26100, loss[loss=0.2554, simple_loss=0.312, pruned_loss=0.09941, over 21912.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3535, pruned_loss=0.1143, over 4279975.30 frames. ], batch size: 283, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:46:29,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=15.0 2023-06-19 11:47:53,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=339750.0, ans=0.125 2023-06-19 11:48:33,526 INFO [train.py:996] (2/4) Epoch 2, batch 26150, loss[loss=0.29, simple_loss=0.3599, pruned_loss=0.11, over 21312.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3529, pruned_loss=0.1147, over 4276630.90 frames. ], batch size: 143, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:48:54,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-19 11:49:14,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.871e+02 3.280e+02 4.154e+02 8.858e+02, threshold=6.561e+02, percent-clipped=1.0 2023-06-19 11:49:19,491 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-19 11:49:52,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=339990.0, ans=0.1 2023-06-19 11:49:58,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=339990.0, ans=0.0 2023-06-19 11:50:11,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0 2023-06-19 11:50:37,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.16 vs. limit=12.0 2023-06-19 11:50:47,962 INFO [train.py:996] (2/4) Epoch 2, batch 26200, loss[loss=0.2742, simple_loss=0.3618, pruned_loss=0.09334, over 21859.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.353, pruned_loss=0.1117, over 4275398.98 frames. ], batch size: 316, lr: 1.49e-02, grad_scale: 64.0 2023-06-19 11:51:24,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.68 vs. limit=6.0 2023-06-19 11:52:07,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=340290.0, ans=0.025 2023-06-19 11:53:09,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=340410.0, ans=0.125 2023-06-19 11:53:25,759 INFO [train.py:996] (2/4) Epoch 2, batch 26250, loss[loss=0.2735, simple_loss=0.3284, pruned_loss=0.1093, over 21363.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3548, pruned_loss=0.1101, over 4282711.41 frames. ], batch size: 159, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:53:51,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=340530.0, ans=0.125 2023-06-19 11:54:03,520 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.625e+02 2.924e+02 3.538e+02 7.396e+02, threshold=5.848e+02, percent-clipped=2.0 2023-06-19 11:54:11,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. 
limit=15.0 2023-06-19 11:55:11,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=340650.0, ans=0.125 2023-06-19 11:55:14,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=340710.0, ans=0.0 2023-06-19 11:55:30,359 INFO [train.py:996] (2/4) Epoch 2, batch 26300, loss[loss=0.2808, simple_loss=0.3241, pruned_loss=0.1187, over 21608.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3516, pruned_loss=0.111, over 4290876.37 frames. ], batch size: 548, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:56:56,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=340890.0, ans=0.2 2023-06-19 11:57:10,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=340950.0, ans=0.125 2023-06-19 11:57:27,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340950.0, ans=0.1 2023-06-19 11:58:03,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=341010.0, ans=0.125 2023-06-19 11:58:06,958 INFO [train.py:996] (2/4) Epoch 2, batch 26350, loss[loss=0.3048, simple_loss=0.3651, pruned_loss=0.1223, over 21470.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3506, pruned_loss=0.1119, over 4293776.05 frames. ], batch size: 131, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 11:58:57,174 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.087e+02 3.544e+02 4.218e+02 8.701e+02, threshold=7.087e+02, percent-clipped=9.0 2023-06-19 11:58:57,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=341130.0, ans=0.125 2023-06-19 11:59:16,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=341190.0, ans=0.09899494936611666 2023-06-19 11:59:32,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=341250.0, ans=0.125 2023-06-19 12:00:00,727 INFO [train.py:996] (2/4) Epoch 2, batch 26400, loss[loss=0.2784, simple_loss=0.3192, pruned_loss=0.1188, over 21794.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3443, pruned_loss=0.1112, over 4288888.35 frames. ], batch size: 372, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:00:14,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=341370.0, ans=0.125 2023-06-19 12:01:25,250 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:01:51,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=341550.0, ans=0.0 2023-06-19 12:02:04,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.04 vs. limit=12.0 2023-06-19 12:02:30,915 INFO [train.py:996] (2/4) Epoch 2, batch 26450, loss[loss=0.4101, simple_loss=0.4775, pruned_loss=0.1713, over 21431.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3442, pruned_loss=0.1114, over 4284791.86 frames. 
], batch size: 507, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:03:21,327 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.970e+02 3.486e+02 4.809e+02 7.983e+02, threshold=6.973e+02, percent-clipped=4.0 2023-06-19 12:03:28,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=341730.0, ans=0.09899494936611666 2023-06-19 12:03:30,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=341730.0, ans=0.125 2023-06-19 12:03:44,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=341790.0, ans=0.125 2023-06-19 12:03:56,544 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:03:59,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=341850.0, ans=0.125 2023-06-19 12:04:17,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=341850.0, ans=0.04949747468305833 2023-06-19 12:05:04,480 INFO [train.py:996] (2/4) Epoch 2, batch 26500, loss[loss=0.1593, simple_loss=0.1997, pruned_loss=0.05943, over 16337.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3452, pruned_loss=0.1099, over 4272118.34 frames. ], batch size: 60, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:05:21,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=341970.0, ans=0.0 2023-06-19 12:05:51,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=342090.0, ans=0.1 2023-06-19 12:05:55,565 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-19 12:07:39,687 INFO [train.py:996] (2/4) Epoch 2, batch 26550, loss[loss=0.2118, simple_loss=0.2721, pruned_loss=0.0758, over 21225.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3393, pruned_loss=0.1046, over 4262324.31 frames. ], batch size: 159, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:07:52,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=342270.0, ans=0.2 2023-06-19 12:07:52,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.45 vs. limit=15.0 2023-06-19 12:07:53,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=342270.0, ans=0.125 2023-06-19 12:08:27,567 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.765e+02 3.396e+02 4.650e+02 6.569e+02, threshold=6.792e+02, percent-clipped=0.0 2023-06-19 12:08:42,178 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-19 12:09:46,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=342510.0, ans=0.0 2023-06-19 12:09:50,675 INFO [train.py:996] (2/4) Epoch 2, batch 26600, loss[loss=0.2461, simple_loss=0.3069, pruned_loss=0.09268, over 21272.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.338, pruned_loss=0.1013, over 4266080.46 frames. 
], batch size: 131, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:10:16,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=342570.0, ans=0.1 2023-06-19 12:10:53,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=342690.0, ans=0.1 2023-06-19 12:10:55,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-19 12:12:01,204 INFO [train.py:996] (2/4) Epoch 2, batch 26650, loss[loss=0.2358, simple_loss=0.312, pruned_loss=0.07983, over 21522.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3315, pruned_loss=0.1002, over 4269710.55 frames. ], batch size: 441, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:12:03,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=342870.0, ans=0.025 2023-06-19 12:12:47,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.83 vs. limit=22.5 2023-06-19 12:12:49,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.521e+02 2.956e+02 3.686e+02 6.672e+02, threshold=5.912e+02, percent-clipped=0.0 2023-06-19 12:13:42,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-19 12:14:18,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=343110.0, ans=0.0 2023-06-19 12:14:24,018 INFO [train.py:996] (2/4) Epoch 2, batch 26700, loss[loss=0.3416, simple_loss=0.368, pruned_loss=0.1576, over 21745.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3252, pruned_loss=0.09729, over 4269006.10 frames. ], batch size: 508, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:14:28,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=343170.0, ans=0.5 2023-06-19 12:15:40,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.55 vs. limit=15.0 2023-06-19 12:15:41,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=343290.0, ans=0.125 2023-06-19 12:15:52,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-19 12:16:37,960 INFO [train.py:996] (2/4) Epoch 2, batch 26750, loss[loss=0.3207, simple_loss=0.3901, pruned_loss=0.1257, over 21812.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3257, pruned_loss=0.09607, over 4278134.39 frames. 
], batch size: 124, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:17:11,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=343530.0, ans=0.0 2023-06-19 12:17:15,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=343530.0, ans=0.1 2023-06-19 12:17:33,074 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 2.465e+02 2.855e+02 3.693e+02 7.404e+02, threshold=5.711e+02, percent-clipped=5.0 2023-06-19 12:17:41,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-19 12:18:11,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-19 12:18:19,384 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:18:58,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=343710.0, ans=0.125 2023-06-19 12:19:17,207 INFO [train.py:996] (2/4) Epoch 2, batch 26800, loss[loss=0.2684, simple_loss=0.3383, pruned_loss=0.09923, over 19984.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3342, pruned_loss=0.1021, over 4270533.27 frames. ], batch size: 703, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:19:30,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=343770.0, ans=0.04949747468305833 2023-06-19 12:19:45,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=343830.0, ans=0.07 2023-06-19 12:20:31,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=343950.0, ans=0.125 2023-06-19 12:20:46,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=343950.0, ans=0.125 2023-06-19 12:21:26,712 INFO [train.py:996] (2/4) Epoch 2, batch 26850, loss[loss=0.237, simple_loss=0.295, pruned_loss=0.08957, over 21629.00 frames. ], tot_loss[loss=0.274, simple_loss=0.3369, pruned_loss=0.1056, over 4265160.80 frames. ], batch size: 298, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:21:49,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=344070.0, ans=10.0 2023-06-19 12:22:01,094 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.924e+02 3.551e+02 4.179e+02 8.286e+02, threshold=7.103e+02, percent-clipped=6.0 2023-06-19 12:22:36,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. 
limit=6.0 2023-06-19 12:22:42,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=344250.0, ans=0.125 2023-06-19 12:22:48,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=344250.0, ans=0.125 2023-06-19 12:22:51,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=344250.0, ans=15.0 2023-06-19 12:23:25,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=344370.0, ans=0.0 2023-06-19 12:23:26,397 INFO [train.py:996] (2/4) Epoch 2, batch 26900, loss[loss=0.2474, simple_loss=0.2959, pruned_loss=0.09945, over 21141.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3279, pruned_loss=0.1037, over 4261780.76 frames. ], batch size: 143, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:23:48,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0 2023-06-19 12:24:30,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=344490.0, ans=0.0 2023-06-19 12:24:57,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=344550.0, ans=0.125 2023-06-19 12:25:37,857 INFO [train.py:996] (2/4) Epoch 2, batch 26950, loss[loss=0.2496, simple_loss=0.3077, pruned_loss=0.09577, over 21761.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3267, pruned_loss=0.1041, over 4265143.95 frames. ], batch size: 112, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:26:19,477 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:26:25,924 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.623e+02 3.082e+02 3.619e+02 5.950e+02, threshold=6.165e+02, percent-clipped=0.0 2023-06-19 12:26:40,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=344730.0, ans=0.125 2023-06-19 12:26:52,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=344790.0, ans=0.125 2023-06-19 12:27:59,461 INFO [train.py:996] (2/4) Epoch 2, batch 27000, loss[loss=0.2653, simple_loss=0.3578, pruned_loss=0.08637, over 21135.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3261, pruned_loss=0.1012, over 4264587.11 frames. ], batch size: 548, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:27:59,462 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 12:28:58,708 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2596, simple_loss=0.3558, pruned_loss=0.08164, over 1796401.00 frames. 2023-06-19 12:28:58,717 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 12:29:04,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=344970.0, ans=0.1 2023-06-19 12:29:47,341 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.65 vs. 
limit=10.0 2023-06-19 12:30:16,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=345150.0, ans=0.04949747468305833 2023-06-19 12:30:36,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=345210.0, ans=0.0 2023-06-19 12:30:48,515 INFO [train.py:996] (2/4) Epoch 2, batch 27050, loss[loss=0.2739, simple_loss=0.3461, pruned_loss=0.1008, over 21923.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3276, pruned_loss=0.09673, over 4257734.66 frames. ], batch size: 372, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:31:23,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.340e+02 2.656e+02 3.215e+02 6.128e+02, threshold=5.313e+02, percent-clipped=0.0 2023-06-19 12:31:27,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=345330.0, ans=0.125 2023-06-19 12:31:55,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=345390.0, ans=0.0 2023-06-19 12:33:09,922 INFO [train.py:996] (2/4) Epoch 2, batch 27100, loss[loss=0.2731, simple_loss=0.3201, pruned_loss=0.1131, over 21677.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3307, pruned_loss=0.1, over 4272915.08 frames. ], batch size: 263, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:33:33,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=345630.0, ans=0.125 2023-06-19 12:34:12,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=345750.0, ans=0.0 2023-06-19 12:34:53,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=345810.0, ans=0.125 2023-06-19 12:35:13,362 INFO [train.py:996] (2/4) Epoch 2, batch 27150, loss[loss=0.2836, simple_loss=0.3669, pruned_loss=0.1002, over 21638.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3424, pruned_loss=0.103, over 4269518.54 frames. ], batch size: 263, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:35:19,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=345870.0, ans=0.0 2023-06-19 12:35:36,686 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.859e+02 3.282e+02 3.762e+02 6.108e+02, threshold=6.564e+02, percent-clipped=5.0 2023-06-19 12:36:04,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-19 12:36:33,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=346050.0, ans=0.0 2023-06-19 12:36:58,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.16 vs. limit=5.0 2023-06-19 12:37:10,440 INFO [train.py:996] (2/4) Epoch 2, batch 27200, loss[loss=0.3021, simple_loss=0.3656, pruned_loss=0.1193, over 21773.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3531, pruned_loss=0.1076, over 4276584.15 frames. ], batch size: 247, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:38:33,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.18 vs. 
limit=15.0 2023-06-19 12:38:33,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.20 vs. limit=10.0 2023-06-19 12:38:44,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=346350.0, ans=0.0 2023-06-19 12:38:56,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=346410.0, ans=0.0 2023-06-19 12:39:01,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=346410.0, ans=0.0 2023-06-19 12:39:12,436 INFO [train.py:996] (2/4) Epoch 2, batch 27250, loss[loss=0.2526, simple_loss=0.28, pruned_loss=0.1126, over 20089.00 frames. ], tot_loss[loss=0.2913, simple_loss=0.3566, pruned_loss=0.113, over 4275860.83 frames. ], batch size: 704, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:39:13,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-19 12:40:02,684 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.829e+02 3.133e+02 3.689e+02 6.969e+02, threshold=6.265e+02, percent-clipped=1.0 2023-06-19 12:40:53,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=346650.0, ans=0.2 2023-06-19 12:41:28,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=346710.0, ans=10.0 2023-06-19 12:41:38,424 INFO [train.py:996] (2/4) Epoch 2, batch 27300, loss[loss=0.2703, simple_loss=0.3488, pruned_loss=0.09593, over 21822.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3576, pruned_loss=0.1124, over 4278239.31 frames. ], batch size: 282, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:42:05,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=346830.0, ans=0.2 2023-06-19 12:42:40,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=346890.0, ans=0.1 2023-06-19 12:42:48,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=346890.0, ans=0.125 2023-06-19 12:43:35,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=347010.0, ans=0.2 2023-06-19 12:44:02,149 INFO [train.py:996] (2/4) Epoch 2, batch 27350, loss[loss=0.3645, simple_loss=0.4008, pruned_loss=0.1641, over 21511.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3604, pruned_loss=0.1141, over 4279020.76 frames. 
], batch size: 507, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:44:53,032 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.548e+02 2.973e+02 3.425e+02 6.706e+02, threshold=5.945e+02, percent-clipped=1.0 2023-06-19 12:44:54,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=347130.0, ans=0.125 2023-06-19 12:45:01,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=347130.0, ans=0.0 2023-06-19 12:45:27,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=347250.0, ans=0.125 2023-06-19 12:45:27,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=347250.0, ans=0.2 2023-06-19 12:46:09,206 INFO [train.py:996] (2/4) Epoch 2, batch 27400, loss[loss=0.2555, simple_loss=0.3121, pruned_loss=0.09944, over 21595.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3558, pruned_loss=0.1131, over 4284209.37 frames. ], batch size: 263, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:46:32,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.42 vs. limit=15.0 2023-06-19 12:47:14,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=347490.0, ans=0.125 2023-06-19 12:47:50,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=347610.0, ans=0.0 2023-06-19 12:48:07,713 INFO [train.py:996] (2/4) Epoch 2, batch 27450, loss[loss=0.2739, simple_loss=0.3458, pruned_loss=0.101, over 21682.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3489, pruned_loss=0.1112, over 4285442.22 frames. ], batch size: 247, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:48:52,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=347730.0, ans=0.2 2023-06-19 12:49:01,173 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.006e+02 3.444e+02 4.102e+02 6.886e+02, threshold=6.888e+02, percent-clipped=2.0 2023-06-19 12:49:02,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=347730.0, ans=0.0 2023-06-19 12:49:18,140 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.91 vs. limit=22.5 2023-06-19 12:50:20,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=347910.0, ans=0.125 2023-06-19 12:50:24,240 INFO [train.py:996] (2/4) Epoch 2, batch 27500, loss[loss=0.247, simple_loss=0.3001, pruned_loss=0.09689, over 21148.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3483, pruned_loss=0.1117, over 4293181.52 frames. 
], batch size: 608, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:50:32,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=347970.0, ans=0.0 2023-06-19 12:50:43,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=348030.0, ans=0.125 2023-06-19 12:51:03,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=348030.0, ans=0.125 2023-06-19 12:51:11,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=348090.0, ans=0.0 2023-06-19 12:51:28,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-06-19 12:52:21,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=348210.0, ans=15.0 2023-06-19 12:52:29,467 INFO [train.py:996] (2/4) Epoch 2, batch 27550, loss[loss=0.3695, simple_loss=0.4535, pruned_loss=0.1427, over 19984.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3434, pruned_loss=0.1081, over 4289672.25 frames. ], batch size: 702, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:52:51,332 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:53:08,723 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.657e+02 3.197e+02 3.865e+02 5.896e+02, threshold=6.395e+02, percent-clipped=0.0 2023-06-19 12:53:25,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=12.0 2023-06-19 12:53:50,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-19 12:53:51,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=348450.0, ans=0.05 2023-06-19 12:53:52,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=348450.0, ans=0.125 2023-06-19 12:54:10,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=348510.0, ans=0.125 2023-06-19 12:54:33,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-19 12:54:36,377 INFO [train.py:996] (2/4) Epoch 2, batch 27600, loss[loss=0.2692, simple_loss=0.3232, pruned_loss=0.1076, over 21817.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3361, pruned_loss=0.1065, over 4282384.71 frames. 
], batch size: 98, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:54:46,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=348570.0, ans=22.5 2023-06-19 12:55:32,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=348750.0, ans=0.2 2023-06-19 12:55:35,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=348750.0, ans=0.0 2023-06-19 12:56:19,785 INFO [train.py:996] (2/4) Epoch 2, batch 27650, loss[loss=0.2922, simple_loss=0.3451, pruned_loss=0.1196, over 21804.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.329, pruned_loss=0.1051, over 4277531.30 frames. ], batch size: 124, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:56:28,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=348870.0, ans=0.0 2023-06-19 12:56:45,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=348930.0, ans=0.0 2023-06-19 12:57:01,758 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.710e+02 3.133e+02 3.895e+02 7.823e+02, threshold=6.265e+02, percent-clipped=1.0 2023-06-19 12:57:02,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=348930.0, ans=0.0 2023-06-19 12:58:02,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349110.0, ans=0.1 2023-06-19 12:58:15,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.27 vs. limit=22.5 2023-06-19 12:58:17,477 INFO [train.py:996] (2/4) Epoch 2, batch 27700, loss[loss=0.2286, simple_loss=0.2971, pruned_loss=0.07999, over 21176.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3282, pruned_loss=0.1023, over 4280131.38 frames. ], batch size: 143, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:59:04,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=349230.0, ans=0.125 2023-06-19 12:59:16,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=349290.0, ans=0.0 2023-06-19 12:59:31,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=349350.0, ans=0.125 2023-06-19 13:00:30,254 INFO [train.py:996] (2/4) Epoch 2, batch 27750, loss[loss=0.2626, simple_loss=0.3004, pruned_loss=0.1124, over 20214.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3288, pruned_loss=0.1008, over 4280348.42 frames. ], batch size: 703, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 13:00:59,271 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:01:12,828 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.697e+02 3.256e+02 3.829e+02 6.521e+02, threshold=6.511e+02, percent-clipped=1.0 2023-06-19 13:01:30,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=349590.0, ans=0.2 2023-06-19 13:02:11,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.72 vs. 
limit=15.0 2023-06-19 13:02:32,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=349710.0, ans=0.125 2023-06-19 13:02:40,128 INFO [train.py:996] (2/4) Epoch 2, batch 27800, loss[loss=0.2875, simple_loss=0.3478, pruned_loss=0.1136, over 21896.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3294, pruned_loss=0.1018, over 4284187.37 frames. ], batch size: 118, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 13:03:08,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=349770.0, ans=0.125 2023-06-19 13:03:09,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=349770.0, ans=0.0 2023-06-19 13:03:17,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=349830.0, ans=0.0 2023-06-19 13:03:41,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=349890.0, ans=0.035 2023-06-19 13:04:44,755 INFO [train.py:996] (2/4) Epoch 2, batch 27850, loss[loss=0.2601, simple_loss=0.3152, pruned_loss=0.1025, over 21711.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3291, pruned_loss=0.1031, over 4294649.92 frames. ], batch size: 230, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 13:04:57,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=350070.0, ans=0.0 2023-06-19 13:05:11,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=350070.0, ans=0.05 2023-06-19 13:05:41,175 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.837e+02 3.296e+02 3.956e+02 7.636e+02, threshold=6.591e+02, percent-clipped=1.0 2023-06-19 13:05:51,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=350190.0, ans=0.125 2023-06-19 13:06:23,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=350250.0, ans=0.125 2023-06-19 13:06:37,194 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:06:38,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=350250.0, ans=0.125 2023-06-19 13:07:15,642 INFO [train.py:996] (2/4) Epoch 2, batch 27900, loss[loss=0.3494, simple_loss=0.4192, pruned_loss=0.1398, over 21482.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3395, pruned_loss=0.105, over 4287781.94 frames. ], batch size: 471, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 13:08:01,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-19 13:08:31,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=350550.0, ans=0.0 2023-06-19 13:08:38,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=350550.0, ans=0.125 2023-06-19 13:08:38,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.97 vs. 
limit=15.0 2023-06-19 13:09:20,637 INFO [train.py:996] (2/4) Epoch 2, batch 27950, loss[loss=0.2457, simple_loss=0.3136, pruned_loss=0.0889, over 21830.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3393, pruned_loss=0.1013, over 4282516.38 frames. ], batch size: 118, lr: 1.46e-02, grad_scale: 16.0 2023-06-19 13:09:22,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=350670.0, ans=0.125 2023-06-19 13:09:34,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-19 13:09:57,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.612e+02 3.167e+02 4.012e+02 7.863e+02, threshold=6.333e+02, percent-clipped=3.0 2023-06-19 13:10:31,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.03 vs. limit=15.0 2023-06-19 13:10:42,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-19 13:10:50,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-19 13:11:04,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=350850.0, ans=0.125 2023-06-19 13:11:28,374 INFO [train.py:996] (2/4) Epoch 2, batch 28000, loss[loss=0.2776, simple_loss=0.3527, pruned_loss=0.1013, over 21597.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.337, pruned_loss=0.09906, over 4281876.98 frames. ], batch size: 471, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:12:13,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=351030.0, ans=0.0 2023-06-19 13:12:34,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351090.0, ans=0.1 2023-06-19 13:12:38,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351150.0, ans=0.1 2023-06-19 13:13:20,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=351210.0, ans=0.0 2023-06-19 13:13:30,502 INFO [train.py:996] (2/4) Epoch 2, batch 28050, loss[loss=0.2577, simple_loss=0.309, pruned_loss=0.1032, over 20148.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.335, pruned_loss=0.1012, over 4285059.76 frames. ], batch size: 703, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:14:33,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.018e+02 3.592e+02 4.421e+02 7.489e+02, threshold=7.184e+02, percent-clipped=8.0 2023-06-19 13:15:31,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=351510.0, ans=0.125 2023-06-19 13:15:39,820 INFO [train.py:996] (2/4) Epoch 2, batch 28100, loss[loss=0.2475, simple_loss=0.3085, pruned_loss=0.09322, over 21990.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3332, pruned_loss=0.1013, over 4271734.14 frames. 
], batch size: 103, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:16:21,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=351630.0, ans=0.0 2023-06-19 13:16:22,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351630.0, ans=0.1 2023-06-19 13:16:38,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351690.0, ans=0.1 2023-06-19 13:17:07,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=351750.0, ans=0.125 2023-06-19 13:17:26,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351810.0, ans=0.1 2023-06-19 13:17:35,549 INFO [train.py:996] (2/4) Epoch 2, batch 28150, loss[loss=0.2216, simple_loss=0.2624, pruned_loss=0.09044, over 21251.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3257, pruned_loss=0.1019, over 4268523.36 frames. ], batch size: 551, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:18:30,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.813e+02 3.244e+02 3.974e+02 6.604e+02, threshold=6.487e+02, percent-clipped=0.0 2023-06-19 13:19:10,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=352050.0, ans=0.0 2023-06-19 13:19:11,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-19 13:19:25,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=352110.0, ans=0.0 2023-06-19 13:19:25,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=352110.0, ans=0.2 2023-06-19 13:19:36,321 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:19:37,283 INFO [train.py:996] (2/4) Epoch 2, batch 28200, loss[loss=0.2782, simple_loss=0.3369, pruned_loss=0.1097, over 21490.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3235, pruned_loss=0.1037, over 4265888.67 frames. ], batch size: 194, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:20:44,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=352290.0, ans=0.125 2023-06-19 13:21:16,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=352350.0, ans=0.125 2023-06-19 13:21:26,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=352410.0, ans=0.125 2023-06-19 13:21:43,611 INFO [train.py:996] (2/4) Epoch 2, batch 28250, loss[loss=0.2573, simple_loss=0.31, pruned_loss=0.1023, over 21774.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3293, pruned_loss=0.1067, over 4261981.37 frames. 
], batch size: 317, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:22:30,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.279e+02 3.953e+02 4.755e+02 7.153e+02, threshold=7.906e+02, percent-clipped=2.0 2023-06-19 13:23:35,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=18.41 vs. limit=15.0 2023-06-19 13:24:01,142 INFO [train.py:996] (2/4) Epoch 2, batch 28300, loss[loss=0.249, simple_loss=0.3373, pruned_loss=0.08034, over 21627.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.328, pruned_loss=0.1043, over 4251989.52 frames. ], batch size: 441, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:24:12,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=352770.0, ans=0.125 2023-06-19 13:24:28,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=352830.0, ans=0.2 2023-06-19 13:24:50,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=352830.0, ans=0.0 2023-06-19 13:26:10,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=353070.0, ans=0.04949747468305833 2023-06-19 13:26:11,214 INFO [train.py:996] (2/4) Epoch 2, batch 28350, loss[loss=0.2163, simple_loss=0.3012, pruned_loss=0.06567, over 21660.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3251, pruned_loss=0.09762, over 4259603.55 frames. ], batch size: 298, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:26:17,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=353070.0, ans=0.1 2023-06-19 13:26:58,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-19 13:27:10,663 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 2.403e+02 3.006e+02 3.946e+02 7.827e+02, threshold=6.012e+02, percent-clipped=0.0 2023-06-19 13:28:34,679 INFO [train.py:996] (2/4) Epoch 2, batch 28400, loss[loss=0.2537, simple_loss=0.3063, pruned_loss=0.1005, over 21755.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3204, pruned_loss=0.09771, over 4259465.12 frames. ], batch size: 282, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:29:27,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=353490.0, ans=0.04949747468305833 2023-06-19 13:29:27,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=353490.0, ans=0.1 2023-06-19 13:30:38,154 INFO [train.py:996] (2/4) Epoch 2, batch 28450, loss[loss=0.2658, simple_loss=0.3385, pruned_loss=0.09653, over 21467.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3276, pruned_loss=0.1023, over 4253990.47 frames. 
], batch size: 131, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:31:20,606 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 3.030e+02 3.579e+02 4.363e+02 8.439e+02, threshold=7.159e+02, percent-clipped=7.0 2023-06-19 13:31:33,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=353790.0, ans=0.2 2023-06-19 13:32:15,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=353910.0, ans=0.2 2023-06-19 13:32:47,137 INFO [train.py:996] (2/4) Epoch 2, batch 28500, loss[loss=0.309, simple_loss=0.3527, pruned_loss=0.1327, over 21493.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3306, pruned_loss=0.1059, over 4265398.87 frames. ], batch size: 548, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:32:56,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=20.72 vs. limit=15.0 2023-06-19 13:33:45,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354090.0, ans=0.1 2023-06-19 13:33:58,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=354150.0, ans=0.125 2023-06-19 13:34:01,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354150.0, ans=0.1 2023-06-19 13:34:09,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-19 13:34:49,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=354270.0, ans=0.0 2023-06-19 13:34:50,570 INFO [train.py:996] (2/4) Epoch 2, batch 28550, loss[loss=0.2867, simple_loss=0.3767, pruned_loss=0.09832, over 21623.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3373, pruned_loss=0.1082, over 4267900.81 frames. ], batch size: 263, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:35:14,717 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:35:35,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.880e+02 3.273e+02 3.942e+02 6.177e+02, threshold=6.546e+02, percent-clipped=0.0 2023-06-19 13:36:48,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=354510.0, ans=0.125 2023-06-19 13:36:58,155 INFO [train.py:996] (2/4) Epoch 2, batch 28600, loss[loss=0.3602, simple_loss=0.3978, pruned_loss=0.1613, over 21416.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.345, pruned_loss=0.111, over 4268211.48 frames. ], batch size: 471, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:37:14,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=354570.0, ans=0.04949747468305833 2023-06-19 13:37:57,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. 
limit=15.0 2023-06-19 13:38:31,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354750.0, ans=0.1 2023-06-19 13:38:56,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=354810.0, ans=0.125 2023-06-19 13:39:02,545 INFO [train.py:996] (2/4) Epoch 2, batch 28650, loss[loss=0.2544, simple_loss=0.305, pruned_loss=0.1019, over 21560.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3391, pruned_loss=0.1102, over 4266905.84 frames. ], batch size: 263, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:39:17,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354930.0, ans=0.1 2023-06-19 13:39:44,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.911e+02 3.310e+02 3.742e+02 7.048e+02, threshold=6.621e+02, percent-clipped=2.0 2023-06-19 13:39:44,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=354930.0, ans=0.07 2023-06-19 13:40:08,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-19 13:40:35,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.41 vs. limit=15.0 2023-06-19 13:41:07,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=355170.0, ans=0.0 2023-06-19 13:41:08,385 INFO [train.py:996] (2/4) Epoch 2, batch 28700, loss[loss=0.2836, simple_loss=0.3409, pruned_loss=0.1132, over 21244.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3382, pruned_loss=0.1111, over 4274943.97 frames. ], batch size: 176, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:41:19,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=355170.0, ans=0.125 2023-06-19 13:41:32,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=355170.0, ans=0.0 2023-06-19 13:42:12,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=355290.0, ans=0.0 2023-06-19 13:43:04,244 INFO [train.py:996] (2/4) Epoch 2, batch 28750, loss[loss=0.3422, simple_loss=0.4227, pruned_loss=0.1308, over 19819.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.339, pruned_loss=0.1114, over 4279729.94 frames. ], batch size: 703, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:43:58,435 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 2.722e+02 3.095e+02 3.469e+02 6.636e+02, threshold=6.190e+02, percent-clipped=1.0 2023-06-19 13:45:21,933 INFO [train.py:996] (2/4) Epoch 2, batch 28800, loss[loss=0.3246, simple_loss=0.3852, pruned_loss=0.132, over 21375.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3432, pruned_loss=0.1119, over 4282329.98 frames. 
], batch size: 159, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:45:57,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=355830.0, ans=0.125 2023-06-19 13:46:32,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355890.0, ans=0.1 2023-06-19 13:47:33,449 INFO [train.py:996] (2/4) Epoch 2, batch 28850, loss[loss=0.2871, simple_loss=0.3352, pruned_loss=0.1195, over 21526.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3446, pruned_loss=0.1133, over 4287399.70 frames. ], batch size: 548, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:47:34,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2023-06-19 13:48:12,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=356130.0, ans=10.0 2023-06-19 13:48:13,292 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 2.928e+02 3.488e+02 4.240e+02 7.326e+02, threshold=6.975e+02, percent-clipped=5.0 2023-06-19 13:48:46,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=356190.0, ans=0.125 2023-06-19 13:49:55,045 INFO [train.py:996] (2/4) Epoch 2, batch 28900, loss[loss=0.2787, simple_loss=0.3375, pruned_loss=0.11, over 21647.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3465, pruned_loss=0.1152, over 4282965.14 frames. ], batch size: 230, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:51:04,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=356490.0, ans=0.125 2023-06-19 13:52:05,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=356610.0, ans=0.0 2023-06-19 13:52:11,441 INFO [train.py:996] (2/4) Epoch 2, batch 28950, loss[loss=0.3192, simple_loss=0.3954, pruned_loss=0.1215, over 21541.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3444, pruned_loss=0.1125, over 4281395.69 frames. ], batch size: 471, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:52:20,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=356670.0, ans=0.0 2023-06-19 13:53:03,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=356730.0, ans=0.125 2023-06-19 13:53:15,938 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.845e+02 3.297e+02 3.891e+02 9.356e+02, threshold=6.594e+02, percent-clipped=2.0 2023-06-19 13:53:21,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.00 vs. 
limit=10.0 2023-06-19 13:53:51,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=356850.0, ans=0.125 2023-06-19 13:54:01,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=356850.0, ans=0.125 2023-06-19 13:54:19,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=356910.0, ans=0.0 2023-06-19 13:54:23,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=356910.0, ans=6.0 2023-06-19 13:54:36,134 INFO [train.py:996] (2/4) Epoch 2, batch 29000, loss[loss=0.306, simple_loss=0.3742, pruned_loss=0.1189, over 21373.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3471, pruned_loss=0.1115, over 4277949.99 frames. ], batch size: 143, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:54:54,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=356970.0, ans=0.125 2023-06-19 13:55:27,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.29 vs. limit=10.0 2023-06-19 13:55:59,597 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:56:45,647 INFO [train.py:996] (2/4) Epoch 2, batch 29050, loss[loss=0.2807, simple_loss=0.3389, pruned_loss=0.1113, over 21310.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3461, pruned_loss=0.1126, over 4278930.70 frames. ], batch size: 176, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:56:55,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=357270.0, ans=0.0 2023-06-19 13:57:13,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=357330.0, ans=0.0 2023-06-19 13:57:22,899 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.800e+02 3.368e+02 3.834e+02 6.942e+02, threshold=6.736e+02, percent-clipped=1.0 2023-06-19 13:58:23,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=357450.0, ans=0.125 2023-06-19 13:58:41,492 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-19 13:58:56,886 INFO [train.py:996] (2/4) Epoch 2, batch 29100, loss[loss=0.2569, simple_loss=0.3006, pruned_loss=0.1066, over 21499.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.336, pruned_loss=0.1091, over 4279417.17 frames. ], batch size: 441, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:59:38,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=357690.0, ans=0.2 2023-06-19 14:00:35,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-19 14:00:47,414 INFO [train.py:996] (2/4) Epoch 2, batch 29150, loss[loss=0.2475, simple_loss=0.3034, pruned_loss=0.09585, over 21710.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3361, pruned_loss=0.1069, over 4281148.99 frames. 
], batch size: 124, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:01:02,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=357870.0, ans=0.0 2023-06-19 14:01:02,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357870.0, ans=0.1 2023-06-19 14:01:30,659 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.793e+02 3.362e+02 4.160e+02 6.908e+02, threshold=6.724e+02, percent-clipped=1.0 2023-06-19 14:01:34,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=357990.0, ans=0.0 2023-06-19 14:01:36,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=357990.0, ans=0.125 2023-06-19 14:02:25,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.11 vs. limit=15.0 2023-06-19 14:02:38,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=358110.0, ans=0.0 2023-06-19 14:02:53,700 INFO [train.py:996] (2/4) Epoch 2, batch 29200, loss[loss=0.2867, simple_loss=0.342, pruned_loss=0.1157, over 21418.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.332, pruned_loss=0.1057, over 4282749.62 frames. ], batch size: 473, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:03:27,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2023-06-19 14:03:44,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=358290.0, ans=0.125 2023-06-19 14:03:45,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=358290.0, ans=0.0 2023-06-19 14:03:47,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=358290.0, ans=0.2 2023-06-19 14:04:58,585 INFO [train.py:996] (2/4) Epoch 2, batch 29250, loss[loss=0.2112, simple_loss=0.2866, pruned_loss=0.06784, over 21204.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3294, pruned_loss=0.1024, over 4279690.44 frames. ], batch size: 176, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:05:07,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-19 14:05:08,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=358470.0, ans=0.125 2023-06-19 14:05:23,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.478e+02 3.001e+02 3.725e+02 6.344e+02, threshold=6.002e+02, percent-clipped=0.0 2023-06-19 14:06:57,919 INFO [train.py:996] (2/4) Epoch 2, batch 29300, loss[loss=0.2506, simple_loss=0.3087, pruned_loss=0.09627, over 21817.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3318, pruned_loss=0.1023, over 4269405.87 frames. ], batch size: 317, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:09:07,177 INFO [train.py:996] (2/4) Epoch 2, batch 29350, loss[loss=0.3446, simple_loss=0.3953, pruned_loss=0.147, over 21494.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.329, pruned_loss=0.1021, over 4276266.68 frames. 
], batch size: 509, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:09:10,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=359070.0, ans=0.125 2023-06-19 14:10:02,476 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.578e+02 2.975e+02 3.361e+02 4.111e+02, threshold=5.949e+02, percent-clipped=0.0 2023-06-19 14:10:02,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=359130.0, ans=0.125 2023-06-19 14:10:27,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=359250.0, ans=0.125 2023-06-19 14:10:51,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=359310.0, ans=0.125 2023-06-19 14:11:03,630 INFO [train.py:996] (2/4) Epoch 2, batch 29400, loss[loss=0.254, simple_loss=0.338, pruned_loss=0.08501, over 21190.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3285, pruned_loss=0.09927, over 4283620.78 frames. ], batch size: 548, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:11:26,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=359370.0, ans=0.1 2023-06-19 14:11:35,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=359430.0, ans=0.0 2023-06-19 14:12:29,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-19 14:13:09,872 INFO [train.py:996] (2/4) Epoch 2, batch 29450, loss[loss=0.2942, simple_loss=0.3544, pruned_loss=0.117, over 21590.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3261, pruned_loss=0.09786, over 4280064.99 frames. ], batch size: 263, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:13:13,084 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:14:13,948 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.887e+02 3.363e+02 4.208e+02 7.799e+02, threshold=6.726e+02, percent-clipped=7.0 2023-06-19 14:14:20,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=359790.0, ans=0.125 2023-06-19 14:14:31,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=359790.0, ans=0.125 2023-06-19 14:14:59,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=359910.0, ans=0.125 2023-06-19 14:15:23,685 INFO [train.py:996] (2/4) Epoch 2, batch 29500, loss[loss=0.2862, simple_loss=0.3402, pruned_loss=0.1161, over 21866.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3325, pruned_loss=0.1032, over 4280804.05 frames. ], batch size: 371, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:15:56,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. 
limit=15.0 2023-06-19 14:16:03,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=360030.0, ans=0.125 2023-06-19 14:17:33,215 INFO [train.py:996] (2/4) Epoch 2, batch 29550, loss[loss=0.2699, simple_loss=0.3282, pruned_loss=0.1058, over 21322.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3319, pruned_loss=0.1048, over 4290534.81 frames. ], batch size: 176, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:18:20,043 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.713e+02 3.208e+02 3.753e+02 5.993e+02, threshold=6.415e+02, percent-clipped=0.0 2023-06-19 14:18:48,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=360450.0, ans=0.125 2023-06-19 14:18:59,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-19 14:19:09,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.60 vs. limit=22.5 2023-06-19 14:19:52,642 INFO [train.py:996] (2/4) Epoch 2, batch 29600, loss[loss=0.308, simple_loss=0.3819, pruned_loss=0.117, over 21637.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3392, pruned_loss=0.1078, over 4286870.30 frames. ], batch size: 389, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:20:20,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=360630.0, ans=0.0 2023-06-19 14:20:45,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=360690.0, ans=0.0 2023-06-19 14:20:48,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=360690.0, ans=0.0 2023-06-19 14:20:55,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=360690.0, ans=0.125 2023-06-19 14:21:10,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=21.21 vs. limit=15.0 2023-06-19 14:21:16,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=360750.0, ans=0.0 2023-06-19 14:21:16,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0 2023-06-19 14:21:19,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=360750.0, ans=0.125 2023-06-19 14:22:07,134 INFO [train.py:996] (2/4) Epoch 2, batch 29650, loss[loss=0.2769, simple_loss=0.332, pruned_loss=0.111, over 21839.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3349, pruned_loss=0.1031, over 4284282.44 frames. 
], batch size: 371, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:22:17,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=360870.0, ans=0.125 2023-06-19 14:22:20,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=360870.0, ans=0.02 2023-06-19 14:22:30,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360930.0, ans=0.1 2023-06-19 14:22:43,170 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.621e+02 3.260e+02 3.981e+02 6.748e+02, threshold=6.520e+02, percent-clipped=1.0 2023-06-19 14:23:22,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=361050.0, ans=0.0 2023-06-19 14:23:30,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361050.0, ans=0.1 2023-06-19 14:24:20,066 INFO [train.py:996] (2/4) Epoch 2, batch 29700, loss[loss=0.4038, simple_loss=0.4546, pruned_loss=0.1765, over 21565.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3382, pruned_loss=0.1047, over 4290750.22 frames. ], batch size: 507, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:26:05,126 INFO [train.py:996] (2/4) Epoch 2, batch 29750, loss[loss=0.2623, simple_loss=0.3469, pruned_loss=0.08887, over 21714.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3435, pruned_loss=0.1044, over 4291078.00 frames. ], batch size: 298, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:26:16,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.45 vs. limit=15.0 2023-06-19 14:26:17,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361470.0, ans=0.1 2023-06-19 14:26:32,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=361530.0, ans=0.0 2023-06-19 14:26:40,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361530.0, ans=0.1 2023-06-19 14:26:41,003 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.594e+02 3.244e+02 4.467e+02 8.629e+02, threshold=6.487e+02, percent-clipped=7.0 2023-06-19 14:26:58,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=361590.0, ans=0.2 2023-06-19 14:27:01,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=361590.0, ans=0.125 2023-06-19 14:27:32,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361650.0, ans=0.1 2023-06-19 14:28:12,796 INFO [train.py:996] (2/4) Epoch 2, batch 29800, loss[loss=0.3486, simple_loss=0.3746, pruned_loss=0.1612, over 21763.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3455, pruned_loss=0.106, over 4293572.63 frames. ], batch size: 508, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 14:28:41,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. 
limit=15.0 2023-06-19 14:28:51,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=361830.0, ans=0.1 2023-06-19 14:28:53,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-19 14:29:41,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-19 14:29:58,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=362010.0, ans=0.125 2023-06-19 14:30:10,958 INFO [train.py:996] (2/4) Epoch 2, batch 29850, loss[loss=0.234, simple_loss=0.3043, pruned_loss=0.08186, over 21796.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3407, pruned_loss=0.1037, over 4289170.49 frames. ], batch size: 247, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 14:30:25,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=362070.0, ans=0.125 2023-06-19 14:30:38,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=362130.0, ans=0.1 2023-06-19 14:30:47,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.554e+02 3.049e+02 3.847e+02 7.515e+02, threshold=6.099e+02, percent-clipped=1.0 2023-06-19 14:30:48,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=362130.0, ans=0.125 2023-06-19 14:31:04,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=362190.0, ans=0.125 2023-06-19 14:32:22,166 INFO [train.py:996] (2/4) Epoch 2, batch 29900, loss[loss=0.3065, simple_loss=0.3719, pruned_loss=0.1205, over 21792.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3386, pruned_loss=0.1046, over 4298037.46 frames. ], batch size: 118, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 14:33:27,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362490.0, ans=0.1 2023-06-19 14:33:29,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-19 14:34:31,645 INFO [train.py:996] (2/4) Epoch 2, batch 29950, loss[loss=0.3936, simple_loss=0.4122, pruned_loss=0.1875, over 21460.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3436, pruned_loss=0.11, over 4296226.93 frames. 
], batch size: 510, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:34:41,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=362670.0, ans=0.1 2023-06-19 14:35:14,923 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.991e+02 3.617e+02 4.391e+02 6.676e+02, threshold=7.234e+02, percent-clipped=4.0 2023-06-19 14:35:22,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=362790.0, ans=0.125 2023-06-19 14:35:33,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=362790.0, ans=0.1 2023-06-19 14:36:16,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362850.0, ans=0.1 2023-06-19 14:36:18,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=362910.0, ans=0.0 2023-06-19 14:36:36,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=362910.0, ans=0.125 2023-06-19 14:36:39,171 INFO [train.py:996] (2/4) Epoch 2, batch 30000, loss[loss=0.2574, simple_loss=0.3244, pruned_loss=0.09524, over 21141.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3453, pruned_loss=0.1095, over 4290588.25 frames. ], batch size: 143, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:36:39,172 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 14:37:25,482 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2591, simple_loss=0.3611, pruned_loss=0.07848, over 1796401.00 frames. 2023-06-19 14:37:25,486 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 14:37:28,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.19 vs. limit=15.0 2023-06-19 14:37:45,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=363030.0, ans=0.125 2023-06-19 14:37:58,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363030.0, ans=0.1 2023-06-19 14:38:15,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=363090.0, ans=0.125 2023-06-19 14:38:15,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=363090.0, ans=0.2 2023-06-19 14:39:25,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=363210.0, ans=0.125 2023-06-19 14:39:47,271 INFO [train.py:996] (2/4) Epoch 2, batch 30050, loss[loss=0.3205, simple_loss=0.4324, pruned_loss=0.1043, over 20818.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3511, pruned_loss=0.1077, over 4285375.75 frames. 
], batch size: 607, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:39:56,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=363270.0, ans=0.125 2023-06-19 14:40:02,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=363270.0, ans=0.125 2023-06-19 14:40:23,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=363330.0, ans=0.07 2023-06-19 14:40:25,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=363330.0, ans=0.125 2023-06-19 14:40:26,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.570e+02 3.053e+02 3.907e+02 7.518e+02, threshold=6.106e+02, percent-clipped=1.0 2023-06-19 14:41:21,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=363510.0, ans=0.0 2023-06-19 14:41:25,603 INFO [train.py:996] (2/4) Epoch 2, batch 30100, loss[loss=0.279, simple_loss=0.3227, pruned_loss=0.1177, over 21334.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3481, pruned_loss=0.1065, over 4281779.17 frames. ], batch size: 160, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:41:37,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=363570.0, ans=0.0 2023-06-19 14:41:55,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=363570.0, ans=0.0 2023-06-19 14:43:36,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=363870.0, ans=0.0 2023-06-19 14:43:37,591 INFO [train.py:996] (2/4) Epoch 2, batch 30150, loss[loss=0.3435, simple_loss=0.3856, pruned_loss=0.1507, over 21747.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3447, pruned_loss=0.1084, over 4281474.91 frames. ], batch size: 441, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:43:57,571 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=12.0 2023-06-19 14:44:01,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=363930.0, ans=0.125 2023-06-19 14:44:32,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.807e+02 3.248e+02 3.802e+02 5.683e+02, threshold=6.495e+02, percent-clipped=0.0 2023-06-19 14:44:33,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=363930.0, ans=0.0 2023-06-19 14:44:36,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363990.0, ans=0.1 2023-06-19 14:45:01,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=364050.0, ans=0.2 2023-06-19 14:45:53,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-19 14:45:53,860 INFO [train.py:996] (2/4) Epoch 2, batch 30200, loss[loss=0.267, simple_loss=0.332, pruned_loss=0.1011, over 21241.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3468, pruned_loss=0.1063, over 4276134.02 frames. 
], batch size: 159, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:46:44,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=364230.0, ans=0.0 2023-06-19 14:47:21,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=364350.0, ans=0.125 2023-06-19 14:47:53,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=364410.0, ans=0.125 2023-06-19 14:48:15,229 INFO [train.py:996] (2/4) Epoch 2, batch 30250, loss[loss=0.3283, simple_loss=0.4345, pruned_loss=0.1111, over 20778.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.354, pruned_loss=0.1087, over 4274501.14 frames. ], batch size: 607, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:48:45,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=364530.0, ans=0.0 2023-06-19 14:48:53,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.937e+02 3.483e+02 4.410e+02 7.312e+02, threshold=6.966e+02, percent-clipped=2.0 2023-06-19 14:49:22,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-19 14:49:29,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=364590.0, ans=0.1 2023-06-19 14:49:29,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=364590.0, ans=0.125 2023-06-19 14:49:48,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.99 vs. limit=22.5 2023-06-19 14:50:12,981 INFO [train.py:996] (2/4) Epoch 2, batch 30300, loss[loss=0.2612, simple_loss=0.3122, pruned_loss=0.1051, over 21758.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3495, pruned_loss=0.1082, over 4279686.54 frames. ], batch size: 372, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:50:25,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364770.0, ans=0.1 2023-06-19 14:50:26,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=364770.0, ans=0.0 2023-06-19 14:50:35,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. 
limit=15.0 2023-06-19 14:50:50,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=364830.0, ans=0.125 2023-06-19 14:51:10,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=364890.0, ans=0.0 2023-06-19 14:51:12,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=364890.0, ans=0.125 2023-06-19 14:51:25,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=364890.0, ans=0.0 2023-06-19 14:52:10,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=364950.0, ans=0.125 2023-06-19 14:52:44,516 INFO [train.py:996] (2/4) Epoch 2, batch 30350, loss[loss=0.2622, simple_loss=0.3191, pruned_loss=0.1027, over 21605.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.351, pruned_loss=0.1098, over 4279795.96 frames. ], batch size: 230, lr: 1.44e-02, grad_scale: 16.0 2023-06-19 14:52:48,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=365070.0, ans=0.125 2023-06-19 14:53:22,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-19 14:53:34,002 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.858e+02 3.357e+02 4.811e+02 8.525e+02, threshold=6.714e+02, percent-clipped=9.0 2023-06-19 14:54:44,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=365310.0, ans=0.0 2023-06-19 14:55:26,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=365310.0, ans=0.1 2023-06-19 14:55:31,014 INFO [train.py:996] (2/4) Epoch 2, batch 30400, loss[loss=0.259, simple_loss=0.2945, pruned_loss=0.1118, over 20216.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3417, pruned_loss=0.1067, over 4267868.66 frames. ], batch size: 703, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:56:12,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=365370.0, ans=0.1 2023-06-19 14:58:15,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=8.0 2023-06-19 14:58:17,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=365550.0, ans=0.125 2023-06-19 14:58:58,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=365550.0, ans=0.125 2023-06-19 14:59:39,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=365610.0, ans=0.0 2023-06-19 15:00:10,526 INFO [train.py:996] (2/4) Epoch 2, batch 30450, loss[loss=0.3508, simple_loss=0.4569, pruned_loss=0.1224, over 19874.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3446, pruned_loss=0.1088, over 4206871.24 frames. 
], batch size: 702, lr: 1.43e-02, grad_scale: 32.0 2023-06-19 15:01:05,921 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:01:46,279 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.753e+02 4.971e+02 7.783e+02 2.032e+03, threshold=9.942e+02, percent-clipped=30.0 2023-06-19 15:01:47,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=365790.0, ans=0.125 2023-06-19 15:01:48,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=365790.0, ans=0.07 2023-06-19 15:01:49,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.71 vs. limit=12.0 2023-06-19 15:02:16,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=365790.0, ans=0.0 2023-06-19 15:02:52,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-19 15:05:22,951 INFO [train.py:996] (2/4) Epoch 3, batch 0, loss[loss=0.2509, simple_loss=0.2914, pruned_loss=0.1052, over 20801.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.2914, pruned_loss=0.1052, over 20801.00 frames. ], batch size: 609, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:05:22,952 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 15:06:09,288 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2643, simple_loss=0.3711, pruned_loss=0.07872, over 1796401.00 frames. 2023-06-19 15:06:09,290 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 15:06:25,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=365934.0, ans=0.1 2023-06-19 15:06:33,742 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:06:43,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.09 vs. limit=22.5 2023-06-19 15:06:46,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366054.0, ans=0.1 2023-06-19 15:07:04,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=366114.0, ans=0.0 2023-06-19 15:07:13,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366114.0, ans=0.1 2023-06-19 15:07:40,175 INFO [train.py:996] (2/4) Epoch 3, batch 50, loss[loss=0.3166, simple_loss=0.3966, pruned_loss=0.1184, over 21632.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3511, pruned_loss=0.1083, over 956367.11 frames. 
], batch size: 389, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:08:23,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=366294.0, ans=0.125 2023-06-19 15:08:45,477 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.896e+02 3.528e+02 5.821e+02 1.512e+03, threshold=7.056e+02, percent-clipped=7.0 2023-06-19 15:09:09,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=366414.0, ans=0.125 2023-06-19 15:09:16,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=366474.0, ans=0.125 2023-06-19 15:09:46,258 INFO [train.py:996] (2/4) Epoch 3, batch 100, loss[loss=0.2791, simple_loss=0.362, pruned_loss=0.09812, over 21814.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3629, pruned_loss=0.1089, over 1694182.46 frames. ], batch size: 316, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:10:16,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.03 vs. limit=22.5 2023-06-19 15:10:43,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-19 15:11:01,800 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:11:19,674 INFO [train.py:996] (2/4) Epoch 3, batch 150, loss[loss=0.2571, simple_loss=0.3362, pruned_loss=0.089, over 21373.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.364, pruned_loss=0.1086, over 2257609.42 frames. ], batch size: 194, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:11:25,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=366834.0, ans=0.125 2023-06-19 15:11:37,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-19 15:11:38,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-19 15:11:57,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366894.0, ans=0.1 2023-06-19 15:12:12,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.572e+02 2.987e+02 3.801e+02 6.423e+02, threshold=5.974e+02, percent-clipped=0.0 2023-06-19 15:13:13,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=367074.0, ans=0.0 2023-06-19 15:13:25,868 INFO [train.py:996] (2/4) Epoch 3, batch 200, loss[loss=0.3621, simple_loss=0.4382, pruned_loss=0.143, over 21500.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3595, pruned_loss=0.1074, over 2691670.42 frames. 
], batch size: 471, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:13:59,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=367194.0, ans=0.0 2023-06-19 15:15:04,331 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:15:25,188 INFO [train.py:996] (2/4) Epoch 3, batch 250, loss[loss=0.276, simple_loss=0.3277, pruned_loss=0.1121, over 21895.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3544, pruned_loss=0.1055, over 3046497.89 frames. ], batch size: 298, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:15:25,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=367434.0, ans=0.0 2023-06-19 15:15:31,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=367434.0, ans=0.0 2023-06-19 15:16:07,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=367494.0, ans=0.0 2023-06-19 15:16:20,116 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.820e+02 3.135e+02 3.949e+02 6.710e+02, threshold=6.270e+02, percent-clipped=4.0 2023-06-19 15:16:49,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=367674.0, ans=0.0 2023-06-19 15:17:07,727 INFO [train.py:996] (2/4) Epoch 3, batch 300, loss[loss=0.2613, simple_loss=0.315, pruned_loss=0.1039, over 21484.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3479, pruned_loss=0.1039, over 3304558.59 frames. ], batch size: 194, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:18:31,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=367854.0, ans=0.0 2023-06-19 15:18:54,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=367914.0, ans=0.125 2023-06-19 15:19:02,302 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-19 15:19:06,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=367974.0, ans=0.125 2023-06-19 15:19:27,165 INFO [train.py:996] (2/4) Epoch 3, batch 350, loss[loss=0.2994, simple_loss=0.3817, pruned_loss=0.1086, over 21664.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3415, pruned_loss=0.103, over 3520187.00 frames. ], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:19:44,710 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:20:04,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=368094.0, ans=0.0 2023-06-19 15:20:27,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.616e+02 3.033e+02 3.600e+02 6.018e+02, threshold=6.066e+02, percent-clipped=0.0 2023-06-19 15:21:07,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=368274.0, ans=0.0 2023-06-19 15:21:14,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.35 vs. 
limit=15.0 2023-06-19 15:21:19,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368334.0, ans=0.1 2023-06-19 15:21:20,088 INFO [train.py:996] (2/4) Epoch 3, batch 400, loss[loss=0.277, simple_loss=0.3422, pruned_loss=0.1059, over 21642.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3328, pruned_loss=0.1006, over 3690559.06 frames. ], batch size: 415, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:21:34,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=368334.0, ans=0.125 2023-06-19 15:21:36,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=368334.0, ans=0.125 2023-06-19 15:21:55,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368334.0, ans=0.1 2023-06-19 15:23:08,871 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-19 15:23:12,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=368574.0, ans=0.2 2023-06-19 15:23:14,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=368574.0, ans=0.125 2023-06-19 15:23:15,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=368574.0, ans=0.0 2023-06-19 15:23:44,682 INFO [train.py:996] (2/4) Epoch 3, batch 450, loss[loss=0.273, simple_loss=0.342, pruned_loss=0.102, over 21918.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3279, pruned_loss=0.09945, over 3823103.54 frames. ], batch size: 316, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:24:18,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=368634.0, ans=0.125 2023-06-19 15:24:18,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=368634.0, ans=0.5 2023-06-19 15:24:53,236 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.675e+02 2.675e+02 3.174e+02 4.141e+02 7.803e+02, threshold=6.347e+02, percent-clipped=3.0 2023-06-19 15:25:10,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-06-19 15:25:28,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368814.0, ans=0.1 2023-06-19 15:25:34,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=368874.0, ans=0.035 2023-06-19 15:25:59,101 INFO [train.py:996] (2/4) Epoch 3, batch 500, loss[loss=0.222, simple_loss=0.2833, pruned_loss=0.08039, over 21969.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3285, pruned_loss=0.0963, over 3924097.83 frames. ], batch size: 119, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:26:19,238 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:26:38,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.66 vs. 
limit=15.0 2023-06-19 15:26:56,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=368994.0, ans=0.2 2023-06-19 15:28:12,608 INFO [train.py:996] (2/4) Epoch 3, batch 550, loss[loss=0.2406, simple_loss=0.2979, pruned_loss=0.09165, over 21835.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3331, pruned_loss=0.09673, over 3999962.83 frames. ], batch size: 98, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:29:00,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=369354.0, ans=0.2 2023-06-19 15:29:06,951 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.826e+02 3.270e+02 4.070e+02 7.651e+02, threshold=6.541e+02, percent-clipped=1.0 2023-06-19 15:29:24,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=369354.0, ans=0.125 2023-06-19 15:29:38,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=369414.0, ans=0.015 2023-06-19 15:29:39,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=369414.0, ans=0.1 2023-06-19 15:30:01,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=369474.0, ans=0.125 2023-06-19 15:30:15,622 INFO [train.py:996] (2/4) Epoch 3, batch 600, loss[loss=0.2385, simple_loss=0.3021, pruned_loss=0.08749, over 21731.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3342, pruned_loss=0.0964, over 4059551.73 frames. ], batch size: 112, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:30:41,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=369594.0, ans=0.125 2023-06-19 15:31:32,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=369714.0, ans=0.125 2023-06-19 15:31:47,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=369774.0, ans=0.0 2023-06-19 15:32:18,090 INFO [train.py:996] (2/4) Epoch 3, batch 650, loss[loss=0.2923, simple_loss=0.3409, pruned_loss=0.1218, over 21825.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3344, pruned_loss=0.09578, over 4113541.11 frames. ], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:32:40,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369894.0, ans=0.1 2023-06-19 15:33:22,687 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.840e+02 3.848e+02 4.493e+02 8.755e+02, threshold=7.695e+02, percent-clipped=3.0 2023-06-19 15:34:26,364 INFO [train.py:996] (2/4) Epoch 3, batch 700, loss[loss=0.3511, simple_loss=0.4221, pruned_loss=0.14, over 21695.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3364, pruned_loss=0.09729, over 4155117.03 frames. ], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:34:30,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. 
limit=15.0 2023-06-19 15:34:53,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=370194.0, ans=0.09899494936611666 2023-06-19 15:36:29,391 INFO [train.py:996] (2/4) Epoch 3, batch 750, loss[loss=0.2691, simple_loss=0.3325, pruned_loss=0.1029, over 21841.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3361, pruned_loss=0.09912, over 4190860.95 frames. ], batch size: 124, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:37:12,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=370494.0, ans=0.2 2023-06-19 15:37:23,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=370554.0, ans=0.1 2023-06-19 15:37:24,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=370554.0, ans=0.125 2023-06-19 15:37:31,642 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.860e+02 3.172e+02 4.099e+02 8.438e+02, threshold=6.343e+02, percent-clipped=1.0 2023-06-19 15:37:51,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=370614.0, ans=0.2 2023-06-19 15:37:59,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-19 15:38:37,266 INFO [train.py:996] (2/4) Epoch 3, batch 800, loss[loss=0.2284, simple_loss=0.2827, pruned_loss=0.08701, over 21588.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3324, pruned_loss=0.09926, over 4204892.78 frames. ], batch size: 247, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:38:41,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=8.0 2023-06-19 15:39:20,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=370794.0, ans=0.0 2023-06-19 15:39:20,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=370794.0, ans=0.0 2023-06-19 15:39:52,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=370914.0, ans=0.2 2023-06-19 15:40:05,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-19 15:40:14,526 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.71 vs. limit=15.0 2023-06-19 15:40:36,935 INFO [train.py:996] (2/4) Epoch 3, batch 850, loss[loss=0.2599, simple_loss=0.313, pruned_loss=0.1034, over 21812.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3291, pruned_loss=0.09801, over 4216608.97 frames. ], batch size: 298, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:41:34,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.26 vs. 
limit=10.0 2023-06-19 15:41:35,041 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.741e+02 3.074e+02 3.686e+02 5.946e+02, threshold=6.148e+02, percent-clipped=0.0 2023-06-19 15:42:15,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=371274.0, ans=0.2 2023-06-19 15:42:15,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=371274.0, ans=0.0 2023-06-19 15:42:37,791 INFO [train.py:996] (2/4) Epoch 3, batch 900, loss[loss=0.2471, simple_loss=0.3252, pruned_loss=0.08457, over 21091.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3267, pruned_loss=0.09883, over 4227602.11 frames. ], batch size: 608, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:42:39,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=371334.0, ans=0.2 2023-06-19 15:42:47,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=371334.0, ans=0.0 2023-06-19 15:42:49,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=371334.0, ans=0.0 2023-06-19 15:42:52,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371334.0, ans=0.1 2023-06-19 15:42:56,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371394.0, ans=0.1 2023-06-19 15:43:08,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371394.0, ans=0.1 2023-06-19 15:43:08,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=371394.0, ans=0.125 2023-06-19 15:43:20,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=371394.0, ans=0.0 2023-06-19 15:43:58,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=371514.0, ans=0.2 2023-06-19 15:44:09,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=371514.0, ans=0.125 2023-06-19 15:44:19,022 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:44:40,529 INFO [train.py:996] (2/4) Epoch 3, batch 950, loss[loss=0.216, simple_loss=0.2946, pruned_loss=0.06866, over 21826.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3267, pruned_loss=0.0994, over 4243158.28 frames. ], batch size: 282, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:44:49,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=371634.0, ans=0.0 2023-06-19 15:45:27,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. 
limit=15.0 2023-06-19 15:45:35,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=371754.0, ans=0.0 2023-06-19 15:45:36,730 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.723e+02 3.079e+02 3.859e+02 5.682e+02, threshold=6.158e+02, percent-clipped=0.0 2023-06-19 15:45:37,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=371754.0, ans=0.05 2023-06-19 15:45:58,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=371814.0, ans=0.125 2023-06-19 15:46:27,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-19 15:46:40,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=371874.0, ans=0.125 2023-06-19 15:46:46,775 INFO [train.py:996] (2/4) Epoch 3, batch 1000, loss[loss=0.2348, simple_loss=0.2886, pruned_loss=0.09053, over 21561.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3267, pruned_loss=0.09911, over 4258555.77 frames. ], batch size: 263, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:47:34,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.80 vs. limit=22.5 2023-06-19 15:47:35,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=371994.0, ans=0.0 2023-06-19 15:47:36,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371994.0, ans=0.1 2023-06-19 15:47:47,969 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:47:58,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=372054.0, ans=0.125 2023-06-19 15:48:03,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372114.0, ans=0.1 2023-06-19 15:48:34,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=372174.0, ans=0.1 2023-06-19 15:49:10,912 INFO [train.py:996] (2/4) Epoch 3, batch 1050, loss[loss=0.2593, simple_loss=0.3275, pruned_loss=0.09553, over 21860.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3258, pruned_loss=0.0988, over 4271412.79 frames. ], batch size: 332, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:49:42,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=372294.0, ans=0.125 2023-06-19 15:49:53,861 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.550e+02 3.026e+02 4.001e+02 6.814e+02, threshold=6.053e+02, percent-clipped=1.0 2023-06-19 15:51:06,129 INFO [train.py:996] (2/4) Epoch 3, batch 1100, loss[loss=0.3098, simple_loss=0.3578, pruned_loss=0.1309, over 21737.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.328, pruned_loss=0.09889, over 4277749.31 frames. 
], batch size: 414, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:51:37,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=372594.0, ans=0.0 2023-06-19 15:53:23,746 INFO [train.py:996] (2/4) Epoch 3, batch 1150, loss[loss=0.3541, simple_loss=0.397, pruned_loss=0.1556, over 21449.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3284, pruned_loss=0.09938, over 4283288.95 frames. ], batch size: 471, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:54:36,937 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.547e+02 3.199e+02 3.726e+02 8.923e+02, threshold=6.397e+02, percent-clipped=6.0 2023-06-19 15:54:50,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=373014.0, ans=0.125 2023-06-19 15:55:37,476 INFO [train.py:996] (2/4) Epoch 3, batch 1200, loss[loss=0.2461, simple_loss=0.3293, pruned_loss=0.08144, over 21618.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3289, pruned_loss=0.09849, over 4282673.57 frames. ], batch size: 230, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:56:10,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=373194.0, ans=0.04949747468305833 2023-06-19 15:56:30,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=373194.0, ans=0.0 2023-06-19 15:57:25,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=373374.0, ans=0.125 2023-06-19 15:57:45,895 INFO [train.py:996] (2/4) Epoch 3, batch 1250, loss[loss=0.2757, simple_loss=0.3415, pruned_loss=0.105, over 21819.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3317, pruned_loss=0.09893, over 4282259.14 frames. ], batch size: 112, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:58:56,586 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.686e+02 3.168e+02 3.917e+02 6.417e+02, threshold=6.337e+02, percent-clipped=1.0 2023-06-19 15:59:19,365 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:59:37,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=373674.0, ans=0.0 2023-06-19 15:59:44,116 INFO [train.py:996] (2/4) Epoch 3, batch 1300, loss[loss=0.2761, simple_loss=0.3616, pruned_loss=0.09534, over 21813.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3306, pruned_loss=0.0979, over 4275365.64 frames. ], batch size: 316, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 16:00:47,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=15.0 2023-06-19 16:01:14,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=373914.0, ans=0.0 2023-06-19 16:01:30,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=373974.0, ans=0.07 2023-06-19 16:01:52,288 INFO [train.py:996] (2/4) Epoch 3, batch 1350, loss[loss=0.3485, simple_loss=0.3948, pruned_loss=0.1511, over 21452.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3321, pruned_loss=0.1, over 4277232.36 frames. 
], batch size: 471, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:02:06,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=374094.0, ans=0.125 2023-06-19 16:02:52,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.921e+02 3.498e+02 4.356e+02 8.229e+02, threshold=6.996e+02, percent-clipped=3.0 2023-06-19 16:03:28,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-19 16:03:51,335 INFO [train.py:996] (2/4) Epoch 3, batch 1400, loss[loss=0.2494, simple_loss=0.3305, pruned_loss=0.08415, over 21750.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3321, pruned_loss=0.09979, over 4279515.02 frames. ], batch size: 298, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:04:15,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=374394.0, ans=0.0 2023-06-19 16:04:17,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=374394.0, ans=0.125 2023-06-19 16:05:54,423 INFO [train.py:996] (2/4) Epoch 3, batch 1450, loss[loss=0.3174, simple_loss=0.3684, pruned_loss=0.1332, over 21219.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3314, pruned_loss=0.1006, over 4274167.14 frames. ], batch size: 143, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:05:55,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-19 16:05:56,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=374634.0, ans=0.125 2023-06-19 16:06:37,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=374694.0, ans=0.125 2023-06-19 16:06:52,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-19 16:06:55,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.676e+02 2.954e+02 3.782e+02 5.807e+02, threshold=5.909e+02, percent-clipped=0.0 2023-06-19 16:07:21,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. limit=10.0 2023-06-19 16:07:35,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=374874.0, ans=0.125 2023-06-19 16:08:02,897 INFO [train.py:996] (2/4) Epoch 3, batch 1500, loss[loss=0.2248, simple_loss=0.3206, pruned_loss=0.06456, over 20959.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3329, pruned_loss=0.102, over 4276648.44 frames. ], batch size: 607, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:09:00,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. 
limit=15.0 2023-06-19 16:09:11,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=375054.0, ans=0.0 2023-06-19 16:10:05,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=375174.0, ans=0.0 2023-06-19 16:10:13,231 INFO [train.py:996] (2/4) Epoch 3, batch 1550, loss[loss=0.2057, simple_loss=0.295, pruned_loss=0.05821, over 21751.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3319, pruned_loss=0.1018, over 4280575.64 frames. ], batch size: 351, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:11:10,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.586e+02 2.974e+02 3.700e+02 5.786e+02, threshold=5.949e+02, percent-clipped=0.0 2023-06-19 16:11:29,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=375414.0, ans=0.125 2023-06-19 16:12:07,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=375474.0, ans=0.125 2023-06-19 16:12:22,183 INFO [train.py:996] (2/4) Epoch 3, batch 1600, loss[loss=0.2793, simple_loss=0.3424, pruned_loss=0.1081, over 21709.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3284, pruned_loss=0.09904, over 4276538.43 frames. ], batch size: 351, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:12:22,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=375534.0, ans=0.0 2023-06-19 16:12:24,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=375534.0, ans=0.0 2023-06-19 16:13:02,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=375594.0, ans=0.125 2023-06-19 16:14:36,776 INFO [train.py:996] (2/4) Epoch 3, batch 1650, loss[loss=0.2866, simple_loss=0.3387, pruned_loss=0.1172, over 21615.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3293, pruned_loss=0.09871, over 4279799.14 frames. ], batch size: 471, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:15:38,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0 2023-06-19 16:15:44,522 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.739e+02 3.147e+02 3.944e+02 6.533e+02, threshold=6.293e+02, percent-clipped=5.0 2023-06-19 16:17:09,617 INFO [train.py:996] (2/4) Epoch 3, batch 1700, loss[loss=0.2922, simple_loss=0.376, pruned_loss=0.1042, over 21752.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3355, pruned_loss=0.1011, over 4277612.17 frames. ], batch size: 351, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:17:13,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-19 16:17:33,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=376194.0, ans=0.0 2023-06-19 16:19:01,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=376314.0, ans=0.0 2023-06-19 16:19:35,964 INFO [train.py:996] (2/4) Epoch 3, batch 1750, loss[loss=0.2134, simple_loss=0.3018, pruned_loss=0.06253, over 21720.00 frames. 
], tot_loss[loss=0.2668, simple_loss=0.3353, pruned_loss=0.09921, over 4270619.45 frames. ], batch size: 351, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:19:55,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=376494.0, ans=0.125 2023-06-19 16:20:46,784 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.780e+02 3.176e+02 3.729e+02 7.377e+02, threshold=6.353e+02, percent-clipped=1.0 2023-06-19 16:20:56,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-19 16:21:47,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5 2023-06-19 16:21:50,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-19 16:21:59,651 INFO [train.py:996] (2/4) Epoch 3, batch 1800, loss[loss=0.26, simple_loss=0.3337, pruned_loss=0.0932, over 21627.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3304, pruned_loss=0.09567, over 4267745.46 frames. ], batch size: 263, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:22:04,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=376734.0, ans=0.1 2023-06-19 16:22:22,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=376734.0, ans=0.09899494936611666 2023-06-19 16:23:24,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=376854.0, ans=0.0 2023-06-19 16:23:33,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=376914.0, ans=0.125 2023-06-19 16:23:45,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.92 vs. limit=15.0 2023-06-19 16:23:49,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.26 vs. limit=6.0 2023-06-19 16:24:07,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=377034.0, ans=0.1 2023-06-19 16:24:08,970 INFO [train.py:996] (2/4) Epoch 3, batch 1850, loss[loss=0.2576, simple_loss=0.3272, pruned_loss=0.09401, over 21799.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3312, pruned_loss=0.09355, over 4274576.21 frames. 
], batch size: 298, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:24:09,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=377034.0, ans=0.125 2023-06-19 16:24:32,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=377034.0, ans=0.2 2023-06-19 16:24:33,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=377034.0, ans=0.0 2023-06-19 16:25:05,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=377154.0, ans=0.1 2023-06-19 16:25:27,903 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.637e+02 3.036e+02 3.803e+02 8.113e+02, threshold=6.071e+02, percent-clipped=1.0 2023-06-19 16:25:36,498 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:26:29,756 INFO [train.py:996] (2/4) Epoch 3, batch 1900, loss[loss=0.2357, simple_loss=0.3176, pruned_loss=0.07693, over 21232.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3299, pruned_loss=0.09313, over 4276540.49 frames. ], batch size: 176, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:26:30,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=377334.0, ans=0.125 2023-06-19 16:26:55,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=377334.0, ans=0.125 2023-06-19 16:27:00,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=377394.0, ans=0.0 2023-06-19 16:27:36,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-19 16:28:25,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=377574.0, ans=0.1 2023-06-19 16:28:29,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=377574.0, ans=0.04949747468305833 2023-06-19 16:28:36,280 INFO [train.py:996] (2/4) Epoch 3, batch 1950, loss[loss=0.3288, simple_loss=0.3922, pruned_loss=0.1327, over 21594.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3244, pruned_loss=0.09286, over 4277324.68 frames. 
], batch size: 441, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:29:44,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=377754.0, ans=0.125 2023-06-19 16:29:53,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=377754.0, ans=0.0 2023-06-19 16:29:56,252 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.787e+02 3.264e+02 3.675e+02 5.955e+02, threshold=6.529e+02, percent-clipped=0.0 2023-06-19 16:30:08,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=377814.0, ans=0.1 2023-06-19 16:30:11,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=377814.0, ans=0.2 2023-06-19 16:30:51,317 INFO [train.py:996] (2/4) Epoch 3, batch 2000, loss[loss=0.2786, simple_loss=0.3547, pruned_loss=0.1012, over 21552.00 frames. ], tot_loss[loss=0.254, simple_loss=0.323, pruned_loss=0.0925, over 4270447.86 frames. ], batch size: 441, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:31:55,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.93 vs. limit=22.5 2023-06-19 16:32:08,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=378054.0, ans=22.5 2023-06-19 16:32:27,055 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.93 vs. limit=15.0 2023-06-19 16:32:34,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=378114.0, ans=0.2 2023-06-19 16:32:58,797 INFO [train.py:996] (2/4) Epoch 3, batch 2050, loss[loss=0.2989, simple_loss=0.399, pruned_loss=0.09937, over 21266.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3252, pruned_loss=0.09309, over 4272055.20 frames. ], batch size: 548, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:34:00,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=378354.0, ans=0.125 2023-06-19 16:34:07,969 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.758e+02 3.171e+02 3.876e+02 8.323e+02, threshold=6.343e+02, percent-clipped=2.0 2023-06-19 16:34:47,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=378474.0, ans=0.2 2023-06-19 16:34:54,471 INFO [train.py:996] (2/4) Epoch 3, batch 2100, loss[loss=0.3296, simple_loss=0.3959, pruned_loss=0.1317, over 21654.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3288, pruned_loss=0.0964, over 4277816.19 frames. 
], batch size: 414, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:34:56,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=378534.0, ans=0.05 2023-06-19 16:34:59,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378534.0, ans=0.1 2023-06-19 16:35:00,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=378534.0, ans=0.125 2023-06-19 16:36:39,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=378774.0, ans=0.0 2023-06-19 16:37:17,378 INFO [train.py:996] (2/4) Epoch 3, batch 2150, loss[loss=0.2229, simple_loss=0.2819, pruned_loss=0.08195, over 21603.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3267, pruned_loss=0.09598, over 4274006.18 frames. ], batch size: 298, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:38:23,709 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.947e+02 3.794e+02 4.834e+02 7.445e+02, threshold=7.587e+02, percent-clipped=4.0 2023-06-19 16:38:25,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=378954.0, ans=0.125 2023-06-19 16:39:18,290 INFO [train.py:996] (2/4) Epoch 3, batch 2200, loss[loss=0.3612, simple_loss=0.4108, pruned_loss=0.1558, over 21485.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3308, pruned_loss=0.09672, over 4268934.98 frames. ], batch size: 471, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:39:32,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=379134.0, ans=0.0 2023-06-19 16:39:51,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=379194.0, ans=0.0 2023-06-19 16:40:59,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=379314.0, ans=0.035 2023-06-19 16:41:06,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=379314.0, ans=0.125 2023-06-19 16:41:38,419 INFO [train.py:996] (2/4) Epoch 3, batch 2250, loss[loss=0.2327, simple_loss=0.3114, pruned_loss=0.07694, over 21746.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3298, pruned_loss=0.09562, over 4268931.09 frames. 
], batch size: 298, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:41:38,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=379434.0, ans=0.125 2023-06-19 16:42:34,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=379554.0, ans=0.0 2023-06-19 16:42:47,066 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.563e+02 3.119e+02 3.980e+02 5.506e+02, threshold=6.238e+02, percent-clipped=0.0 2023-06-19 16:43:07,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=379614.0, ans=0.1 2023-06-19 16:43:18,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=379674.0, ans=0.125 2023-06-19 16:43:24,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=379674.0, ans=0.125 2023-06-19 16:43:46,265 INFO [train.py:996] (2/4) Epoch 3, batch 2300, loss[loss=0.2483, simple_loss=0.3022, pruned_loss=0.09722, over 21842.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.323, pruned_loss=0.09461, over 4277259.06 frames. ], batch size: 118, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:43:51,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=379734.0, ans=0.125 2023-06-19 16:43:51,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=379734.0, ans=0.2 2023-06-19 16:43:58,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=379734.0, ans=0.0 2023-06-19 16:44:30,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=379794.0, ans=0.5 2023-06-19 16:44:52,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=379854.0, ans=0.125 2023-06-19 16:44:57,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=379854.0, ans=0.1 2023-06-19 16:45:04,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=379914.0, ans=0.0 2023-06-19 16:45:35,715 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:45:41,106 INFO [train.py:996] (2/4) Epoch 3, batch 2350, loss[loss=0.2127, simple_loss=0.2737, pruned_loss=0.0758, over 21231.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3207, pruned_loss=0.09469, over 4267034.59 frames. ], batch size: 176, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:46:38,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=380094.0, ans=0.0 2023-06-19 16:46:58,129 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.669e+02 3.076e+02 3.679e+02 5.519e+02, threshold=6.152e+02, percent-clipped=0.0 2023-06-19 16:47:15,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.07 vs. 
limit=15.0 2023-06-19 16:48:11,840 INFO [train.py:996] (2/4) Epoch 3, batch 2400, loss[loss=0.2928, simple_loss=0.3503, pruned_loss=0.1177, over 21828.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3244, pruned_loss=0.0974, over 4261366.17 frames. ], batch size: 282, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:49:38,287 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:50:33,700 INFO [train.py:996] (2/4) Epoch 3, batch 2450, loss[loss=0.2666, simple_loss=0.3129, pruned_loss=0.1102, over 21836.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3311, pruned_loss=0.09959, over 4262471.91 frames. ], batch size: 98, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:51:23,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=380754.0, ans=0.2 2023-06-19 16:51:32,009 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.697e+02 3.047e+02 3.544e+02 7.014e+02, threshold=6.094e+02, percent-clipped=1.0 2023-06-19 16:51:39,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=380814.0, ans=0.1 2023-06-19 16:51:43,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-19 16:51:51,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380874.0, ans=0.1 2023-06-19 16:52:22,743 INFO [train.py:996] (2/4) Epoch 3, batch 2500, loss[loss=0.2205, simple_loss=0.2774, pruned_loss=0.08185, over 21561.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3293, pruned_loss=0.09869, over 4256705.49 frames. ], batch size: 263, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:52:48,086 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:53:07,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-19 16:53:31,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=381114.0, ans=0.125 2023-06-19 16:54:19,206 INFO [train.py:996] (2/4) Epoch 3, batch 2550, loss[loss=0.2595, simple_loss=0.3186, pruned_loss=0.1002, over 21833.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3276, pruned_loss=0.0988, over 4252323.12 frames. ], batch size: 98, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:55:20,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=381354.0, ans=0.125 2023-06-19 16:55:33,969 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.648e+02 3.147e+02 3.811e+02 6.835e+02, threshold=6.294e+02, percent-clipped=1.0 2023-06-19 16:55:37,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.52 vs. 
limit=15.0 2023-06-19 16:55:38,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=381414.0, ans=0.1 2023-06-19 16:56:11,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=381474.0, ans=0.125 2023-06-19 16:56:34,138 INFO [train.py:996] (2/4) Epoch 3, batch 2600, loss[loss=0.2706, simple_loss=0.33, pruned_loss=0.1056, over 21818.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3295, pruned_loss=0.1012, over 4259181.12 frames. ], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:56:36,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=381534.0, ans=0.0 2023-06-19 16:58:31,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=381774.0, ans=0.125 2023-06-19 16:58:35,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=381774.0, ans=0.125 2023-06-19 16:58:59,098 INFO [train.py:996] (2/4) Epoch 3, batch 2650, loss[loss=0.246, simple_loss=0.3131, pruned_loss=0.08946, over 21851.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.331, pruned_loss=0.1029, over 4272178.87 frames. ], batch size: 124, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:59:19,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=381894.0, ans=0.125 2023-06-19 16:59:26,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-19 16:59:41,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=381894.0, ans=22.5 2023-06-19 17:00:02,530 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.896e+02 3.459e+02 3.981e+02 6.985e+02, threshold=6.919e+02, percent-clipped=4.0 2023-06-19 17:00:26,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=382014.0, ans=0.0 2023-06-19 17:01:11,823 INFO [train.py:996] (2/4) Epoch 3, batch 2700, loss[loss=0.224, simple_loss=0.2785, pruned_loss=0.08472, over 21193.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3297, pruned_loss=0.1004, over 4269551.37 frames. ], batch size: 143, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:01:19,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=382134.0, ans=0.2 2023-06-19 17:01:35,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=382194.0, ans=0.0 2023-06-19 17:01:36,822 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:03:20,415 INFO [train.py:996] (2/4) Epoch 3, batch 2750, loss[loss=0.2888, simple_loss=0.3588, pruned_loss=0.1094, over 20827.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3286, pruned_loss=0.09993, over 4279293.01 frames. 
], batch size: 607, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:03:53,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=382494.0, ans=0.125 2023-06-19 17:04:17,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=382554.0, ans=10.0 2023-06-19 17:04:33,263 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.812e+02 3.458e+02 3.888e+02 7.269e+02, threshold=6.916e+02, percent-clipped=2.0 2023-06-19 17:04:46,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=382614.0, ans=0.125 2023-06-19 17:04:46,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=382614.0, ans=0.125 2023-06-19 17:05:21,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=382674.0, ans=0.125 2023-06-19 17:05:43,917 INFO [train.py:996] (2/4) Epoch 3, batch 2800, loss[loss=0.311, simple_loss=0.3654, pruned_loss=0.1283, over 21319.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3336, pruned_loss=0.1021, over 4275152.75 frames. ], batch size: 549, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:06:46,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=382854.0, ans=10.0 2023-06-19 17:07:51,277 INFO [train.py:996] (2/4) Epoch 3, batch 2850, loss[loss=0.2249, simple_loss=0.2873, pruned_loss=0.0812, over 21760.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3334, pruned_loss=0.1024, over 4266058.88 frames. ], batch size: 282, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:07:51,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=383034.0, ans=0.0 2023-06-19 17:08:57,674 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.123e+02 2.952e+02 3.438e+02 4.041e+02 6.558e+02, threshold=6.876e+02, percent-clipped=0.0 2023-06-19 17:09:01,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=383214.0, ans=0.125 2023-06-19 17:09:41,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=383274.0, ans=0.125 2023-06-19 17:09:43,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=383274.0, ans=0.125 2023-06-19 17:09:53,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=383274.0, ans=0.125 2023-06-19 17:09:56,214 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:10:03,354 INFO [train.py:996] (2/4) Epoch 3, batch 2900, loss[loss=0.2502, simple_loss=0.3106, pruned_loss=0.09489, over 21339.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3308, pruned_loss=0.1016, over 4274692.02 frames. 
], batch size: 176, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:11:36,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=383514.0, ans=0.0 2023-06-19 17:11:59,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=383574.0, ans=0.125 2023-06-19 17:11:59,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-19 17:12:17,736 INFO [train.py:996] (2/4) Epoch 3, batch 2950, loss[loss=0.2431, simple_loss=0.3287, pruned_loss=0.07878, over 21416.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.331, pruned_loss=0.1022, over 4272929.78 frames. ], batch size: 194, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:12:42,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383634.0, ans=0.1 2023-06-19 17:13:27,105 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.666e+02 3.329e+02 3.985e+02 6.298e+02, threshold=6.658e+02, percent-clipped=0.0 2023-06-19 17:13:52,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=383814.0, ans=0.04949747468305833 2023-06-19 17:14:18,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=383874.0, ans=0.125 2023-06-19 17:14:27,048 INFO [train.py:996] (2/4) Epoch 3, batch 3000, loss[loss=0.3054, simple_loss=0.369, pruned_loss=0.1208, over 21270.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3335, pruned_loss=0.1017, over 4270666.83 frames. ], batch size: 143, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:14:27,050 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 17:15:26,753 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2641, simple_loss=0.3582, pruned_loss=0.08497, over 1796401.00 frames. 2023-06-19 17:15:26,756 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 17:15:27,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=383934.0, ans=0.125 2023-06-19 17:16:37,270 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.22 vs. limit=22.5 2023-06-19 17:16:40,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=384114.0, ans=0.0 2023-06-19 17:17:09,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=384114.0, ans=0.125 2023-06-19 17:17:10,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384174.0, ans=0.1 2023-06-19 17:17:10,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=384174.0, ans=0.125 2023-06-19 17:17:32,336 INFO [train.py:996] (2/4) Epoch 3, batch 3050, loss[loss=0.2741, simple_loss=0.3622, pruned_loss=0.09306, over 21525.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3337, pruned_loss=0.09981, over 4266680.97 frames. 
], batch size: 471, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:18:02,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-19 17:18:36,309 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 2.612e+02 3.068e+02 3.884e+02 6.954e+02, threshold=6.136e+02, percent-clipped=1.0 2023-06-19 17:19:02,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=384474.0, ans=0.07 2023-06-19 17:19:25,763 INFO [train.py:996] (2/4) Epoch 3, batch 3100, loss[loss=0.3059, simple_loss=0.3532, pruned_loss=0.1293, over 21773.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3332, pruned_loss=0.0986, over 4273205.76 frames. ], batch size: 441, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:20:40,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384654.0, ans=0.1 2023-06-19 17:20:42,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=384714.0, ans=0.125 2023-06-19 17:21:13,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=384714.0, ans=0.0 2023-06-19 17:21:32,969 INFO [train.py:996] (2/4) Epoch 3, batch 3150, loss[loss=0.302, simple_loss=0.3614, pruned_loss=0.1212, over 21490.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3348, pruned_loss=0.09966, over 4274376.67 frames. ], batch size: 194, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:22:50,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=384954.0, ans=0.2 2023-06-19 17:22:54,647 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.666e+02 3.183e+02 4.146e+02 7.472e+02, threshold=6.366e+02, percent-clipped=3.0 2023-06-19 17:23:04,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-19 17:23:13,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=385014.0, ans=0.2 2023-06-19 17:23:22,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=385074.0, ans=0.125 2023-06-19 17:23:54,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=385074.0, ans=0.0 2023-06-19 17:24:00,455 INFO [train.py:996] (2/4) Epoch 3, batch 3200, loss[loss=0.2925, simple_loss=0.3653, pruned_loss=0.1099, over 21607.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3359, pruned_loss=0.1001, over 4275374.96 frames. ], batch size: 414, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:26:13,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=385434.0, ans=0.125 2023-06-19 17:26:14,501 INFO [train.py:996] (2/4) Epoch 3, batch 3250, loss[loss=0.3114, simple_loss=0.3544, pruned_loss=0.1342, over 21812.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3368, pruned_loss=0.1028, over 4279218.08 frames. 
], batch size: 441, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:26:36,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=385434.0, ans=0.2 2023-06-19 17:27:12,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-06-19 17:27:15,360 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.186e+02 3.700e+02 4.457e+02 5.967e+02, threshold=7.400e+02, percent-clipped=0.0 2023-06-19 17:27:40,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.82 vs. limit=15.0 2023-06-19 17:28:06,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=385674.0, ans=0.09899494936611666 2023-06-19 17:28:25,307 INFO [train.py:996] (2/4) Epoch 3, batch 3300, loss[loss=0.2505, simple_loss=0.3051, pruned_loss=0.09794, over 21435.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3334, pruned_loss=0.1024, over 4280525.06 frames. ], batch size: 389, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:28:53,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=385794.0, ans=0.1 2023-06-19 17:29:00,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=385794.0, ans=0.125 2023-06-19 17:30:48,679 INFO [train.py:996] (2/4) Epoch 3, batch 3350, loss[loss=0.2557, simple_loss=0.3465, pruned_loss=0.08239, over 21313.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3364, pruned_loss=0.1029, over 4276717.73 frames. ], batch size: 548, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:31:25,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=386094.0, ans=0.125 2023-06-19 17:32:02,750 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.950e+02 3.389e+02 4.289e+02 7.899e+02, threshold=6.778e+02, percent-clipped=1.0 2023-06-19 17:32:44,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=386274.0, ans=0.2 2023-06-19 17:33:05,258 INFO [train.py:996] (2/4) Epoch 3, batch 3400, loss[loss=0.2489, simple_loss=0.3083, pruned_loss=0.09473, over 21835.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3362, pruned_loss=0.1028, over 4281619.23 frames. ], batch size: 372, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:33:16,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=386334.0, ans=0.0 2023-06-19 17:33:35,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=386394.0, ans=0.2 2023-06-19 17:33:41,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=386454.0, ans=0.0 2023-06-19 17:33:48,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=386454.0, ans=0.0 2023-06-19 17:33:59,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.45 vs. 
limit=6.0 2023-06-19 17:34:21,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=386514.0, ans=0.1 2023-06-19 17:34:24,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=386574.0, ans=0.0 2023-06-19 17:35:07,131 INFO [train.py:996] (2/4) Epoch 3, batch 3450, loss[loss=0.2441, simple_loss=0.302, pruned_loss=0.09305, over 21845.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3322, pruned_loss=0.1019, over 4278670.37 frames. ], batch size: 107, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:35:12,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=386634.0, ans=0.2 2023-06-19 17:35:12,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=386634.0, ans=0.95 2023-06-19 17:35:40,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=386694.0, ans=0.0 2023-06-19 17:36:23,990 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.906e+02 3.291e+02 4.087e+02 6.835e+02, threshold=6.581e+02, percent-clipped=1.0 2023-06-19 17:36:33,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-06-19 17:37:12,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=386874.0, ans=0.0 2023-06-19 17:37:16,556 INFO [train.py:996] (2/4) Epoch 3, batch 3500, loss[loss=0.2863, simple_loss=0.3503, pruned_loss=0.1112, over 21968.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3437, pruned_loss=0.1071, over 4284431.49 frames. ], batch size: 317, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:38:56,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=387114.0, ans=0.125 2023-06-19 17:39:34,542 INFO [train.py:996] (2/4) Epoch 3, batch 3550, loss[loss=0.2531, simple_loss=0.3038, pruned_loss=0.1011, over 21303.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3452, pruned_loss=0.1076, over 4285736.63 frames. ], batch size: 160, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:40:22,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=387294.0, ans=0.125 2023-06-19 17:40:38,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.99 vs. limit=10.0 2023-06-19 17:40:51,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 2.929e+02 3.306e+02 4.196e+02 6.685e+02, threshold=6.611e+02, percent-clipped=1.0 2023-06-19 17:41:44,380 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.64 vs. limit=10.0 2023-06-19 17:41:53,747 INFO [train.py:996] (2/4) Epoch 3, batch 3600, loss[loss=0.3122, simple_loss=0.3817, pruned_loss=0.1213, over 21822.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3401, pruned_loss=0.1066, over 4279727.26 frames. 
], batch size: 124, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:42:04,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=387534.0, ans=0.0 2023-06-19 17:42:22,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-19 17:44:19,134 INFO [train.py:996] (2/4) Epoch 3, batch 3650, loss[loss=0.2464, simple_loss=0.3223, pruned_loss=0.08529, over 21715.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.34, pruned_loss=0.1066, over 4271736.05 frames. ], batch size: 298, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:44:55,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-19 17:45:31,523 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.846e+02 3.301e+02 4.049e+02 6.625e+02, threshold=6.601e+02, percent-clipped=2.0 2023-06-19 17:45:41,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-19 17:45:48,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2023-06-19 17:46:25,669 INFO [train.py:996] (2/4) Epoch 3, batch 3700, loss[loss=0.2599, simple_loss=0.3571, pruned_loss=0.08136, over 20916.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3375, pruned_loss=0.1045, over 4280396.93 frames. ], batch size: 608, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:46:52,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=388194.0, ans=0.125 2023-06-19 17:48:23,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=388374.0, ans=0.0 2023-06-19 17:48:26,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388374.0, ans=0.1 2023-06-19 17:48:35,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-19 17:48:56,823 INFO [train.py:996] (2/4) Epoch 3, batch 3750, loss[loss=0.2304, simple_loss=0.297, pruned_loss=0.08193, over 21751.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3357, pruned_loss=0.1034, over 4284012.15 frames. ], batch size: 247, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:49:01,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=388434.0, ans=0.125 2023-06-19 17:50:07,715 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 3.161e+02 3.639e+02 4.340e+02 7.555e+02, threshold=7.277e+02, percent-clipped=2.0 2023-06-19 17:50:12,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=388614.0, ans=0.1 2023-06-19 17:50:25,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. 
limit=15.0 2023-06-19 17:50:42,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=388674.0, ans=0.2 2023-06-19 17:50:56,767 INFO [train.py:996] (2/4) Epoch 3, batch 3800, loss[loss=0.2781, simple_loss=0.3434, pruned_loss=0.1064, over 21995.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3352, pruned_loss=0.1024, over 4289116.42 frames. ], batch size: 317, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:51:25,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=388734.0, ans=0.0 2023-06-19 17:51:59,944 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.94 vs. limit=15.0 2023-06-19 17:53:09,363 INFO [train.py:996] (2/4) Epoch 3, batch 3850, loss[loss=0.2299, simple_loss=0.2801, pruned_loss=0.08981, over 21628.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3308, pruned_loss=0.1017, over 4279570.72 frames. ], batch size: 247, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:53:33,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=389094.0, ans=0.2 2023-06-19 17:53:35,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=389094.0, ans=0.025 2023-06-19 17:54:12,932 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.931e+02 3.555e+02 4.316e+02 7.141e+02, threshold=7.110e+02, percent-clipped=0.0 2023-06-19 17:54:34,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=389214.0, ans=0.125 2023-06-19 17:54:35,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=389214.0, ans=0.2 2023-06-19 17:55:21,647 INFO [train.py:996] (2/4) Epoch 3, batch 3900, loss[loss=0.2829, simple_loss=0.3414, pruned_loss=0.1122, over 21727.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3274, pruned_loss=0.1013, over 4279753.79 frames. ], batch size: 389, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:56:04,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=389394.0, ans=0.125 2023-06-19 17:56:31,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=389454.0, ans=0.0 2023-06-19 17:56:31,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=389454.0, ans=0.125 2023-06-19 17:56:48,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389514.0, ans=0.1 2023-06-19 17:56:48,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. 
limit=10.0 2023-06-19 17:56:56,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=389514.0, ans=0.125 2023-06-19 17:57:03,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=389574.0, ans=0.02 2023-06-19 17:57:16,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389574.0, ans=0.1 2023-06-19 17:57:31,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=389634.0, ans=0.125 2023-06-19 17:57:32,064 INFO [train.py:996] (2/4) Epoch 3, batch 3950, loss[loss=0.2376, simple_loss=0.2942, pruned_loss=0.09046, over 21641.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3287, pruned_loss=0.1005, over 4288585.57 frames. ], batch size: 263, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:58:43,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389754.0, ans=0.1 2023-06-19 17:58:47,715 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.533e+02 2.923e+02 3.801e+02 7.027e+02, threshold=5.846e+02, percent-clipped=0.0 2023-06-19 17:58:50,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=12.0 2023-06-19 17:59:41,386 INFO [train.py:996] (2/4) Epoch 3, batch 4000, loss[loss=0.2137, simple_loss=0.2741, pruned_loss=0.07663, over 21632.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3211, pruned_loss=0.09699, over 4288162.49 frames. ], batch size: 282, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:59:45,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-19 18:01:12,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=390114.0, ans=0.125 2023-06-19 18:01:17,895 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-19 18:01:43,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=390234.0, ans=0.125 2023-06-19 18:01:44,589 INFO [train.py:996] (2/4) Epoch 3, batch 4050, loss[loss=0.2902, simple_loss=0.3437, pruned_loss=0.1184, over 21837.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3206, pruned_loss=0.09459, over 4285599.36 frames. 
], batch size: 124, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:01:44,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390234.0, ans=0.1 2023-06-19 18:01:57,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=390234.0, ans=0.0 2023-06-19 18:02:20,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=390234.0, ans=0.0 2023-06-19 18:02:49,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=390354.0, ans=0.125 2023-06-19 18:03:07,898 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.511e+02 2.852e+02 3.605e+02 5.912e+02, threshold=5.705e+02, percent-clipped=1.0 2023-06-19 18:03:16,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=390414.0, ans=0.2 2023-06-19 18:03:34,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=390414.0, ans=0.035 2023-06-19 18:03:59,019 INFO [train.py:996] (2/4) Epoch 3, batch 4100, loss[loss=0.2363, simple_loss=0.3114, pruned_loss=0.08062, over 21325.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3227, pruned_loss=0.09541, over 4293601.79 frames. ], batch size: 176, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:04:50,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=390654.0, ans=0.125 2023-06-19 18:04:55,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390654.0, ans=0.1 2023-06-19 18:05:33,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-19 18:06:02,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-19 18:06:25,552 INFO [train.py:996] (2/4) Epoch 3, batch 4150, loss[loss=0.215, simple_loss=0.292, pruned_loss=0.06895, over 21312.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3226, pruned_loss=0.09299, over 4284051.41 frames. ], batch size: 131, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:06:37,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=390834.0, ans=0.125 2023-06-19 18:07:30,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 2.414e+02 2.926e+02 3.479e+02 5.759e+02, threshold=5.851e+02, percent-clipped=1.0 2023-06-19 18:08:05,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=391074.0, ans=0.0 2023-06-19 18:08:12,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-19 18:08:37,044 INFO [train.py:996] (2/4) Epoch 3, batch 4200, loss[loss=0.2239, simple_loss=0.2946, pruned_loss=0.07659, over 21457.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3226, pruned_loss=0.09273, over 4277072.68 frames. 
], batch size: 195, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:09:34,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=391254.0, ans=0.0 2023-06-19 18:10:25,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=391374.0, ans=0.2 2023-06-19 18:10:59,595 INFO [train.py:996] (2/4) Epoch 3, batch 4250, loss[loss=0.255, simple_loss=0.349, pruned_loss=0.08056, over 20737.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3292, pruned_loss=0.09481, over 4274411.83 frames. ], batch size: 608, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:11:22,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=391494.0, ans=0.2 2023-06-19 18:12:15,436 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 3.056e+02 3.919e+02 5.866e+02 1.121e+03, threshold=7.838e+02, percent-clipped=25.0 2023-06-19 18:12:27,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.27 vs. limit=10.0 2023-06-19 18:12:35,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.94 vs. limit=22.5 2023-06-19 18:13:22,652 INFO [train.py:996] (2/4) Epoch 3, batch 4300, loss[loss=0.2829, simple_loss=0.3763, pruned_loss=0.09471, over 21657.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3335, pruned_loss=0.09618, over 4277674.04 frames. ], batch size: 414, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:14:12,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=391854.0, ans=0.0 2023-06-19 18:14:28,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=391854.0, ans=0.0 2023-06-19 18:14:30,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=391854.0, ans=0.125 2023-06-19 18:15:08,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=391914.0, ans=0.0 2023-06-19 18:15:11,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=391914.0, ans=0.0 2023-06-19 18:15:38,822 INFO [train.py:996] (2/4) Epoch 3, batch 4350, loss[loss=0.2387, simple_loss=0.2993, pruned_loss=0.08902, over 21611.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3338, pruned_loss=0.09571, over 4275264.90 frames. ], batch size: 298, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:15:51,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392034.0, ans=0.1 2023-06-19 18:16:28,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=392094.0, ans=0.125 2023-06-19 18:16:43,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.04 vs. 
limit=15.0 2023-06-19 18:16:44,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=392154.0, ans=0.125 2023-06-19 18:16:56,979 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.727e+02 3.088e+02 4.174e+02 7.759e+02, threshold=6.176e+02, percent-clipped=0.0 2023-06-19 18:17:15,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=392214.0, ans=0.0 2023-06-19 18:17:35,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=392274.0, ans=0.125 2023-06-19 18:17:39,771 INFO [train.py:996] (2/4) Epoch 3, batch 4400, loss[loss=0.2608, simple_loss=0.3441, pruned_loss=0.08877, over 21770.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3308, pruned_loss=0.09549, over 4261750.40 frames. ], batch size: 352, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:18:21,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=392394.0, ans=0.125 2023-06-19 18:19:53,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-19 18:19:54,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=392574.0, ans=0.125 2023-06-19 18:20:00,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=392574.0, ans=0.0 2023-06-19 18:20:02,750 INFO [train.py:996] (2/4) Epoch 3, batch 4450, loss[loss=0.2713, simple_loss=0.3544, pruned_loss=0.09405, over 21407.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3398, pruned_loss=0.09822, over 4267453.73 frames. ], batch size: 211, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:20:59,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392754.0, ans=0.1 2023-06-19 18:21:17,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.64 vs. limit=10.0 2023-06-19 18:21:24,216 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 2.794e+02 3.143e+02 3.836e+02 6.823e+02, threshold=6.286e+02, percent-clipped=3.0 2023-06-19 18:21:26,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392814.0, ans=0.1 2023-06-19 18:21:52,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.22 vs. limit=15.0 2023-06-19 18:22:00,548 INFO [train.py:996] (2/4) Epoch 3, batch 4500, loss[loss=0.2597, simple_loss=0.343, pruned_loss=0.08821, over 21733.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3412, pruned_loss=0.1004, over 4276192.48 frames. 
], batch size: 247, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:22:01,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392934.0, ans=0.1 2023-06-19 18:22:21,319 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:22:55,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=392994.0, ans=0.0 2023-06-19 18:23:03,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=392994.0, ans=0.125 2023-06-19 18:23:15,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=393054.0, ans=0.125 2023-06-19 18:23:59,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393174.0, ans=0.1 2023-06-19 18:23:59,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-19 18:24:04,026 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.41 vs. limit=6.0 2023-06-19 18:24:33,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=393234.0, ans=0.125 2023-06-19 18:24:34,829 INFO [train.py:996] (2/4) Epoch 3, batch 4550, loss[loss=0.2803, simple_loss=0.354, pruned_loss=0.1033, over 21866.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3452, pruned_loss=0.1018, over 4278321.65 frames. ], batch size: 282, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:25:16,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2023-06-19 18:25:53,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.924e+02 3.565e+02 4.370e+02 6.839e+02, threshold=7.130e+02, percent-clipped=5.0 2023-06-19 18:26:21,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-19 18:26:22,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=393474.0, ans=0.125 2023-06-19 18:26:54,822 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:26:57,138 INFO [train.py:996] (2/4) Epoch 3, batch 4600, loss[loss=0.2237, simple_loss=0.2985, pruned_loss=0.07449, over 21730.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3466, pruned_loss=0.1031, over 4279893.87 frames. ], batch size: 247, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:28:27,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=393714.0, ans=0.125 2023-06-19 18:29:14,435 INFO [train.py:996] (2/4) Epoch 3, batch 4650, loss[loss=0.2171, simple_loss=0.2867, pruned_loss=0.07371, over 21898.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3407, pruned_loss=0.1012, over 4280843.28 frames. 
], batch size: 118, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:29:36,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=393834.0, ans=10.0 2023-06-19 18:29:40,190 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-19 18:29:54,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=393954.0, ans=0.125 2023-06-19 18:30:21,160 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.410e+02 2.813e+02 3.483e+02 6.632e+02, threshold=5.627e+02, percent-clipped=0.0 2023-06-19 18:30:22,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=394014.0, ans=0.125 2023-06-19 18:30:50,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=394074.0, ans=0.2 2023-06-19 18:30:53,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=394074.0, ans=0.0 2023-06-19 18:31:17,313 INFO [train.py:996] (2/4) Epoch 3, batch 4700, loss[loss=0.2434, simple_loss=0.2914, pruned_loss=0.09767, over 21705.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3299, pruned_loss=0.09844, over 4285979.93 frames. ], batch size: 283, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:31:29,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=394134.0, ans=0.125 2023-06-19 18:33:35,278 INFO [train.py:996] (2/4) Epoch 3, batch 4750, loss[loss=0.2575, simple_loss=0.3107, pruned_loss=0.1021, over 21810.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.327, pruned_loss=0.09924, over 4286686.39 frames. ], batch size: 282, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:34:22,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=394554.0, ans=0.125 2023-06-19 18:34:23,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-19 18:34:40,261 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.729e+02 3.140e+02 4.884e+02 6.960e+02, threshold=6.280e+02, percent-clipped=14.0 2023-06-19 18:35:37,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=12.0 2023-06-19 18:35:53,971 INFO [train.py:996] (2/4) Epoch 3, batch 4800, loss[loss=0.2515, simple_loss=0.2998, pruned_loss=0.1016, over 20291.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3266, pruned_loss=0.09967, over 4278759.66 frames. ], batch size: 703, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:36:33,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=394854.0, ans=0.125 2023-06-19 18:37:09,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=394914.0, ans=0.125 2023-06-19 18:37:56,651 INFO [train.py:996] (2/4) Epoch 3, batch 4850, loss[loss=0.2533, simple_loss=0.3217, pruned_loss=0.09245, over 21884.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3262, pruned_loss=0.09765, over 4281095.35 frames. 
], batch size: 118, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:38:06,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=395034.0, ans=0.125 2023-06-19 18:38:42,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=395154.0, ans=0.0 2023-06-19 18:38:56,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=395154.0, ans=0.0 2023-06-19 18:38:57,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=395154.0, ans=0.5 2023-06-19 18:39:00,886 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 2.909e+02 3.496e+02 4.405e+02 5.702e+02, threshold=6.991e+02, percent-clipped=0.0 2023-06-19 18:40:00,491 INFO [train.py:996] (2/4) Epoch 3, batch 4900, loss[loss=0.3078, simple_loss=0.384, pruned_loss=0.1158, over 21648.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3276, pruned_loss=0.09851, over 4285910.24 frames. ], batch size: 389, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:40:15,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.67 vs. limit=10.0 2023-06-19 18:41:00,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=395454.0, ans=0.2 2023-06-19 18:41:01,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=395454.0, ans=0.0 2023-06-19 18:41:28,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=395514.0, ans=0.0 2023-06-19 18:42:02,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=395574.0, ans=0.1 2023-06-19 18:42:09,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=395574.0, ans=0.0 2023-06-19 18:42:21,346 INFO [train.py:996] (2/4) Epoch 3, batch 4950, loss[loss=0.2458, simple_loss=0.2775, pruned_loss=0.107, over 20038.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3291, pruned_loss=0.0958, over 4285426.80 frames. 
], batch size: 704, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:42:23,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=395634.0, ans=0.1 2023-06-19 18:42:48,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=395694.0, ans=0.1 2023-06-19 18:42:52,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=395694.0, ans=0.0 2023-06-19 18:42:59,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=395694.0, ans=0.125 2023-06-19 18:43:23,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=395754.0, ans=0.125 2023-06-19 18:43:36,182 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.438e+02 2.884e+02 3.463e+02 6.412e+02, threshold=5.767e+02, percent-clipped=0.0 2023-06-19 18:44:19,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-19 18:44:25,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=395874.0, ans=0.125 2023-06-19 18:44:30,723 INFO [train.py:996] (2/4) Epoch 3, batch 5000, loss[loss=0.2383, simple_loss=0.3181, pruned_loss=0.07921, over 21604.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3269, pruned_loss=0.09202, over 4279486.21 frames. ], batch size: 230, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:45:22,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396054.0, ans=0.1 2023-06-19 18:45:37,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=396054.0, ans=0.125 2023-06-19 18:46:06,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-19 18:46:39,130 INFO [train.py:996] (2/4) Epoch 3, batch 5050, loss[loss=0.2756, simple_loss=0.3523, pruned_loss=0.09948, over 21442.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3281, pruned_loss=0.09326, over 4271482.45 frames. ], batch size: 548, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:46:42,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=396234.0, ans=0.125 2023-06-19 18:47:46,312 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.997e+02 3.476e+02 4.224e+02 8.088e+02, threshold=6.952e+02, percent-clipped=5.0 2023-06-19 18:47:55,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=396414.0, ans=0.2 2023-06-19 18:48:26,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=396414.0, ans=10.0 2023-06-19 18:49:06,028 INFO [train.py:996] (2/4) Epoch 3, batch 5100, loss[loss=0.2888, simple_loss=0.3402, pruned_loss=0.1187, over 21787.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3251, pruned_loss=0.09318, over 4274437.38 frames. 
], batch size: 441, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:50:46,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=396774.0, ans=0.125 2023-06-19 18:51:09,500 INFO [train.py:996] (2/4) Epoch 3, batch 5150, loss[loss=0.3244, simple_loss=0.3821, pruned_loss=0.1333, over 21554.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3251, pruned_loss=0.09461, over 4283776.33 frames. ], batch size: 471, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:51:42,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=396894.0, ans=0.0 2023-06-19 18:51:45,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=396894.0, ans=0.0 2023-06-19 18:51:47,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=396894.0, ans=0.1 2023-06-19 18:51:47,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.94 vs. limit=15.0 2023-06-19 18:52:22,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=396954.0, ans=0.0 2023-06-19 18:52:27,684 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.779e+02 3.148e+02 3.948e+02 7.558e+02, threshold=6.295e+02, percent-clipped=3.0 2023-06-19 18:52:30,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-19 18:52:32,958 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-19 18:53:16,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=397074.0, ans=0.035 2023-06-19 18:53:35,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=397134.0, ans=0.125 2023-06-19 18:53:36,797 INFO [train.py:996] (2/4) Epoch 3, batch 5200, loss[loss=0.255, simple_loss=0.3437, pruned_loss=0.08316, over 21492.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3256, pruned_loss=0.09535, over 4288332.15 frames. ], batch size: 211, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:54:49,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-19 18:55:28,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.98 vs. limit=12.0 2023-06-19 18:55:41,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=397374.0, ans=0.125 2023-06-19 18:55:53,384 INFO [train.py:996] (2/4) Epoch 3, batch 5250, loss[loss=0.2249, simple_loss=0.303, pruned_loss=0.07341, over 21368.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3322, pruned_loss=0.09561, over 4285827.48 frames. 
], batch size: 176, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:56:21,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=397554.0, ans=0.125 2023-06-19 18:56:51,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=397554.0, ans=0.0 2023-06-19 18:56:53,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.637e+02 3.155e+02 4.003e+02 7.471e+02, threshold=6.309e+02, percent-clipped=2.0 2023-06-19 18:57:13,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-19 18:57:46,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=397674.0, ans=0.0 2023-06-19 18:57:56,753 INFO [train.py:996] (2/4) Epoch 3, batch 5300, loss[loss=0.2735, simple_loss=0.3323, pruned_loss=0.1074, over 21596.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3323, pruned_loss=0.0964, over 4286965.61 frames. ], batch size: 195, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 18:58:44,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=397794.0, ans=0.125 2023-06-19 18:58:47,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=397794.0, ans=0.07 2023-06-19 18:59:27,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=397914.0, ans=0.125 2023-06-19 18:59:47,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-19 18:59:57,894 INFO [train.py:996] (2/4) Epoch 3, batch 5350, loss[loss=0.2502, simple_loss=0.3156, pruned_loss=0.09237, over 21557.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3312, pruned_loss=0.09764, over 4294737.33 frames. ], batch size: 194, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:00:15,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=398034.0, ans=0.0 2023-06-19 19:00:19,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=398034.0, ans=0.125 2023-06-19 19:00:21,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=398034.0, ans=0.125 2023-06-19 19:01:15,594 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.703e+02 3.245e+02 4.023e+02 6.387e+02, threshold=6.490e+02, percent-clipped=1.0 2023-06-19 19:01:22,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=398214.0, ans=0.125 2023-06-19 19:01:57,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=398274.0, ans=0.0 2023-06-19 19:02:04,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.39 vs. limit=6.0 2023-06-19 19:02:24,841 INFO [train.py:996] (2/4) Epoch 3, batch 5400, loss[loss=0.2563, simple_loss=0.3165, pruned_loss=0.09804, over 21904.00 frames. 
], tot_loss[loss=0.2634, simple_loss=0.3295, pruned_loss=0.09862, over 4294582.53 frames. ], batch size: 351, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:03:12,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=398454.0, ans=0.125 2023-06-19 19:04:28,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=398574.0, ans=0.0 2023-06-19 19:04:42,283 INFO [train.py:996] (2/4) Epoch 3, batch 5450, loss[loss=0.372, simple_loss=0.4439, pruned_loss=0.1501, over 21517.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3292, pruned_loss=0.09611, over 4290179.42 frames. ], batch size: 507, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:04:49,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398634.0, ans=0.1 2023-06-19 19:05:07,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=398694.0, ans=0.07 2023-06-19 19:06:02,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398814.0, ans=0.1 2023-06-19 19:06:03,150 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 2.356e+02 2.919e+02 3.477e+02 6.016e+02, threshold=5.839e+02, percent-clipped=0.0 2023-06-19 19:06:47,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=398874.0, ans=0.035 2023-06-19 19:06:54,314 INFO [train.py:996] (2/4) Epoch 3, batch 5500, loss[loss=0.2853, simple_loss=0.3335, pruned_loss=0.1186, over 21604.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3321, pruned_loss=0.09206, over 4282953.35 frames. ], batch size: 548, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:07:18,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-06-19 19:08:31,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=399114.0, ans=0.125 2023-06-19 19:09:12,511 INFO [train.py:996] (2/4) Epoch 3, batch 5550, loss[loss=0.1459, simple_loss=0.1969, pruned_loss=0.04747, over 16022.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3287, pruned_loss=0.08858, over 4271537.76 frames. ], batch size: 60, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:09:52,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=399294.0, ans=0.2 2023-06-19 19:10:15,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399354.0, ans=0.1 2023-06-19 19:10:27,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=399354.0, ans=0.1 2023-06-19 19:10:34,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. 
limit=15.0 2023-06-19 19:10:42,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 2.351e+02 2.803e+02 3.299e+02 6.466e+02, threshold=5.606e+02, percent-clipped=1.0 2023-06-19 19:11:23,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=399474.0, ans=0.125 2023-06-19 19:11:25,033 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-19 19:11:52,835 INFO [train.py:996] (2/4) Epoch 3, batch 5600, loss[loss=0.2662, simple_loss=0.3475, pruned_loss=0.09243, over 20788.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3245, pruned_loss=0.08517, over 4277137.58 frames. ], batch size: 607, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 19:11:57,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=399534.0, ans=0.125 2023-06-19 19:12:41,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=399594.0, ans=0.0 2023-06-19 19:13:45,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=399774.0, ans=0.2 2023-06-19 19:13:49,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=399774.0, ans=0.0 2023-06-19 19:14:09,698 INFO [train.py:996] (2/4) Epoch 3, batch 5650, loss[loss=0.2614, simple_loss=0.3276, pruned_loss=0.09758, over 21789.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3316, pruned_loss=0.08914, over 4271382.19 frames. ], batch size: 112, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 19:15:28,359 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.620e+02 2.997e+02 3.705e+02 7.555e+02, threshold=5.994e+02, percent-clipped=4.0 2023-06-19 19:16:02,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=400074.0, ans=0.0 2023-06-19 19:16:18,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.14 vs. limit=22.5 2023-06-19 19:16:35,376 INFO [train.py:996] (2/4) Epoch 3, batch 5700, loss[loss=0.2581, simple_loss=0.3103, pruned_loss=0.1029, over 21257.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3308, pruned_loss=0.0903, over 4273189.11 frames. ], batch size: 608, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 19:16:52,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=400134.0, ans=0.125 2023-06-19 19:17:26,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=400194.0, ans=0.125 2023-06-19 19:17:28,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. 
limit=15.0 2023-06-19 19:18:01,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=400254.0, ans=0.125 2023-06-19 19:18:20,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=400314.0, ans=0.0 2023-06-19 19:18:50,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=400374.0, ans=0.125 2023-06-19 19:18:58,570 INFO [train.py:996] (2/4) Epoch 3, batch 5750, loss[loss=0.2405, simple_loss=0.3301, pruned_loss=0.07546, over 21606.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3279, pruned_loss=0.08779, over 4278311.29 frames. ], batch size: 389, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:20:11,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.423e+02 2.924e+02 3.462e+02 7.613e+02, threshold=5.849e+02, percent-clipped=6.0 2023-06-19 19:20:34,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-06-19 19:20:40,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=400614.0, ans=0.125 2023-06-19 19:21:04,745 INFO [train.py:996] (2/4) Epoch 3, batch 5800, loss[loss=0.2985, simple_loss=0.3869, pruned_loss=0.105, over 21668.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3279, pruned_loss=0.087, over 4275259.43 frames. ], batch size: 441, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:21:38,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-19 19:23:33,804 INFO [train.py:996] (2/4) Epoch 3, batch 5850, loss[loss=0.1947, simple_loss=0.2936, pruned_loss=0.04787, over 21693.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.323, pruned_loss=0.0821, over 4271743.09 frames. ], batch size: 247, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:24:04,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-19 19:25:07,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.976e+02 2.345e+02 2.902e+02 5.016e+02, threshold=4.690e+02, percent-clipped=0.0 2023-06-19 19:25:32,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=401274.0, ans=0.125 2023-06-19 19:25:40,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=401274.0, ans=0.125 2023-06-19 19:25:40,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=401274.0, ans=0.125 2023-06-19 19:25:47,461 INFO [train.py:996] (2/4) Epoch 3, batch 5900, loss[loss=0.2476, simple_loss=0.347, pruned_loss=0.07409, over 21153.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3166, pruned_loss=0.07669, over 4272128.20 frames. ], batch size: 548, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:27:13,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.25 vs. 
limit=15.0 2023-06-19 19:27:25,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.13 vs. limit=12.0 2023-06-19 19:28:00,302 INFO [train.py:996] (2/4) Epoch 3, batch 5950, loss[loss=0.2482, simple_loss=0.3041, pruned_loss=0.09617, over 21812.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3182, pruned_loss=0.08281, over 4275652.18 frames. ], batch size: 112, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:28:37,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.49 vs. limit=10.0 2023-06-19 19:29:06,743 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 2.649e+02 3.303e+02 4.502e+02 8.568e+02, threshold=6.607e+02, percent-clipped=21.0 2023-06-19 19:30:02,811 INFO [train.py:996] (2/4) Epoch 3, batch 6000, loss[loss=0.2397, simple_loss=0.3175, pruned_loss=0.08096, over 20020.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3148, pruned_loss=0.0861, over 4276282.61 frames. ], batch size: 702, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:30:02,811 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 19:30:52,520 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2725, simple_loss=0.3668, pruned_loss=0.0891, over 1796401.00 frames. 2023-06-19 19:30:52,520 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 19:30:55,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-19 19:31:01,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=401934.0, ans=0.125 2023-06-19 19:31:39,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=402054.0, ans=0.035 2023-06-19 19:31:39,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=402054.0, ans=0.125 2023-06-19 19:31:57,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=402114.0, ans=15.0 2023-06-19 19:32:35,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-19 19:32:45,299 INFO [train.py:996] (2/4) Epoch 3, batch 6050, loss[loss=0.2416, simple_loss=0.2943, pruned_loss=0.09446, over 21514.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3098, pruned_loss=0.08703, over 4270526.88 frames. 
], batch size: 391, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:33:27,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=402294.0, ans=10.0 2023-06-19 19:33:29,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=402294.0, ans=0.0 2023-06-19 19:33:29,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=402294.0, ans=0.125 2023-06-19 19:33:32,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=402294.0, ans=0.125 2023-06-19 19:33:54,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=402354.0, ans=10.0 2023-06-19 19:34:07,192 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.314e+02 2.772e+02 3.160e+02 4.254e+02, threshold=5.544e+02, percent-clipped=0.0 2023-06-19 19:34:26,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=402474.0, ans=0.0 2023-06-19 19:34:51,208 INFO [train.py:996] (2/4) Epoch 3, batch 6100, loss[loss=0.2735, simple_loss=0.3388, pruned_loss=0.1041, over 17163.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3069, pruned_loss=0.08496, over 4266694.48 frames. ], batch size: 60, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:35:28,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=402594.0, ans=0.0 2023-06-19 19:35:58,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402654.0, ans=0.1 2023-06-19 19:36:11,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=402714.0, ans=0.5 2023-06-19 19:36:26,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402714.0, ans=0.1 2023-06-19 19:37:00,836 INFO [train.py:996] (2/4) Epoch 3, batch 6150, loss[loss=0.2402, simple_loss=0.3114, pruned_loss=0.08446, over 21777.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3111, pruned_loss=0.0877, over 4270895.61 frames. ], batch size: 333, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:37:26,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.92 vs. limit=10.0 2023-06-19 19:37:28,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=402894.0, ans=0.125 2023-06-19 19:37:41,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=402894.0, ans=0.0 2023-06-19 19:38:07,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.84 vs. 
limit=15.0 2023-06-19 19:38:11,182 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.614e+02 3.017e+02 3.568e+02 5.916e+02, threshold=6.034e+02, percent-clipped=1.0 2023-06-19 19:38:17,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403014.0, ans=0.1 2023-06-19 19:38:45,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=403074.0, ans=0.2 2023-06-19 19:38:52,131 INFO [train.py:996] (2/4) Epoch 3, batch 6200, loss[loss=0.2675, simple_loss=0.3361, pruned_loss=0.09941, over 21767.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3124, pruned_loss=0.08666, over 4264521.51 frames. ], batch size: 247, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:39:01,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.58 vs. limit=15.0 2023-06-19 19:39:59,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=403254.0, ans=0.2 2023-06-19 19:40:15,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=403314.0, ans=0.125 2023-06-19 19:41:00,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=403374.0, ans=0.125 2023-06-19 19:41:12,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=403434.0, ans=0.125 2023-06-19 19:41:13,512 INFO [train.py:996] (2/4) Epoch 3, batch 6250, loss[loss=0.2261, simple_loss=0.3158, pruned_loss=0.06824, over 21374.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3174, pruned_loss=0.08638, over 4270151.23 frames. ], batch size: 211, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:41:40,841 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:41:42,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=403494.0, ans=0.2 2023-06-19 19:42:32,760 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.931e+02 3.643e+02 4.697e+02 7.748e+02, threshold=7.286e+02, percent-clipped=9.0 2023-06-19 19:42:50,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-19 19:43:33,699 INFO [train.py:996] (2/4) Epoch 3, batch 6300, loss[loss=0.2285, simple_loss=0.2774, pruned_loss=0.0898, over 20272.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3205, pruned_loss=0.0854, over 4273705.15 frames. 
], batch size: 703, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:43:35,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=403734.0, ans=0.2 2023-06-19 19:43:51,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=403734.0, ans=0.07 2023-06-19 19:43:53,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=403794.0, ans=0.125 2023-06-19 19:44:20,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=403854.0, ans=0.125 2023-06-19 19:44:52,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=403914.0, ans=0.125 2023-06-19 19:45:23,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=403974.0, ans=0.2 2023-06-19 19:45:26,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=403974.0, ans=0.125 2023-06-19 19:45:37,160 INFO [train.py:996] (2/4) Epoch 3, batch 6350, loss[loss=0.2902, simple_loss=0.3933, pruned_loss=0.09358, over 21295.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3258, pruned_loss=0.09043, over 4284119.00 frames. ], batch size: 548, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:46:20,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=404154.0, ans=0.0 2023-06-19 19:46:50,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.707e+02 3.149e+02 3.828e+02 5.758e+02, threshold=6.298e+02, percent-clipped=0.0 2023-06-19 19:47:44,507 INFO [train.py:996] (2/4) Epoch 3, batch 6400, loss[loss=0.3171, simple_loss=0.3748, pruned_loss=0.1297, over 21832.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3341, pruned_loss=0.09505, over 4276799.76 frames. ], batch size: 441, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:48:10,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=404394.0, ans=0.1 2023-06-19 19:48:16,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=404394.0, ans=0.125 2023-06-19 19:49:16,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=15.0 2023-06-19 19:49:24,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=404574.0, ans=0.1 2023-06-19 19:49:39,741 INFO [train.py:996] (2/4) Epoch 3, batch 6450, loss[loss=0.2319, simple_loss=0.3229, pruned_loss=0.0705, over 21854.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3377, pruned_loss=0.09497, over 4278682.49 frames. 
], batch size: 371, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:49:51,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=404634.0, ans=0.125 2023-06-19 19:50:31,000 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:50:42,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.507e+02 2.854e+02 3.745e+02 6.123e+02, threshold=5.708e+02, percent-clipped=0.0 2023-06-19 19:51:01,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=404874.0, ans=0.015 2023-06-19 19:51:36,641 INFO [train.py:996] (2/4) Epoch 3, batch 6500, loss[loss=0.2281, simple_loss=0.3003, pruned_loss=0.07797, over 21571.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.33, pruned_loss=0.09311, over 4275204.55 frames. ], batch size: 230, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:51:45,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=404934.0, ans=0.0 2023-06-19 19:52:46,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=405114.0, ans=0.0 2023-06-19 19:52:52,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=405114.0, ans=0.02 2023-06-19 19:52:55,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=405114.0, ans=0.125 2023-06-19 19:52:59,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=405114.0, ans=0.2 2023-06-19 19:53:27,024 INFO [train.py:996] (2/4) Epoch 3, batch 6550, loss[loss=0.3271, simple_loss=0.3704, pruned_loss=0.1419, over 21631.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3296, pruned_loss=0.09282, over 4281727.74 frames. ], batch size: 507, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:53:55,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=405294.0, ans=0.95 2023-06-19 19:54:42,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.670e+02 3.153e+02 4.224e+02 7.538e+02, threshold=6.306e+02, percent-clipped=8.0 2023-06-19 19:54:43,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.93 vs. limit=22.5 2023-06-19 19:55:22,193 INFO [train.py:996] (2/4) Epoch 3, batch 6600, loss[loss=0.2356, simple_loss=0.2868, pruned_loss=0.09218, over 21528.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3243, pruned_loss=0.0933, over 4284602.57 frames. ], batch size: 414, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:55:25,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=405534.0, ans=0.2 2023-06-19 19:56:39,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-19 19:57:21,373 INFO [train.py:996] (2/4) Epoch 3, batch 6650, loss[loss=0.2353, simple_loss=0.2949, pruned_loss=0.08785, over 21578.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3149, pruned_loss=0.09005, over 4274599.91 frames. 
], batch size: 391, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:57:28,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=405834.0, ans=0.2 2023-06-19 19:58:17,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.325e+02 2.612e+02 3.036e+02 4.161e+02, threshold=5.224e+02, percent-clipped=0.0 2023-06-19 19:58:24,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=406014.0, ans=0.2 2023-06-19 19:58:25,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-06-19 19:58:31,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=406014.0, ans=0.125 2023-06-19 19:58:49,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=406074.0, ans=0.2 2023-06-19 19:58:49,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0 2023-06-19 19:58:57,726 INFO [train.py:996] (2/4) Epoch 3, batch 6700, loss[loss=0.235, simple_loss=0.2956, pruned_loss=0.0872, over 21462.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3087, pruned_loss=0.08921, over 4279725.17 frames. ], batch size: 212, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:59:43,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=406254.0, ans=0.1 2023-06-19 19:59:43,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=406254.0, ans=0.125 2023-06-19 19:59:45,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-19 20:00:49,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=12.0 2023-06-19 20:00:51,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=406374.0, ans=0.125 2023-06-19 20:00:56,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=406374.0, ans=0.1 2023-06-19 20:00:58,773 INFO [train.py:996] (2/4) Epoch 3, batch 6750, loss[loss=0.2366, simple_loss=0.2925, pruned_loss=0.0903, over 21270.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3076, pruned_loss=0.09018, over 4270818.84 frames. 
], batch size: 176, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:01:48,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=406554.0, ans=0.1 2023-06-19 20:01:48,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=406554.0, ans=0.0 2023-06-19 20:01:58,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=406554.0, ans=0.125 2023-06-19 20:02:00,798 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.669e+02 3.031e+02 3.481e+02 8.147e+02, threshold=6.062e+02, percent-clipped=3.0 2023-06-19 20:02:35,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.26 vs. limit=10.0 2023-06-19 20:02:41,231 INFO [train.py:996] (2/4) Epoch 3, batch 6800, loss[loss=0.3011, simple_loss=0.3378, pruned_loss=0.1322, over 21574.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3101, pruned_loss=0.09305, over 4280593.19 frames. ], batch size: 473, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:02:43,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=406734.0, ans=0.125 2023-06-19 20:02:49,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=406734.0, ans=0.125 2023-06-19 20:03:12,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-19 20:03:44,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=406854.0, ans=0.2 2023-06-19 20:04:39,332 INFO [train.py:996] (2/4) Epoch 3, batch 6850, loss[loss=0.2724, simple_loss=0.3191, pruned_loss=0.1129, over 21348.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3112, pruned_loss=0.09448, over 4278075.27 frames. ], batch size: 176, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:05:01,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-19 20:05:53,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.761e+02 3.161e+02 3.726e+02 8.116e+02, threshold=6.323e+02, percent-clipped=2.0 2023-06-19 20:06:08,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=407214.0, ans=0.0 2023-06-19 20:06:47,900 INFO [train.py:996] (2/4) Epoch 3, batch 6900, loss[loss=0.2261, simple_loss=0.2986, pruned_loss=0.07682, over 21407.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3132, pruned_loss=0.09416, over 4280984.19 frames. ], batch size: 211, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:06:55,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=407334.0, ans=0.05 2023-06-19 20:06:57,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. 
limit=6.0 2023-06-19 20:07:47,805 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:07:54,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. limit=10.0 2023-06-19 20:08:07,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=407514.0, ans=0.125 2023-06-19 20:08:17,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=407514.0, ans=0.125 2023-06-19 20:08:32,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=407574.0, ans=0.04949747468305833 2023-06-19 20:08:36,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=407574.0, ans=0.125 2023-06-19 20:08:53,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=407574.0, ans=0.125 2023-06-19 20:08:57,084 INFO [train.py:996] (2/4) Epoch 3, batch 6950, loss[loss=0.2721, simple_loss=0.339, pruned_loss=0.1026, over 21359.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3141, pruned_loss=0.0908, over 4277814.39 frames. ], batch size: 159, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:09:50,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.60 vs. limit=10.0 2023-06-19 20:09:54,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=407754.0, ans=0.025 2023-06-19 20:10:12,700 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.540e+02 2.981e+02 3.670e+02 6.199e+02, threshold=5.963e+02, percent-clipped=0.0 2023-06-19 20:10:25,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=407814.0, ans=0.0 2023-06-19 20:10:44,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407874.0, ans=0.1 2023-06-19 20:10:59,037 INFO [train.py:996] (2/4) Epoch 3, batch 7000, loss[loss=0.2397, simple_loss=0.2917, pruned_loss=0.09383, over 21803.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3177, pruned_loss=0.09357, over 4272903.32 frames. ], batch size: 352, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:12:09,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=408054.0, ans=0.07 2023-06-19 20:12:32,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.42 vs. limit=22.5 2023-06-19 20:12:38,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=408174.0, ans=0.1 2023-06-19 20:12:47,634 INFO [train.py:996] (2/4) Epoch 3, batch 7050, loss[loss=0.2882, simple_loss=0.3699, pruned_loss=0.1032, over 19973.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3158, pruned_loss=0.09195, over 4276967.60 frames. 
], batch size: 702, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:14:15,672 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.619e+02 2.979e+02 3.619e+02 9.670e+02, threshold=5.957e+02, percent-clipped=3.0 2023-06-19 20:14:23,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=408414.0, ans=0.125 2023-06-19 20:14:59,127 INFO [train.py:996] (2/4) Epoch 3, batch 7100, loss[loss=0.3317, simple_loss=0.3828, pruned_loss=0.1402, over 21357.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3198, pruned_loss=0.09375, over 4274333.03 frames. ], batch size: 507, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:15:19,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-19 20:15:35,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-19 20:15:44,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=408594.0, ans=10.0 2023-06-19 20:15:44,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=408594.0, ans=0.125 2023-06-19 20:15:53,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=408654.0, ans=0.125 2023-06-19 20:16:28,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=408714.0, ans=0.0 2023-06-19 20:16:48,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=408774.0, ans=0.0 2023-06-19 20:16:49,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=408774.0, ans=0.5 2023-06-19 20:16:59,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=408774.0, ans=10.0 2023-06-19 20:16:59,603 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-19 20:17:16,297 INFO [train.py:996] (2/4) Epoch 3, batch 7150, loss[loss=0.2814, simple_loss=0.3489, pruned_loss=0.107, over 21597.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3173, pruned_loss=0.09191, over 4278749.69 frames. ], batch size: 389, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:17:24,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=408834.0, ans=0.02 2023-06-19 20:18:07,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.54 vs. 
limit=15.0 2023-06-19 20:18:32,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.361e+02 2.956e+02 3.378e+02 5.911e+02, threshold=5.912e+02, percent-clipped=0.0 2023-06-19 20:18:32,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=409014.0, ans=0.015 2023-06-19 20:18:46,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=409074.0, ans=0.125 2023-06-19 20:19:23,242 INFO [train.py:996] (2/4) Epoch 3, batch 7200, loss[loss=0.2333, simple_loss=0.297, pruned_loss=0.08479, over 21695.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3202, pruned_loss=0.09448, over 4276593.29 frames. ], batch size: 282, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:20:37,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.15 vs. limit=22.5 2023-06-19 20:21:31,195 INFO [train.py:996] (2/4) Epoch 3, batch 7250, loss[loss=0.2718, simple_loss=0.3091, pruned_loss=0.1173, over 21311.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3157, pruned_loss=0.09458, over 4275941.20 frames. ], batch size: 473, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:22:43,335 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.618e+02 2.951e+02 3.716e+02 8.405e+02, threshold=5.903e+02, percent-clipped=1.0 2023-06-19 20:22:52,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=409614.0, ans=0.125 2023-06-19 20:23:23,303 INFO [train.py:996] (2/4) Epoch 3, batch 7300, loss[loss=0.2588, simple_loss=0.3063, pruned_loss=0.1057, over 21517.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3094, pruned_loss=0.09321, over 4276220.03 frames. ], batch size: 442, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:25:18,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=409974.0, ans=0.1 2023-06-19 20:25:27,393 INFO [train.py:996] (2/4) Epoch 3, batch 7350, loss[loss=0.2712, simple_loss=0.3262, pruned_loss=0.1081, over 21546.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3082, pruned_loss=0.09381, over 4272352.82 frames. ], batch size: 230, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:25:43,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=410034.0, ans=0.125 2023-06-19 20:26:46,544 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.697e+02 3.166e+02 3.579e+02 5.616e+02, threshold=6.332e+02, percent-clipped=0.0 2023-06-19 20:27:20,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=410274.0, ans=0.125 2023-06-19 20:27:34,129 INFO [train.py:996] (2/4) Epoch 3, batch 7400, loss[loss=0.2361, simple_loss=0.3169, pruned_loss=0.07769, over 21675.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3159, pruned_loss=0.09694, over 4278154.61 frames. 
], batch size: 298, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:27:34,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=410334.0, ans=0.2 2023-06-19 20:29:27,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=410574.0, ans=0.0 2023-06-19 20:29:32,617 INFO [train.py:996] (2/4) Epoch 3, batch 7450, loss[loss=0.2594, simple_loss=0.3152, pruned_loss=0.1018, over 21811.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3143, pruned_loss=0.09518, over 4282307.01 frames. ], batch size: 352, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:29:33,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=410634.0, ans=0.125 2023-06-19 20:29:45,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=410634.0, ans=0.2 2023-06-19 20:29:59,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-19 20:30:09,156 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:30:41,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=410754.0, ans=0.125 2023-06-19 20:30:42,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=410754.0, ans=0.0 2023-06-19 20:30:55,142 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.751e+02 3.404e+02 4.260e+02 8.554e+02, threshold=6.809e+02, percent-clipped=4.0 2023-06-19 20:31:31,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-06-19 20:31:39,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-19 20:31:51,009 INFO [train.py:996] (2/4) Epoch 3, batch 7500, loss[loss=0.2782, simple_loss=0.371, pruned_loss=0.09268, over 21901.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3193, pruned_loss=0.09665, over 4279523.12 frames. ], batch size: 317, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:32:05,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=410994.0, ans=10.0 2023-06-19 20:32:05,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.37 vs. limit=15.0 2023-06-19 20:33:47,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=411174.0, ans=0.125 2023-06-19 20:33:53,281 INFO [train.py:996] (2/4) Epoch 3, batch 7550, loss[loss=0.234, simple_loss=0.3157, pruned_loss=0.07612, over 21801.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.326, pruned_loss=0.09465, over 4281455.41 frames. 
], batch size: 118, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:34:47,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=411354.0, ans=0.125 2023-06-19 20:34:54,094 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.594e+02 3.226e+02 4.168e+02 6.750e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-19 20:35:39,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=411474.0, ans=0.2 2023-06-19 20:35:45,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=411474.0, ans=0.125 2023-06-19 20:35:48,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=411474.0, ans=0.0 2023-06-19 20:35:49,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=411474.0, ans=0.125 2023-06-19 20:35:53,977 INFO [train.py:996] (2/4) Epoch 3, batch 7600, loss[loss=0.2336, simple_loss=0.3068, pruned_loss=0.08017, over 21447.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3249, pruned_loss=0.09413, over 4280713.83 frames. ], batch size: 211, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:36:31,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=411654.0, ans=0.1 2023-06-19 20:37:17,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=411774.0, ans=0.125 2023-06-19 20:37:39,156 INFO [train.py:996] (2/4) Epoch 3, batch 7650, loss[loss=0.2473, simple_loss=0.3062, pruned_loss=0.09426, over 21783.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3242, pruned_loss=0.09574, over 4286963.40 frames. ], batch size: 247, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:38:31,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-19 20:38:44,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.91 vs. limit=10.0 2023-06-19 20:38:54,172 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.672e+02 3.022e+02 3.544e+02 5.089e+02, threshold=6.045e+02, percent-clipped=0.0 2023-06-19 20:39:42,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412074.0, ans=0.1 2023-06-19 20:39:48,294 INFO [train.py:996] (2/4) Epoch 3, batch 7700, loss[loss=0.3284, simple_loss=0.3927, pruned_loss=0.132, over 21806.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3282, pruned_loss=0.09882, over 4285176.38 frames. ], batch size: 118, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:39:52,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.00 vs. 
limit=22.5 2023-06-19 20:39:54,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=412134.0, ans=0.0 2023-06-19 20:40:16,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=412194.0, ans=0.2 2023-06-19 20:40:16,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412194.0, ans=0.1 2023-06-19 20:40:53,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=412254.0, ans=0.0 2023-06-19 20:40:56,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=412254.0, ans=0.125 2023-06-19 20:41:30,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412314.0, ans=0.1 2023-06-19 20:41:47,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412374.0, ans=0.1 2023-06-19 20:42:14,287 INFO [train.py:996] (2/4) Epoch 3, batch 7750, loss[loss=0.3055, simple_loss=0.3965, pruned_loss=0.1073, over 21790.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3336, pruned_loss=0.09896, over 4271807.75 frames. ], batch size: 282, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:42:37,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2023-06-19 20:42:42,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=412494.0, ans=0.125 2023-06-19 20:43:25,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 3.175e+02 3.701e+02 4.532e+02 8.848e+02, threshold=7.402e+02, percent-clipped=6.0 2023-06-19 20:43:49,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.84 vs. limit=6.0 2023-06-19 20:44:17,431 INFO [train.py:996] (2/4) Epoch 3, batch 7800, loss[loss=0.2969, simple_loss=0.3644, pruned_loss=0.1147, over 21563.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3322, pruned_loss=0.09889, over 4267043.93 frames. ], batch size: 441, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:44:32,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412734.0, ans=0.1 2023-06-19 20:44:45,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=412794.0, ans=0.125 2023-06-19 20:45:26,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=412914.0, ans=0.125 2023-06-19 20:45:32,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=412914.0, ans=0.125 2023-06-19 20:45:46,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.62 vs. limit=22.5 2023-06-19 20:46:01,769 INFO [train.py:996] (2/4) Epoch 3, batch 7850, loss[loss=0.2695, simple_loss=0.315, pruned_loss=0.112, over 19974.00 frames. 
], tot_loss[loss=0.2619, simple_loss=0.326, pruned_loss=0.09893, over 4257150.89 frames. ], batch size: 703, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:46:32,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=413094.0, ans=0.125 2023-06-19 20:46:57,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=413154.0, ans=0.1 2023-06-19 20:46:58,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=413154.0, ans=0.125 2023-06-19 20:47:05,855 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.925e+02 3.666e+02 4.492e+02 8.258e+02, threshold=7.332e+02, percent-clipped=3.0 2023-06-19 20:47:10,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=413214.0, ans=0.125 2023-06-19 20:47:31,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=413274.0, ans=0.125 2023-06-19 20:48:04,377 INFO [train.py:996] (2/4) Epoch 3, batch 7900, loss[loss=0.2128, simple_loss=0.2903, pruned_loss=0.0677, over 21601.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3216, pruned_loss=0.09772, over 4258711.43 frames. ], batch size: 230, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:48:41,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=413394.0, ans=0.0 2023-06-19 20:48:41,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=413394.0, ans=0.125 2023-06-19 20:48:44,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=413394.0, ans=0.125 2023-06-19 20:48:58,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.86 vs. limit=22.5 2023-06-19 20:49:25,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-19 20:49:27,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=413514.0, ans=0.125 2023-06-19 20:49:29,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=413514.0, ans=0.0 2023-06-19 20:49:30,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=413514.0, ans=0.0 2023-06-19 20:50:12,451 INFO [train.py:996] (2/4) Epoch 3, batch 7950, loss[loss=0.2541, simple_loss=0.3472, pruned_loss=0.08046, over 21665.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3273, pruned_loss=0.09701, over 4260683.17 frames. 
], batch size: 389, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 20:50:16,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=413634.0, ans=0.125 2023-06-19 20:50:29,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=413634.0, ans=0.125 2023-06-19 20:50:57,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=413694.0, ans=0.125 2023-06-19 20:51:06,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=413754.0, ans=0.125 2023-06-19 20:51:18,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.993e+02 3.629e+02 4.291e+02 8.050e+02, threshold=7.259e+02, percent-clipped=3.0 2023-06-19 20:51:34,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=413814.0, ans=0.125 2023-06-19 20:51:54,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=413874.0, ans=0.125 2023-06-19 20:52:18,829 INFO [train.py:996] (2/4) Epoch 3, batch 8000, loss[loss=0.3421, simple_loss=0.3925, pruned_loss=0.1458, over 21435.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3327, pruned_loss=0.09965, over 4258174.16 frames. ], batch size: 471, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:53:02,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=414054.0, ans=15.0 2023-06-19 20:54:25,795 INFO [train.py:996] (2/4) Epoch 3, batch 8050, loss[loss=0.1924, simple_loss=0.2409, pruned_loss=0.07194, over 21869.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3352, pruned_loss=0.09862, over 4256845.67 frames. ], batch size: 107, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:54:36,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=414234.0, ans=10.0 2023-06-19 20:55:40,679 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.904e+02 3.441e+02 4.463e+02 1.081e+03, threshold=6.883e+02, percent-clipped=4.0 2023-06-19 20:56:18,527 INFO [train.py:996] (2/4) Epoch 3, batch 8100, loss[loss=0.2657, simple_loss=0.3288, pruned_loss=0.1013, over 21880.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3339, pruned_loss=0.0992, over 4265679.58 frames. ], batch size: 124, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:57:23,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=414654.0, ans=0.125 2023-06-19 20:57:59,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=414714.0, ans=0.5 2023-06-19 20:58:28,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=414774.0, ans=0.0 2023-06-19 20:58:47,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=414774.0, ans=0.2 2023-06-19 20:58:51,124 INFO [train.py:996] (2/4) Epoch 3, batch 8150, loss[loss=0.3039, simple_loss=0.4031, pruned_loss=0.1024, over 21555.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3435, pruned_loss=0.1017, over 4266113.80 frames. 
], batch size: 441, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 20:58:53,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-19 20:59:00,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=414834.0, ans=0.1 2023-06-19 20:59:16,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=414894.0, ans=0.125 2023-06-19 21:00:01,454 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.727e+02 3.206e+02 3.958e+02 8.712e+02, threshold=6.412e+02, percent-clipped=8.0 2023-06-19 21:00:28,971 INFO [train.py:996] (2/4) Epoch 3, batch 8200, loss[loss=0.1864, simple_loss=0.2413, pruned_loss=0.06577, over 16038.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3341, pruned_loss=0.09798, over 4259107.37 frames. ], batch size: 63, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:00:44,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415134.0, ans=0.1 2023-06-19 21:01:38,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=415254.0, ans=0.0 2023-06-19 21:02:26,219 INFO [train.py:996] (2/4) Epoch 3, batch 8250, loss[loss=0.2558, simple_loss=0.3299, pruned_loss=0.09084, over 21441.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3318, pruned_loss=0.09665, over 4256934.90 frames. ], batch size: 194, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:02:28,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=415434.0, ans=0.2 2023-06-19 21:02:47,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=415494.0, ans=0.2 2023-06-19 21:02:55,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415494.0, ans=0.1 2023-06-19 21:03:12,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415494.0, ans=0.1 2023-06-19 21:03:23,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.80 vs. limit=6.0 2023-06-19 21:03:47,723 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.708e+02 3.103e+02 3.539e+02 5.585e+02, threshold=6.206e+02, percent-clipped=0.0 2023-06-19 21:04:29,217 INFO [train.py:996] (2/4) Epoch 3, batch 8300, loss[loss=0.2458, simple_loss=0.3314, pruned_loss=0.08013, over 21728.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3296, pruned_loss=0.09283, over 4266866.25 frames. 
], batch size: 332, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:04:54,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=415794.0, ans=0.125 2023-06-19 21:05:26,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=415854.0, ans=0.0 2023-06-19 21:05:55,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=415974.0, ans=0.2 2023-06-19 21:06:12,433 INFO [train.py:996] (2/4) Epoch 3, batch 8350, loss[loss=0.24, simple_loss=0.3253, pruned_loss=0.07735, over 21401.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3293, pruned_loss=0.09072, over 4268516.08 frames. ], batch size: 211, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:07:05,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=416094.0, ans=22.5 2023-06-19 21:07:23,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=416214.0, ans=0.125 2023-06-19 21:07:24,673 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.460e+02 2.877e+02 3.560e+02 6.454e+02, threshold=5.755e+02, percent-clipped=1.0 2023-06-19 21:08:10,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=416274.0, ans=0.125 2023-06-19 21:08:24,292 INFO [train.py:996] (2/4) Epoch 3, batch 8400, loss[loss=0.2345, simple_loss=0.3248, pruned_loss=0.07206, over 21216.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3254, pruned_loss=0.08802, over 4272352.07 frames. ], batch size: 548, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:08:48,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-19 21:08:51,447 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.29 vs. limit=6.0 2023-06-19 21:08:59,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=416394.0, ans=0.2 2023-06-19 21:09:36,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=416514.0, ans=0.1 2023-06-19 21:09:55,446 INFO [train.py:996] (2/4) Epoch 3, batch 8450, loss[loss=0.2519, simple_loss=0.3123, pruned_loss=0.09579, over 21805.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3228, pruned_loss=0.08794, over 4278398.41 frames. 
], batch size: 298, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:10:11,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=416634.0, ans=0.125 2023-06-19 21:10:22,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416694.0, ans=0.1 2023-06-19 21:11:05,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=416814.0, ans=0.125 2023-06-19 21:11:06,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.528e+02 3.268e+02 4.022e+02 7.297e+02, threshold=6.535e+02, percent-clipped=4.0 2023-06-19 21:11:34,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=416874.0, ans=0.125 2023-06-19 21:11:53,073 INFO [train.py:996] (2/4) Epoch 3, batch 8500, loss[loss=0.3169, simple_loss=0.4296, pruned_loss=0.1021, over 20845.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3204, pruned_loss=0.09023, over 4279932.13 frames. ], batch size: 607, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:11:59,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=416934.0, ans=0.125 2023-06-19 21:12:15,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=416994.0, ans=0.035 2023-06-19 21:12:16,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=416994.0, ans=0.0 2023-06-19 21:12:41,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=417054.0, ans=0.125 2023-06-19 21:13:27,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=417174.0, ans=0.125 2023-06-19 21:13:58,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=417234.0, ans=0.125 2023-06-19 21:13:58,920 INFO [train.py:996] (2/4) Epoch 3, batch 8550, loss[loss=0.3038, simple_loss=0.3812, pruned_loss=0.1132, over 21243.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3265, pruned_loss=0.09439, over 4274551.11 frames. 
], batch size: 548, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:14:03,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=417234.0, ans=0.0 2023-06-19 21:14:39,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=417294.0, ans=0.0 2023-06-19 21:15:01,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=417354.0, ans=0.025 2023-06-19 21:15:14,153 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.754e+02 3.232e+02 3.747e+02 6.984e+02, threshold=6.464e+02, percent-clipped=1.0 2023-06-19 21:15:24,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=417414.0, ans=0.0 2023-06-19 21:16:03,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=417474.0, ans=0.1 2023-06-19 21:16:04,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=417474.0, ans=0.125 2023-06-19 21:16:08,813 INFO [train.py:996] (2/4) Epoch 3, batch 8600, loss[loss=0.2322, simple_loss=0.3123, pruned_loss=0.0761, over 21396.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3338, pruned_loss=0.09624, over 4277877.93 frames. ], batch size: 211, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:16:14,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=417534.0, ans=0.125 2023-06-19 21:16:41,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-19 21:16:42,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=417594.0, ans=0.125 2023-06-19 21:16:59,237 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:17:02,596 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-19 21:17:31,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=417714.0, ans=0.125 2023-06-19 21:18:02,544 INFO [train.py:996] (2/4) Epoch 3, batch 8650, loss[loss=0.1652, simple_loss=0.2504, pruned_loss=0.03994, over 21289.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.338, pruned_loss=0.09776, over 4271267.93 frames. ], batch size: 194, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:19:20,786 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 2.624e+02 3.051e+02 3.895e+02 8.480e+02, threshold=6.103e+02, percent-clipped=4.0 2023-06-19 21:19:30,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=418014.0, ans=0.125 2023-06-19 21:19:52,982 INFO [train.py:996] (2/4) Epoch 3, batch 8700, loss[loss=0.2546, simple_loss=0.3123, pruned_loss=0.09842, over 21443.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3278, pruned_loss=0.09385, over 4261351.72 frames. 
], batch size: 389, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:20:10,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=418134.0, ans=0.1 2023-06-19 21:20:22,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=418194.0, ans=0.0 2023-06-19 21:21:13,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=418314.0, ans=0.125 2023-06-19 21:22:05,619 INFO [train.py:996] (2/4) Epoch 3, batch 8750, loss[loss=0.2572, simple_loss=0.3208, pruned_loss=0.09678, over 21896.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3248, pruned_loss=0.09456, over 4264047.02 frames. ], batch size: 333, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:22:29,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=418494.0, ans=0.125 2023-06-19 21:22:35,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418554.0, ans=0.1 2023-06-19 21:22:59,180 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.653e+02 3.166e+02 3.937e+02 6.791e+02, threshold=6.332e+02, percent-clipped=3.0 2023-06-19 21:23:14,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=418614.0, ans=0.0 2023-06-19 21:23:17,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=418674.0, ans=0.125 2023-06-19 21:23:40,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=418674.0, ans=0.125 2023-06-19 21:23:43,287 INFO [train.py:996] (2/4) Epoch 3, batch 8800, loss[loss=0.2814, simple_loss=0.3592, pruned_loss=0.1018, over 21524.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3333, pruned_loss=0.0978, over 4272240.81 frames. ], batch size: 194, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:23:49,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=418734.0, ans=10.0 2023-06-19 21:24:28,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=418854.0, ans=0.0 2023-06-19 21:24:37,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=418854.0, ans=0.125 2023-06-19 21:25:04,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=418914.0, ans=0.125 2023-06-19 21:25:11,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0 2023-06-19 21:25:29,648 INFO [train.py:996] (2/4) Epoch 3, batch 8850, loss[loss=0.2417, simple_loss=0.3229, pruned_loss=0.08026, over 21574.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3408, pruned_loss=0.1002, over 4274269.82 frames. 
], batch size: 263, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:26:43,019 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.885e+02 3.451e+02 3.948e+02 6.880e+02, threshold=6.902e+02, percent-clipped=2.0 2023-06-19 21:26:57,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=419214.0, ans=0.0 2023-06-19 21:26:59,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=419274.0, ans=0.125 2023-06-19 21:27:12,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.19 vs. limit=15.0 2023-06-19 21:27:20,928 INFO [train.py:996] (2/4) Epoch 3, batch 8900, loss[loss=0.2353, simple_loss=0.295, pruned_loss=0.08778, over 21406.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3348, pruned_loss=0.09917, over 4279058.12 frames. ], batch size: 131, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:27:30,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=419334.0, ans=0.2 2023-06-19 21:27:57,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=419454.0, ans=0.125 2023-06-19 21:28:38,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=419514.0, ans=0.2 2023-06-19 21:28:47,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=419514.0, ans=0.2 2023-06-19 21:28:53,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=419574.0, ans=0.2 2023-06-19 21:28:58,667 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:29:05,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=419574.0, ans=0.125 2023-06-19 21:29:23,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=419574.0, ans=0.04949747468305833 2023-06-19 21:29:26,017 INFO [train.py:996] (2/4) Epoch 3, batch 8950, loss[loss=0.2287, simple_loss=0.2966, pruned_loss=0.08042, over 21459.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3305, pruned_loss=0.09634, over 4267699.59 frames. 
], batch size: 194, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:29:26,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=419634.0, ans=0.125 2023-06-19 21:29:27,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=419634.0, ans=0.0 2023-06-19 21:29:50,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=419694.0, ans=10.0 2023-06-19 21:30:46,386 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.693e+02 3.248e+02 3.972e+02 8.193e+02, threshold=6.496e+02, percent-clipped=3.0 2023-06-19 21:30:57,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=419874.0, ans=0.04949747468305833 2023-06-19 21:30:59,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=419874.0, ans=0.0 2023-06-19 21:31:13,411 INFO [train.py:996] (2/4) Epoch 3, batch 9000, loss[loss=0.2594, simple_loss=0.3085, pruned_loss=0.1052, over 21757.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3251, pruned_loss=0.09639, over 4266248.10 frames. ], batch size: 351, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:31:13,411 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 21:31:59,490 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2739, simple_loss=0.372, pruned_loss=0.08794, over 1796401.00 frames. 2023-06-19 21:31:59,492 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 21:32:37,153 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:33:24,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=420174.0, ans=15.0 2023-06-19 21:33:49,577 INFO [train.py:996] (2/4) Epoch 3, batch 9050, loss[loss=0.3577, simple_loss=0.3955, pruned_loss=0.16, over 21334.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3222, pruned_loss=0.09361, over 4267355.56 frames. ], batch size: 507, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:34:16,190 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-19 21:34:25,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=420234.0, ans=0.0 2023-06-19 21:34:43,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=420294.0, ans=0.1 2023-06-19 21:34:52,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=420354.0, ans=0.125 2023-06-19 21:35:01,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=22.5 2023-06-19 21:35:09,277 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.569e+02 2.874e+02 3.466e+02 6.244e+02, threshold=5.748e+02, percent-clipped=0.0 2023-06-19 21:35:52,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. 
limit=6.0 2023-06-19 21:35:55,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-19 21:36:03,323 INFO [train.py:996] (2/4) Epoch 3, batch 9100, loss[loss=0.2861, simple_loss=0.3528, pruned_loss=0.1097, over 21205.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3291, pruned_loss=0.09656, over 4264211.21 frames. ], batch size: 143, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:36:10,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=420534.0, ans=0.0 2023-06-19 21:37:09,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.50 vs. limit=10.0 2023-06-19 21:37:23,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.80 vs. limit=22.5 2023-06-19 21:37:53,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=420774.0, ans=0.0 2023-06-19 21:38:16,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=420834.0, ans=0.125 2023-06-19 21:38:17,235 INFO [train.py:996] (2/4) Epoch 3, batch 9150, loss[loss=0.2507, simple_loss=0.3249, pruned_loss=0.08827, over 21442.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3327, pruned_loss=0.09419, over 4264462.58 frames. ], batch size: 131, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:39:09,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=15.0 2023-06-19 21:39:46,701 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.484e+02 2.819e+02 3.300e+02 4.761e+02, threshold=5.639e+02, percent-clipped=0.0 2023-06-19 21:40:30,504 INFO [train.py:996] (2/4) Epoch 3, batch 9200, loss[loss=0.2913, simple_loss=0.362, pruned_loss=0.1102, over 21873.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3363, pruned_loss=0.09406, over 4264876.25 frames. ], batch size: 371, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:40:32,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=421134.0, ans=0.0 2023-06-19 21:41:33,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-19 21:41:40,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=421254.0, ans=0.0 2023-06-19 21:42:21,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=421374.0, ans=0.0 2023-06-19 21:42:36,623 INFO [train.py:996] (2/4) Epoch 3, batch 9250, loss[loss=0.2788, simple_loss=0.355, pruned_loss=0.1013, over 21428.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.34, pruned_loss=0.09815, over 4266364.36 frames. ], batch size: 131, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:42:38,998 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:42:45,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. 
limit=15.0 2023-06-19 21:43:03,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=421494.0, ans=0.95 2023-06-19 21:43:09,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.21 vs. limit=5.0 2023-06-19 21:43:30,541 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.625e+02 3.038e+02 3.484e+02 5.422e+02, threshold=6.077e+02, percent-clipped=0.0 2023-06-19 21:43:30,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=421614.0, ans=0.125 2023-06-19 21:44:05,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=421674.0, ans=0.125 2023-06-19 21:44:17,070 INFO [train.py:996] (2/4) Epoch 3, batch 9300, loss[loss=0.2972, simple_loss=0.374, pruned_loss=0.1103, over 21575.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3346, pruned_loss=0.09767, over 4266336.59 frames. ], batch size: 414, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:45:55,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=421914.0, ans=0.0 2023-06-19 21:46:29,840 INFO [train.py:996] (2/4) Epoch 3, batch 9350, loss[loss=0.2801, simple_loss=0.35, pruned_loss=0.1051, over 21639.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3397, pruned_loss=0.09768, over 4267244.77 frames. ], batch size: 263, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:46:37,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=422034.0, ans=0.035 2023-06-19 21:46:59,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=422034.0, ans=0.125 2023-06-19 21:47:10,739 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:47:47,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-19 21:47:50,598 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.663e+02 3.172e+02 3.860e+02 6.008e+02, threshold=6.345e+02, percent-clipped=0.0 2023-06-19 21:48:00,033 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2023-06-19 21:48:05,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=422214.0, ans=0.125 2023-06-19 21:48:22,864 INFO [train.py:996] (2/4) Epoch 3, batch 9400, loss[loss=0.2276, simple_loss=0.2888, pruned_loss=0.08322, over 21686.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3403, pruned_loss=0.09841, over 4262333.56 frames. 
], batch size: 282, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:48:39,154 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:49:18,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=422454.0, ans=0.125 2023-06-19 21:49:22,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=422514.0, ans=0.0 2023-06-19 21:50:11,533 INFO [train.py:996] (2/4) Epoch 3, batch 9450, loss[loss=0.2394, simple_loss=0.2984, pruned_loss=0.0902, over 21991.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.331, pruned_loss=0.09644, over 4264308.00 frames. ], batch size: 103, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:50:54,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=422694.0, ans=0.1 2023-06-19 21:51:28,766 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.824e+02 3.255e+02 3.921e+02 7.411e+02, threshold=6.510e+02, percent-clipped=5.0 2023-06-19 21:51:29,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=422814.0, ans=0.2 2023-06-19 21:51:32,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=422814.0, ans=0.0 2023-06-19 21:51:36,592 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:52:04,913 INFO [train.py:996] (2/4) Epoch 3, batch 9500, loss[loss=0.1995, simple_loss=0.265, pruned_loss=0.06702, over 21332.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3228, pruned_loss=0.09399, over 4272099.18 frames. ], batch size: 211, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:52:06,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=422934.0, ans=0.2 2023-06-19 21:52:08,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=422934.0, ans=0.125 2023-06-19 21:52:24,297 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:52:31,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=422994.0, ans=0.1 2023-06-19 21:53:25,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.73 vs. limit=15.0 2023-06-19 21:53:51,866 INFO [train.py:996] (2/4) Epoch 3, batch 9550, loss[loss=0.2927, simple_loss=0.3655, pruned_loss=0.1099, over 21471.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3286, pruned_loss=0.09628, over 4263885.04 frames. ], batch size: 159, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:53:56,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. 
limit=10.0 2023-06-19 21:55:02,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=423354.0, ans=0.02 2023-06-19 21:55:07,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=423414.0, ans=0.125 2023-06-19 21:55:14,490 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.688e+02 3.199e+02 3.613e+02 5.972e+02, threshold=6.398e+02, percent-clipped=0.0 2023-06-19 21:55:31,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=423474.0, ans=0.0 2023-06-19 21:55:35,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.45 vs. limit=15.0 2023-06-19 21:55:40,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=423534.0, ans=0.0 2023-06-19 21:55:41,567 INFO [train.py:996] (2/4) Epoch 3, batch 9600, loss[loss=0.2915, simple_loss=0.3388, pruned_loss=0.1221, over 21846.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3317, pruned_loss=0.09796, over 4270869.50 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:56:11,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=423594.0, ans=0.125 2023-06-19 21:57:18,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.11 vs. limit=10.0 2023-06-19 21:57:31,851 INFO [train.py:996] (2/4) Epoch 3, batch 9650, loss[loss=0.3, simple_loss=0.3665, pruned_loss=0.1168, over 21227.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3336, pruned_loss=0.0983, over 4273067.76 frames. ], batch size: 143, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:58:32,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-06-19 21:59:07,585 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.684e+02 3.158e+02 3.769e+02 7.574e+02, threshold=6.315e+02, percent-clipped=3.0 2023-06-19 21:59:12,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=424014.0, ans=0.125 2023-06-19 21:59:15,923 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-19 21:59:52,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=424134.0, ans=0.025 2023-06-19 21:59:53,772 INFO [train.py:996] (2/4) Epoch 3, batch 9700, loss[loss=0.3575, simple_loss=0.399, pruned_loss=0.158, over 21384.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3369, pruned_loss=0.09946, over 4267367.57 frames. 
], batch size: 507, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:00:04,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=424134.0, ans=0.125 2023-06-19 22:00:15,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=424194.0, ans=0.0 2023-06-19 22:01:26,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-19 22:01:32,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=424374.0, ans=0.2 2023-06-19 22:01:38,962 INFO [train.py:996] (2/4) Epoch 3, batch 9750, loss[loss=0.2484, simple_loss=0.3059, pruned_loss=0.09542, over 21839.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3311, pruned_loss=0.0979, over 4262888.44 frames. ], batch size: 98, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:01:51,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=424434.0, ans=0.0 2023-06-19 22:02:01,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=424494.0, ans=0.125 2023-06-19 22:02:44,339 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.697e+02 3.201e+02 3.859e+02 6.121e+02, threshold=6.401e+02, percent-clipped=0.0 2023-06-19 22:03:13,840 INFO [train.py:996] (2/4) Epoch 3, batch 9800, loss[loss=0.3001, simple_loss=0.3447, pruned_loss=0.1277, over 21800.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3298, pruned_loss=0.09764, over 4247232.18 frames. ], batch size: 510, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:03:43,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=424794.0, ans=0.0 2023-06-19 22:04:53,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-19 22:04:59,792 INFO [train.py:996] (2/4) Epoch 3, batch 9850, loss[loss=0.2174, simple_loss=0.2812, pruned_loss=0.07678, over 21813.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3264, pruned_loss=0.09722, over 4254662.02 frames. ], batch size: 98, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:05:44,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-19 22:06:02,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=425154.0, ans=0.125 2023-06-19 22:06:04,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.13 vs. 
limit=15.0 2023-06-19 22:06:23,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.545e+02 2.826e+02 3.275e+02 4.693e+02, threshold=5.651e+02, percent-clipped=0.0 2023-06-19 22:06:27,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=425214.0, ans=0.0 2023-06-19 22:06:39,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=425274.0, ans=0.125 2023-06-19 22:06:39,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=425274.0, ans=0.0 2023-06-19 22:06:41,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-19 22:07:14,425 INFO [train.py:996] (2/4) Epoch 3, batch 9900, loss[loss=0.2336, simple_loss=0.3003, pruned_loss=0.08345, over 21671.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3203, pruned_loss=0.09615, over 4261853.55 frames. ], batch size: 298, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:07:26,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=425334.0, ans=0.125 2023-06-19 22:07:33,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=425394.0, ans=0.125 2023-06-19 22:08:01,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=425454.0, ans=0.125 2023-06-19 22:08:32,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=425574.0, ans=0.2 2023-06-19 22:08:51,744 INFO [train.py:996] (2/4) Epoch 3, batch 9950, loss[loss=0.3358, simple_loss=0.3842, pruned_loss=0.1437, over 21407.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3228, pruned_loss=0.09844, over 4260792.49 frames. ], batch size: 471, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:08:52,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=425634.0, ans=0.015 2023-06-19 22:09:08,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-19 22:09:11,628 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=12.0 2023-06-19 22:09:14,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-19 22:09:40,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0 2023-06-19 22:10:03,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=425814.0, ans=0.125 2023-06-19 22:10:11,503 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.786e+02 3.203e+02 4.575e+02 9.878e+02, threshold=6.406e+02, percent-clipped=15.0 2023-06-19 22:10:54,982 INFO [train.py:996] (2/4) Epoch 3, batch 10000, loss[loss=0.265, simple_loss=0.3305, pruned_loss=0.09978, over 21924.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3173, pruned_loss=0.09636, over 4272687.78 frames. 
], batch size: 372, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:11:12,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425934.0, ans=0.1 2023-06-19 22:11:32,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=425994.0, ans=0.05 2023-06-19 22:13:06,295 INFO [train.py:996] (2/4) Epoch 3, batch 10050, loss[loss=0.2518, simple_loss=0.3259, pruned_loss=0.0888, over 21596.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3219, pruned_loss=0.09762, over 4275038.40 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:13:06,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=426234.0, ans=0.1 2023-06-19 22:13:26,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426294.0, ans=0.1 2023-06-19 22:14:22,132 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.478e+02 2.837e+02 3.285e+02 4.855e+02, threshold=5.673e+02, percent-clipped=0.0 2023-06-19 22:14:40,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=426474.0, ans=0.125 2023-06-19 22:15:05,151 INFO [train.py:996] (2/4) Epoch 3, batch 10100, loss[loss=0.2665, simple_loss=0.3244, pruned_loss=0.1043, over 20048.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3191, pruned_loss=0.09598, over 4278061.84 frames. ], batch size: 702, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:15:41,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=426594.0, ans=0.125 2023-06-19 22:16:37,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-06-19 22:16:49,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=426714.0, ans=0.2 2023-06-19 22:17:13,929 INFO [train.py:996] (2/4) Epoch 3, batch 10150, loss[loss=0.2915, simple_loss=0.3543, pruned_loss=0.1143, over 21636.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3285, pruned_loss=0.1002, over 4280286.59 frames. 
], batch size: 441, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:17:58,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=426954.0, ans=0.125 2023-06-19 22:18:09,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=427014.0, ans=0.0 2023-06-19 22:18:11,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=427014.0, ans=0.125 2023-06-19 22:18:11,304 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:18:14,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=427014.0, ans=0.125 2023-06-19 22:18:15,037 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.616e+02 3.111e+02 3.928e+02 6.144e+02, threshold=6.222e+02, percent-clipped=2.0 2023-06-19 22:18:51,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=427074.0, ans=0.0 2023-06-19 22:18:54,598 INFO [train.py:996] (2/4) Epoch 3, batch 10200, loss[loss=0.2326, simple_loss=0.3166, pruned_loss=0.07425, over 21700.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3251, pruned_loss=0.09628, over 4277373.58 frames. ], batch size: 247, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:19:34,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=22.5 2023-06-19 22:20:31,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=427374.0, ans=0.125 2023-06-19 22:20:43,636 INFO [train.py:996] (2/4) Epoch 3, batch 10250, loss[loss=0.217, simple_loss=0.3037, pruned_loss=0.06518, over 21892.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3198, pruned_loss=0.08939, over 4281522.09 frames. ], batch size: 317, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:21:57,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.458e+02 2.783e+02 3.278e+02 6.025e+02, threshold=5.566e+02, percent-clipped=0.0 2023-06-19 22:22:16,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=427674.0, ans=0.0 2023-06-19 22:22:30,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=427674.0, ans=0.125 2023-06-19 22:22:34,958 INFO [train.py:996] (2/4) Epoch 3, batch 10300, loss[loss=0.2951, simple_loss=0.3617, pruned_loss=0.1142, over 21329.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3227, pruned_loss=0.09044, over 4285595.55 frames. ], batch size: 549, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:22:41,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=427734.0, ans=0.2 2023-06-19 22:22:52,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.70 vs. 
limit=15.0 2023-06-19 22:22:53,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=427794.0, ans=0.2 2023-06-19 22:23:13,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=427794.0, ans=0.125 2023-06-19 22:23:22,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=427854.0, ans=0.1 2023-06-19 22:24:16,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427914.0, ans=0.1 2023-06-19 22:24:35,486 INFO [train.py:996] (2/4) Epoch 3, batch 10350, loss[loss=0.2724, simple_loss=0.3492, pruned_loss=0.09778, over 21639.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3244, pruned_loss=0.09112, over 4284061.23 frames. ], batch size: 414, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 22:25:08,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-19 22:25:26,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=428154.0, ans=0.1 2023-06-19 22:25:34,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=428154.0, ans=0.1 2023-06-19 22:26:05,028 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.731e+02 3.156e+02 3.865e+02 6.319e+02, threshold=6.313e+02, percent-clipped=4.0 2023-06-19 22:26:11,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=428214.0, ans=0.05 2023-06-19 22:26:20,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=428274.0, ans=0.025 2023-06-19 22:26:38,494 INFO [train.py:996] (2/4) Epoch 3, batch 10400, loss[loss=0.1757, simple_loss=0.2253, pruned_loss=0.06306, over 21118.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3173, pruned_loss=0.08928, over 4278765.52 frames. ], batch size: 143, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:27:08,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-19 22:27:09,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-19 22:27:46,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=428454.0, ans=0.0 2023-06-19 22:28:15,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5 2023-06-19 22:28:42,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-19 22:28:45,451 INFO [train.py:996] (2/4) Epoch 3, batch 10450, loss[loss=0.2994, simple_loss=0.3713, pruned_loss=0.1138, over 21837.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3236, pruned_loss=0.09378, over 4275450.56 frames. 
], batch size: 371, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:29:16,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=428634.0, ans=0.07 2023-06-19 22:29:32,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.40 vs. limit=15.0 2023-06-19 22:30:01,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-06-19 22:30:16,373 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.823e+02 3.547e+02 4.478e+02 9.217e+02, threshold=7.094e+02, percent-clipped=7.0 2023-06-19 22:30:21,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=428814.0, ans=0.0 2023-06-19 22:30:47,236 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-19 22:30:55,063 INFO [train.py:996] (2/4) Epoch 3, batch 10500, loss[loss=0.2243, simple_loss=0.3056, pruned_loss=0.07145, over 20040.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3224, pruned_loss=0.09226, over 4265892.01 frames. ], batch size: 704, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:31:16,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-19 22:31:36,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=428994.0, ans=0.0 2023-06-19 22:32:26,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-19 22:32:27,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=429174.0, ans=0.2 2023-06-19 22:32:45,447 INFO [train.py:996] (2/4) Epoch 3, batch 10550, loss[loss=0.2557, simple_loss=0.322, pruned_loss=0.09472, over 21839.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3163, pruned_loss=0.09222, over 4262477.29 frames. ], batch size: 98, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:32:54,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=429234.0, ans=10.0 2023-06-19 22:33:15,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-19 22:34:01,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=429414.0, ans=0.125 2023-06-19 22:34:03,943 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.386e+02 2.842e+02 3.360e+02 5.942e+02, threshold=5.684e+02, percent-clipped=0.0 2023-06-19 22:34:08,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=429414.0, ans=0.0 2023-06-19 22:34:49,234 INFO [train.py:996] (2/4) Epoch 3, batch 10600, loss[loss=0.2135, simple_loss=0.2979, pruned_loss=0.0646, over 21678.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3123, pruned_loss=0.08993, over 4263893.93 frames. 
], batch size: 298, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:35:01,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=429534.0, ans=0.2 2023-06-19 22:35:39,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=429654.0, ans=0.125 2023-06-19 22:35:40,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=429654.0, ans=0.0 2023-06-19 22:36:52,424 INFO [train.py:996] (2/4) Epoch 3, batch 10650, loss[loss=0.178, simple_loss=0.2329, pruned_loss=0.06153, over 21237.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3137, pruned_loss=0.08769, over 4262351.95 frames. ], batch size: 143, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:37:16,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-19 22:38:19,922 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.760e+02 3.791e+02 4.826e+02 9.694e+02, threshold=7.582e+02, percent-clipped=13.0 2023-06-19 22:38:20,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=430014.0, ans=0.1 2023-06-19 22:38:56,299 INFO [train.py:996] (2/4) Epoch 3, batch 10700, loss[loss=0.2441, simple_loss=0.3138, pruned_loss=0.08723, over 21469.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.312, pruned_loss=0.08733, over 4261535.56 frames. ], batch size: 194, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:40:06,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=430254.0, ans=0.2 2023-06-19 22:41:02,112 INFO [train.py:996] (2/4) Epoch 3, batch 10750, loss[loss=0.3076, simple_loss=0.4019, pruned_loss=0.1067, over 19854.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.325, pruned_loss=0.09315, over 4266846.11 frames. ], batch size: 702, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:41:11,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=430434.0, ans=0.2 2023-06-19 22:41:25,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=430494.0, ans=0.125 2023-06-19 22:41:46,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=430494.0, ans=0.125 2023-06-19 22:41:53,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=430554.0, ans=0.2 2023-06-19 22:42:22,447 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.840e+02 3.445e+02 4.011e+02 6.659e+02, threshold=6.891e+02, percent-clipped=0.0 2023-06-19 22:42:22,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=430614.0, ans=0.125 2023-06-19 22:43:03,919 INFO [train.py:996] (2/4) Epoch 3, batch 10800, loss[loss=0.2895, simple_loss=0.3584, pruned_loss=0.1102, over 19901.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3299, pruned_loss=0.09415, over 4265625.41 frames. 
], batch size: 702, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:44:32,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-19 22:44:33,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-19 22:45:00,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=431034.0, ans=0.2 2023-06-19 22:45:01,152 INFO [train.py:996] (2/4) Epoch 3, batch 10850, loss[loss=0.2379, simple_loss=0.3022, pruned_loss=0.08674, over 21766.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3304, pruned_loss=0.09449, over 4268806.96 frames. ], batch size: 102, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:46:06,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431214.0, ans=0.1 2023-06-19 22:46:22,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.020e+02 2.541e+02 3.011e+02 3.427e+02 5.318e+02, threshold=6.021e+02, percent-clipped=0.0 2023-06-19 22:46:26,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=431214.0, ans=0.125 2023-06-19 22:46:46,205 INFO [train.py:996] (2/4) Epoch 3, batch 10900, loss[loss=0.2865, simple_loss=0.3717, pruned_loss=0.1006, over 21406.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3233, pruned_loss=0.09235, over 4276695.40 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:46:50,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=431334.0, ans=0.125 2023-06-19 22:47:43,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=431454.0, ans=0.125 2023-06-19 22:48:26,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=431514.0, ans=0.0 2023-06-19 22:48:32,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=431574.0, ans=0.125 2023-06-19 22:48:39,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=431574.0, ans=0.05 2023-06-19 22:48:40,602 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:48:42,865 INFO [train.py:996] (2/4) Epoch 3, batch 10950, loss[loss=0.2535, simple_loss=0.3011, pruned_loss=0.103, over 21450.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3164, pruned_loss=0.09053, over 4269698.49 frames. 
], batch size: 441, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:48:49,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=431634.0, ans=0.0 2023-06-19 22:49:09,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=431694.0, ans=0.0 2023-06-19 22:49:42,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=431754.0, ans=0.125 2023-06-19 22:50:09,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.597e+02 3.255e+02 3.821e+02 6.519e+02, threshold=6.510e+02, percent-clipped=2.0 2023-06-19 22:50:29,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=431874.0, ans=0.1 2023-06-19 22:50:45,268 INFO [train.py:996] (2/4) Epoch 3, batch 11000, loss[loss=0.2847, simple_loss=0.3346, pruned_loss=0.1174, over 21751.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3163, pruned_loss=0.09116, over 4277983.87 frames. ], batch size: 441, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:51:47,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=432054.0, ans=0.0 2023-06-19 22:52:26,897 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:52:30,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-19 22:52:33,731 INFO [train.py:996] (2/4) Epoch 3, batch 11050, loss[loss=0.2254, simple_loss=0.2838, pruned_loss=0.08348, over 21594.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3146, pruned_loss=0.09232, over 4283515.25 frames. ], batch size: 298, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:52:59,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.52 vs. limit=10.0 2023-06-19 22:53:23,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432354.0, ans=0.1 2023-06-19 22:53:46,625 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.920e+02 3.396e+02 4.201e+02 8.716e+02, threshold=6.791e+02, percent-clipped=3.0 2023-06-19 22:54:10,306 INFO [train.py:996] (2/4) Epoch 3, batch 11100, loss[loss=0.3157, simple_loss=0.3796, pruned_loss=0.1259, over 21407.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3136, pruned_loss=0.09317, over 4286253.04 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:54:21,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.20 vs. limit=5.0 2023-06-19 22:56:14,884 INFO [train.py:996] (2/4) Epoch 3, batch 11150, loss[loss=0.2246, simple_loss=0.3032, pruned_loss=0.07295, over 21512.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3112, pruned_loss=0.09247, over 4271238.96 frames. 
], batch size: 230, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:56:24,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=432834.0, ans=0.2 2023-06-19 22:57:02,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=432954.0, ans=0.125 2023-06-19 22:57:12,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=432954.0, ans=0.0 2023-06-19 22:57:37,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=433014.0, ans=0.125 2023-06-19 22:57:41,396 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.651e+02 2.885e+02 3.515e+02 7.934e+02, threshold=5.769e+02, percent-clipped=2.0 2023-06-19 22:57:51,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-19 22:57:53,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433074.0, ans=0.1 2023-06-19 22:58:09,300 INFO [train.py:996] (2/4) Epoch 3, batch 11200, loss[loss=0.2409, simple_loss=0.2935, pruned_loss=0.09408, over 22020.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3112, pruned_loss=0.09115, over 4267922.71 frames. ], batch size: 103, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:59:46,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=433374.0, ans=0.125 2023-06-19 23:00:02,714 INFO [train.py:996] (2/4) Epoch 3, batch 11250, loss[loss=0.2283, simple_loss=0.2903, pruned_loss=0.08313, over 21189.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.31, pruned_loss=0.09077, over 4265740.69 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:00:11,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433434.0, ans=0.1 2023-06-19 23:00:42,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-19 23:01:31,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.67 vs. limit=10.0 2023-06-19 23:01:37,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=433614.0, ans=0.2 2023-06-19 23:01:40,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.431e+02 2.742e+02 3.352e+02 9.461e+02, threshold=5.484e+02, percent-clipped=5.0 2023-06-19 23:01:43,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=433614.0, ans=0.0 2023-06-19 23:01:45,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=433614.0, ans=0.0 2023-06-19 23:02:04,706 INFO [train.py:996] (2/4) Epoch 3, batch 11300, loss[loss=0.2304, simple_loss=0.2957, pruned_loss=0.08259, over 21539.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3108, pruned_loss=0.09097, over 4276553.23 frames. 
], batch size: 212, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:02:07,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=433734.0, ans=22.5 2023-06-19 23:02:09,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433734.0, ans=0.1 2023-06-19 23:03:14,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=433854.0, ans=0.2 2023-06-19 23:03:51,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=433974.0, ans=0.125 2023-06-19 23:04:00,911 INFO [train.py:996] (2/4) Epoch 3, batch 11350, loss[loss=0.3387, simple_loss=0.3906, pruned_loss=0.1434, over 21420.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3146, pruned_loss=0.09154, over 4278741.33 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:04:57,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=434154.0, ans=0.125 2023-06-19 23:05:18,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-19 23:05:21,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=434214.0, ans=0.125 2023-06-19 23:05:25,135 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.877e+02 3.365e+02 4.461e+02 8.303e+02, threshold=6.730e+02, percent-clipped=12.0 2023-06-19 23:05:39,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=434214.0, ans=0.125 2023-06-19 23:06:05,029 INFO [train.py:996] (2/4) Epoch 3, batch 11400, loss[loss=0.2868, simple_loss=0.3496, pruned_loss=0.112, over 21715.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3204, pruned_loss=0.09409, over 4278474.43 frames. ], batch size: 124, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:06:31,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=434334.0, ans=0.2 2023-06-19 23:06:31,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=434334.0, ans=0.125 2023-06-19 23:06:45,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=434394.0, ans=0.125 2023-06-19 23:06:52,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=434394.0, ans=0.125 2023-06-19 23:07:18,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434454.0, ans=0.1 2023-06-19 23:07:41,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-19 23:08:04,087 INFO [train.py:996] (2/4) Epoch 3, batch 11450, loss[loss=0.356, simple_loss=0.3963, pruned_loss=0.1578, over 21409.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3223, pruned_loss=0.09347, over 4274022.69 frames. 
], batch size: 508, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:08:30,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=434634.0, ans=0.125 2023-06-19 23:09:26,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=434814.0, ans=0.2 2023-06-19 23:09:32,769 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.596e+02 3.109e+02 3.660e+02 5.880e+02, threshold=6.218e+02, percent-clipped=0.0 2023-06-19 23:09:35,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.30 vs. limit=22.5 2023-06-19 23:09:58,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=434874.0, ans=0.125 2023-06-19 23:10:05,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434874.0, ans=0.1 2023-06-19 23:10:08,102 INFO [train.py:996] (2/4) Epoch 3, batch 11500, loss[loss=0.248, simple_loss=0.3396, pruned_loss=0.07822, over 21647.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3251, pruned_loss=0.09512, over 4267416.95 frames. ], batch size: 389, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:11:38,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=435114.0, ans=0.5 2023-06-19 23:11:45,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=435114.0, ans=0.125 2023-06-19 23:12:15,584 INFO [train.py:996] (2/4) Epoch 3, batch 11550, loss[loss=0.2609, simple_loss=0.3407, pruned_loss=0.09059, over 21378.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3299, pruned_loss=0.09437, over 4269230.73 frames. ], batch size: 194, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:13:01,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-19 23:13:24,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=435354.0, ans=0.0 2023-06-19 23:13:29,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-19 23:13:53,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=435414.0, ans=0.0 2023-06-19 23:13:54,258 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.665e+02 3.265e+02 4.118e+02 8.231e+02, threshold=6.531e+02, percent-clipped=5.0 2023-06-19 23:13:54,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=435414.0, ans=0.125 2023-06-19 23:14:29,450 INFO [train.py:996] (2/4) Epoch 3, batch 11600, loss[loss=0.2582, simple_loss=0.3516, pruned_loss=0.08241, over 21616.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3415, pruned_loss=0.09602, over 4271020.91 frames. 
], batch size: 230, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:14:59,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=435594.0, ans=0.0 2023-06-19 23:15:08,475 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:15:50,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.27 vs. limit=15.0 2023-06-19 23:16:00,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=435774.0, ans=0.125 2023-06-19 23:16:14,738 INFO [train.py:996] (2/4) Epoch 3, batch 11650, loss[loss=0.3758, simple_loss=0.449, pruned_loss=0.1513, over 21519.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3488, pruned_loss=0.09698, over 4261924.77 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:16:43,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=435834.0, ans=0.0 2023-06-19 23:16:59,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=435954.0, ans=0.2 2023-06-19 23:17:04,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.18 vs. limit=22.5 2023-06-19 23:17:23,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=435954.0, ans=0.0 2023-06-19 23:17:47,549 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.535e+02 2.971e+02 3.665e+02 6.378e+02, threshold=5.943e+02, percent-clipped=0.0 2023-06-19 23:17:52,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=436014.0, ans=0.125 2023-06-19 23:18:18,003 INFO [train.py:996] (2/4) Epoch 3, batch 11700, loss[loss=0.2317, simple_loss=0.2892, pruned_loss=0.08711, over 21313.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.34, pruned_loss=0.09674, over 4264056.03 frames. ], batch size: 160, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:18:43,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=436194.0, ans=0.125 2023-06-19 23:18:46,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-19 23:19:01,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=436254.0, ans=0.0 2023-06-19 23:19:50,347 INFO [train.py:996] (2/4) Epoch 3, batch 11750, loss[loss=0.2208, simple_loss=0.2784, pruned_loss=0.08159, over 21855.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3306, pruned_loss=0.09622, over 4261226.26 frames. 
], batch size: 250, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:19:50,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=436434.0, ans=0.0 2023-06-19 23:20:06,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=436434.0, ans=0.125 2023-06-19 23:20:11,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=436494.0, ans=0.2 2023-06-19 23:20:26,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=436554.0, ans=0.125 2023-06-19 23:20:51,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436614.0, ans=0.1 2023-06-19 23:20:53,728 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.589e+02 3.123e+02 3.553e+02 5.230e+02, threshold=6.245e+02, percent-clipped=0.0 2023-06-19 23:21:07,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=436674.0, ans=0.125 2023-06-19 23:21:23,186 INFO [train.py:996] (2/4) Epoch 3, batch 11800, loss[loss=0.2626, simple_loss=0.3228, pruned_loss=0.1012, over 21207.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3334, pruned_loss=0.09898, over 4264777.40 frames. ], batch size: 143, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:23:20,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=436974.0, ans=0.125 2023-06-19 23:23:33,139 INFO [train.py:996] (2/4) Epoch 3, batch 11850, loss[loss=0.2536, simple_loss=0.3327, pruned_loss=0.0872, over 21676.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3365, pruned_loss=0.09769, over 4263367.20 frames. ], batch size: 230, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:24:15,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=437094.0, ans=0.125 2023-06-19 23:24:54,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.604e+02 3.025e+02 3.520e+02 5.085e+02, threshold=6.049e+02, percent-clipped=0.0 2023-06-19 23:24:57,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.14 vs. limit=15.0 2023-06-19 23:25:26,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437274.0, ans=0.1 2023-06-19 23:25:30,664 INFO [train.py:996] (2/4) Epoch 3, batch 11900, loss[loss=0.2534, simple_loss=0.3303, pruned_loss=0.08823, over 21822.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3371, pruned_loss=0.09563, over 4260618.20 frames. 
], batch size: 371, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:25:53,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=437394.0, ans=0.125 2023-06-19 23:26:15,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=437454.0, ans=15.0 2023-06-19 23:26:22,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=437454.0, ans=0.2 2023-06-19 23:26:59,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.12 vs. limit=15.0 2023-06-19 23:27:08,737 INFO [train.py:996] (2/4) Epoch 3, batch 11950, loss[loss=0.1921, simple_loss=0.2755, pruned_loss=0.05435, over 21596.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3358, pruned_loss=0.09151, over 4248871.66 frames. ], batch size: 230, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 23:27:14,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=437634.0, ans=0.0 2023-06-19 23:27:31,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-19 23:28:03,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=437754.0, ans=0.2 2023-06-19 23:28:28,550 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.557e+02 3.205e+02 3.971e+02 7.967e+02, threshold=6.411e+02, percent-clipped=3.0 2023-06-19 23:28:29,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=437814.0, ans=0.02 2023-06-19 23:28:33,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=437814.0, ans=0.025 2023-06-19 23:28:35,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=437814.0, ans=0.125 2023-06-19 23:28:58,287 INFO [train.py:996] (2/4) Epoch 3, batch 12000, loss[loss=0.242, simple_loss=0.2949, pruned_loss=0.09452, over 21616.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3329, pruned_loss=0.09024, over 4248932.77 frames. ], batch size: 332, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 23:28:58,288 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 23:29:56,194 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2725, simple_loss=0.3684, pruned_loss=0.08831, over 1796401.00 frames. 2023-06-19 23:29:56,195 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-19 23:30:34,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=437994.0, ans=0.125 2023-06-19 23:30:51,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=438054.0, ans=0.0 2023-06-19 23:31:34,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=438174.0, ans=0.0 2023-06-19 23:31:52,699 INFO [train.py:996] (2/4) Epoch 3, batch 12050, loss[loss=0.2392, simple_loss=0.292, pruned_loss=0.09319, over 21571.00 frames. 
], tot_loss[loss=0.2567, simple_loss=0.3286, pruned_loss=0.09242, over 4258765.84 frames. ], batch size: 212, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 23:31:53,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=438234.0, ans=0.0 2023-06-19 23:32:04,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=438234.0, ans=0.125 2023-06-19 23:32:07,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=438234.0, ans=0.125 2023-06-19 23:32:38,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=438354.0, ans=0.2 2023-06-19 23:32:40,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.80 vs. limit=15.0 2023-06-19 23:33:06,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.716e+02 3.222e+02 3.934e+02 6.549e+02, threshold=6.444e+02, percent-clipped=1.0 2023-06-19 23:33:14,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=438474.0, ans=0.2 2023-06-19 23:33:14,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=438474.0, ans=0.125 2023-06-19 23:33:45,632 INFO [train.py:996] (2/4) Epoch 3, batch 12100, loss[loss=0.2162, simple_loss=0.3304, pruned_loss=0.051, over 19861.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3339, pruned_loss=0.0977, over 4267015.66 frames. ], batch size: 703, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 23:34:28,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=438594.0, ans=0.2 2023-06-19 23:34:35,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=438654.0, ans=0.0 2023-06-19 23:34:45,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=438654.0, ans=0.125 2023-06-19 23:35:16,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=438714.0, ans=0.125 2023-06-19 23:36:03,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=438774.0, ans=0.0 2023-06-19 23:36:06,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=438774.0, ans=0.0 2023-06-19 23:36:07,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438834.0, ans=0.1 2023-06-19 23:36:08,828 INFO [train.py:996] (2/4) Epoch 3, batch 12150, loss[loss=0.2822, simple_loss=0.3788, pruned_loss=0.09281, over 21668.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3359, pruned_loss=0.09734, over 4264177.91 frames. 
], batch size: 414, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:36:19,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=438834.0, ans=0.125 2023-06-19 23:36:20,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=438834.0, ans=0.0 2023-06-19 23:37:37,836 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.132e+02 3.694e+02 4.434e+02 6.753e+02, threshold=7.387e+02, percent-clipped=1.0 2023-06-19 23:38:11,589 INFO [train.py:996] (2/4) Epoch 3, batch 12200, loss[loss=0.2723, simple_loss=0.3134, pruned_loss=0.1156, over 21496.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3298, pruned_loss=0.09679, over 4255006.52 frames. ], batch size: 441, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:38:34,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=439194.0, ans=0.125 2023-06-19 23:38:50,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439254.0, ans=0.1 2023-06-19 23:39:12,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=439314.0, ans=0.125 2023-06-19 23:39:59,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=439374.0, ans=0.0 2023-06-19 23:40:09,930 INFO [train.py:996] (2/4) Epoch 3, batch 12250, loss[loss=0.2094, simple_loss=0.2899, pruned_loss=0.06444, over 21740.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3204, pruned_loss=0.09233, over 4255603.61 frames. ], batch size: 351, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:40:15,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=439434.0, ans=0.0 2023-06-19 23:40:22,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.52 vs. limit=15.0 2023-06-19 23:40:59,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=439614.0, ans=0.125 2023-06-19 23:41:12,268 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 2.432e+02 2.824e+02 3.455e+02 5.989e+02, threshold=5.649e+02, percent-clipped=0.0 2023-06-19 23:41:39,432 INFO [train.py:996] (2/4) Epoch 3, batch 12300, loss[loss=0.199, simple_loss=0.2721, pruned_loss=0.06291, over 21344.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3096, pruned_loss=0.0845, over 4261164.07 frames. ], batch size: 176, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:43:39,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=439974.0, ans=0.125 2023-06-19 23:43:49,788 INFO [train.py:996] (2/4) Epoch 3, batch 12350, loss[loss=0.2647, simple_loss=0.3314, pruned_loss=0.099, over 21294.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3142, pruned_loss=0.08422, over 4261805.96 frames. 
], batch size: 176, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:44:02,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=440034.0, ans=0.5 2023-06-19 23:44:51,642 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.628e+02 3.014e+02 3.788e+02 7.072e+02, threshold=6.028e+02, percent-clipped=4.0 2023-06-19 23:45:23,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-19 23:45:29,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=440274.0, ans=0.125 2023-06-19 23:45:31,947 INFO [train.py:996] (2/4) Epoch 3, batch 12400, loss[loss=0.3446, simple_loss=0.3734, pruned_loss=0.1579, over 21755.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3176, pruned_loss=0.08956, over 4266121.25 frames. ], batch size: 508, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 23:46:02,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=440394.0, ans=0.125 2023-06-19 23:46:31,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=440454.0, ans=0.125 2023-06-19 23:46:45,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=440454.0, ans=0.0 2023-06-19 23:47:47,501 INFO [train.py:996] (2/4) Epoch 3, batch 12450, loss[loss=0.2729, simple_loss=0.3316, pruned_loss=0.1072, over 20854.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3221, pruned_loss=0.09309, over 4274605.69 frames. ], batch size: 608, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 23:47:56,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=440634.0, ans=0.0 2023-06-19 23:48:04,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=440634.0, ans=0.025 2023-06-19 23:48:12,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=440694.0, ans=0.125 2023-06-19 23:48:35,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-19 23:48:57,892 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.789e+02 3.094e+02 3.461e+02 5.452e+02, threshold=6.188e+02, percent-clipped=0.0 2023-06-19 23:49:07,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=440814.0, ans=0.125 2023-06-19 23:49:31,674 INFO [train.py:996] (2/4) Epoch 3, batch 12500, loss[loss=0.2973, simple_loss=0.4009, pruned_loss=0.09682, over 21641.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3339, pruned_loss=0.09739, over 4273143.43 frames. 
], batch size: 389, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 23:49:34,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=440934.0, ans=0.09899494936611666 2023-06-19 23:49:36,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=440934.0, ans=0.125 2023-06-19 23:51:13,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=441114.0, ans=0.125 2023-06-19 23:51:13,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=441114.0, ans=0.09899494936611666 2023-06-19 23:51:45,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-19 23:51:48,448 INFO [train.py:996] (2/4) Epoch 3, batch 12550, loss[loss=0.2314, simple_loss=0.2829, pruned_loss=0.08998, over 19942.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3398, pruned_loss=0.1006, over 4278542.35 frames. ], batch size: 703, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:52:05,582 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-19 23:52:37,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0 2023-06-19 23:53:15,583 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.900e+02 3.527e+02 4.002e+02 7.299e+02, threshold=7.054e+02, percent-clipped=3.0 2023-06-19 23:54:00,524 INFO [train.py:996] (2/4) Epoch 3, batch 12600, loss[loss=0.2164, simple_loss=0.2946, pruned_loss=0.06909, over 21182.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3379, pruned_loss=0.09751, over 4277153.94 frames. ], batch size: 159, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:54:21,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=441594.0, ans=0.125 2023-06-19 23:54:41,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=441654.0, ans=0.125 2023-06-19 23:55:46,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=441774.0, ans=0.0 2023-06-19 23:55:49,629 INFO [train.py:996] (2/4) Epoch 3, batch 12650, loss[loss=0.2217, simple_loss=0.2842, pruned_loss=0.0796, over 21430.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3297, pruned_loss=0.0928, over 4279006.44 frames. 
], batch size: 211, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:56:03,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=441834.0, ans=0.0 2023-06-19 23:56:20,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=441894.0, ans=0.125 2023-06-19 23:56:40,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=441954.0, ans=0.125 2023-06-19 23:57:05,141 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 2.355e+02 2.861e+02 3.372e+02 5.218e+02, threshold=5.723e+02, percent-clipped=0.0 2023-06-19 23:57:37,729 INFO [train.py:996] (2/4) Epoch 3, batch 12700, loss[loss=0.3074, simple_loss=0.3605, pruned_loss=0.1271, over 21182.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.33, pruned_loss=0.09506, over 4275704.15 frames. ], batch size: 143, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:58:32,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=442254.0, ans=0.2 2023-06-19 23:58:49,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-19 23:58:53,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442314.0, ans=0.1 2023-06-19 23:58:57,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=442314.0, ans=0.0 2023-06-19 23:59:35,039 INFO [train.py:996] (2/4) Epoch 3, batch 12750, loss[loss=0.2621, simple_loss=0.3343, pruned_loss=0.09499, over 21894.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3326, pruned_loss=0.09616, over 4274097.33 frames. ], batch size: 316, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:01:08,593 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.647e+02 3.084e+02 3.611e+02 5.455e+02, threshold=6.168e+02, percent-clipped=0.0 2023-06-20 00:01:43,925 INFO [train.py:996] (2/4) Epoch 3, batch 12800, loss[loss=0.2556, simple_loss=0.3236, pruned_loss=0.09376, over 20754.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3311, pruned_loss=0.09639, over 4283610.88 frames. ], batch size: 607, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:02:34,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=442854.0, ans=0.125 2023-06-20 00:02:34,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=442854.0, ans=0.125 2023-06-20 00:02:45,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=442854.0, ans=0.125 2023-06-20 00:03:41,817 INFO [train.py:996] (2/4) Epoch 3, batch 12850, loss[loss=0.2997, simple_loss=0.3902, pruned_loss=0.1046, over 20747.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3343, pruned_loss=0.09885, over 4278882.52 frames. 
], batch size: 607, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:03:46,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=443034.0, ans=0.125 2023-06-20 00:04:04,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=443034.0, ans=0.125 2023-06-20 00:04:11,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-20 00:04:57,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-20 00:05:19,845 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.657e+02 3.038e+02 3.599e+02 6.295e+02, threshold=6.075e+02, percent-clipped=1.0 2023-06-20 00:05:49,491 INFO [train.py:996] (2/4) Epoch 3, batch 12900, loss[loss=0.2304, simple_loss=0.3085, pruned_loss=0.07614, over 21470.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.331, pruned_loss=0.09458, over 4276573.90 frames. ], batch size: 212, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:06:43,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-20 00:07:04,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=443454.0, ans=10.0 2023-06-20 00:07:43,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=443574.0, ans=0.1 2023-06-20 00:07:52,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=443574.0, ans=0.125 2023-06-20 00:07:58,458 INFO [train.py:996] (2/4) Epoch 3, batch 12950, loss[loss=0.1945, simple_loss=0.2663, pruned_loss=0.06137, over 21269.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3288, pruned_loss=0.09228, over 4272985.00 frames. ], batch size: 176, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:08:12,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=443694.0, ans=0.125 2023-06-20 00:08:16,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=443694.0, ans=0.125 2023-06-20 00:08:51,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=443754.0, ans=0.04949747468305833 2023-06-20 00:09:09,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-20 00:09:30,308 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.524e+02 2.816e+02 3.145e+02 4.812e+02, threshold=5.631e+02, percent-clipped=0.0 2023-06-20 00:09:39,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=443874.0, ans=0.0 2023-06-20 00:09:57,392 INFO [train.py:996] (2/4) Epoch 3, batch 13000, loss[loss=0.202, simple_loss=0.2721, pruned_loss=0.06596, over 21283.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3321, pruned_loss=0.09302, over 4269490.76 frames. 
], batch size: 176, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:10:13,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=443934.0, ans=0.04949747468305833 2023-06-20 00:12:05,457 INFO [train.py:996] (2/4) Epoch 3, batch 13050, loss[loss=0.2577, simple_loss=0.3194, pruned_loss=0.09801, over 21516.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3255, pruned_loss=0.08926, over 4265201.88 frames. ], batch size: 548, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:12:10,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=444234.0, ans=0.1 2023-06-20 00:12:22,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=444294.0, ans=0.125 2023-06-20 00:12:29,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=444294.0, ans=0.1 2023-06-20 00:12:56,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=444354.0, ans=0.125 2023-06-20 00:13:18,538 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 2.600e+02 3.229e+02 4.013e+02 6.675e+02, threshold=6.458e+02, percent-clipped=7.0 2023-06-20 00:13:41,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=444474.0, ans=0.125 2023-06-20 00:13:48,104 INFO [train.py:996] (2/4) Epoch 3, batch 13100, loss[loss=0.2946, simple_loss=0.365, pruned_loss=0.1121, over 21592.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3273, pruned_loss=0.09034, over 4270075.55 frames. ], batch size: 507, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:14:25,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.91 vs. limit=15.0 2023-06-20 00:14:41,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=444594.0, ans=0.125 2023-06-20 00:14:50,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=444654.0, ans=0.0 2023-06-20 00:15:44,212 INFO [train.py:996] (2/4) Epoch 3, batch 13150, loss[loss=0.2076, simple_loss=0.2815, pruned_loss=0.06687, over 21725.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3296, pruned_loss=0.09396, over 4277321.94 frames. ], batch size: 282, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:17:23,501 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.624e+02 3.071e+02 3.652e+02 5.306e+02, threshold=6.141e+02, percent-clipped=0.0 2023-06-20 00:17:47,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=445134.0, ans=10.0 2023-06-20 00:17:48,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.41 vs. limit=15.0 2023-06-20 00:17:48,413 INFO [train.py:996] (2/4) Epoch 3, batch 13200, loss[loss=0.2615, simple_loss=0.3288, pruned_loss=0.09711, over 21229.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3277, pruned_loss=0.09447, over 4273247.10 frames. 
], batch size: 143, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:17:54,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=445134.0, ans=0.125 2023-06-20 00:17:59,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=445134.0, ans=0.125 2023-06-20 00:17:59,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=445134.0, ans=0.0 2023-06-20 00:18:27,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=445194.0, ans=0.0 2023-06-20 00:18:27,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=445194.0, ans=0.125 2023-06-20 00:19:31,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=445374.0, ans=0.0 2023-06-20 00:19:56,206 INFO [train.py:996] (2/4) Epoch 3, batch 13250, loss[loss=0.2451, simple_loss=0.3265, pruned_loss=0.08183, over 21566.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3287, pruned_loss=0.09581, over 4278750.46 frames. ], batch size: 263, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:20:21,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=445494.0, ans=0.0 2023-06-20 00:20:35,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-20 00:21:40,767 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.563e+02 2.914e+02 3.301e+02 4.888e+02, threshold=5.829e+02, percent-clipped=0.0 2023-06-20 00:22:12,923 INFO [train.py:996] (2/4) Epoch 3, batch 13300, loss[loss=0.368, simple_loss=0.4168, pruned_loss=0.1596, over 21356.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3332, pruned_loss=0.09566, over 4273095.84 frames. ], batch size: 507, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:23:38,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=445914.0, ans=0.0 2023-06-20 00:23:51,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445974.0, ans=0.1 2023-06-20 00:24:11,649 INFO [train.py:996] (2/4) Epoch 3, batch 13350, loss[loss=0.2772, simple_loss=0.347, pruned_loss=0.1037, over 21405.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3369, pruned_loss=0.09799, over 4270904.99 frames. ], batch size: 194, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:24:12,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=446034.0, ans=0.2 2023-06-20 00:25:11,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=446094.0, ans=0.125 2023-06-20 00:25:12,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=446154.0, ans=0.125 2023-06-20 00:25:19,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. 
limit=15.0 2023-06-20 00:25:31,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=446154.0, ans=0.1 2023-06-20 00:25:50,619 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.814e+02 3.239e+02 3.845e+02 6.415e+02, threshold=6.478e+02, percent-clipped=2.0 2023-06-20 00:26:05,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=446274.0, ans=0.125 2023-06-20 00:26:29,093 INFO [train.py:996] (2/4) Epoch 3, batch 13400, loss[loss=0.2868, simple_loss=0.3481, pruned_loss=0.1127, over 21797.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3391, pruned_loss=0.1003, over 4280454.17 frames. ], batch size: 112, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:27:13,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446454.0, ans=0.1 2023-06-20 00:28:25,119 INFO [train.py:996] (2/4) Epoch 3, batch 13450, loss[loss=0.2959, simple_loss=0.3681, pruned_loss=0.1118, over 21493.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3414, pruned_loss=0.1031, over 4276521.04 frames. ], batch size: 131, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:29:03,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=446694.0, ans=0.125 2023-06-20 00:29:07,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=446694.0, ans=0.125 2023-06-20 00:29:08,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=12.0 2023-06-20 00:29:14,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=446754.0, ans=0.125 2023-06-20 00:29:36,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=446814.0, ans=0.0 2023-06-20 00:29:59,001 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.743e+02 3.193e+02 4.040e+02 8.828e+02, threshold=6.385e+02, percent-clipped=3.0 2023-06-20 00:30:21,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=446874.0, ans=0.125 2023-06-20 00:30:24,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=446874.0, ans=0.0 2023-06-20 00:30:28,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0 2023-06-20 00:30:34,589 INFO [train.py:996] (2/4) Epoch 3, batch 13500, loss[loss=0.2889, simple_loss=0.3566, pruned_loss=0.1106, over 21347.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.328, pruned_loss=0.09824, over 4256460.40 frames. ], batch size: 549, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:30:51,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-20 00:32:31,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.61 vs. 
limit=10.0 2023-06-20 00:32:46,182 INFO [train.py:996] (2/4) Epoch 3, batch 13550, loss[loss=0.3205, simple_loss=0.4056, pruned_loss=0.1177, over 21720.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3335, pruned_loss=0.09874, over 4260190.26 frames. ], batch size: 414, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:33:31,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=447294.0, ans=0.1 2023-06-20 00:33:37,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=447294.0, ans=0.0 2023-06-20 00:33:42,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=447354.0, ans=0.125 2023-06-20 00:34:23,613 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.814e+02 3.268e+02 3.929e+02 6.587e+02, threshold=6.537e+02, percent-clipped=1.0 2023-06-20 00:34:54,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-20 00:34:59,315 INFO [train.py:996] (2/4) Epoch 3, batch 13600, loss[loss=0.2725, simple_loss=0.3287, pruned_loss=0.1081, over 21594.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3337, pruned_loss=0.09904, over 4268383.05 frames. ], batch size: 548, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:36:49,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-20 00:36:55,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=447834.0, ans=0.0 2023-06-20 00:36:56,425 INFO [train.py:996] (2/4) Epoch 3, batch 13650, loss[loss=0.2289, simple_loss=0.296, pruned_loss=0.08086, over 21646.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3285, pruned_loss=0.09474, over 4270793.88 frames. ], batch size: 332, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:37:01,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=447834.0, ans=0.125 2023-06-20 00:37:12,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=447834.0, ans=0.0 2023-06-20 00:37:47,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=447954.0, ans=0.2 2023-06-20 00:38:11,444 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.579e+02 2.975e+02 3.634e+02 4.818e+02, threshold=5.950e+02, percent-clipped=0.0 2023-06-20 00:38:11,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=448014.0, ans=0.125 2023-06-20 00:38:42,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448134.0, ans=0.1 2023-06-20 00:38:43,055 INFO [train.py:996] (2/4) Epoch 3, batch 13700, loss[loss=0.2736, simple_loss=0.3318, pruned_loss=0.1077, over 20735.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3235, pruned_loss=0.09486, over 4263991.19 frames. 
], batch size: 607, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:38:44,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=448134.0, ans=0.1 2023-06-20 00:40:07,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=448314.0, ans=0.0 2023-06-20 00:40:24,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=448314.0, ans=0.0 2023-06-20 00:40:54,362 INFO [train.py:996] (2/4) Epoch 3, batch 13750, loss[loss=0.2618, simple_loss=0.3327, pruned_loss=0.09548, over 21621.00 frames. ], tot_loss[loss=0.253, simple_loss=0.32, pruned_loss=0.09299, over 4260021.59 frames. ], batch size: 389, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:41:24,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=448494.0, ans=0.0 2023-06-20 00:42:13,199 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-20 00:42:18,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=448614.0, ans=0.125 2023-06-20 00:42:35,428 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.755e+02 3.135e+02 3.702e+02 5.925e+02, threshold=6.269e+02, percent-clipped=0.0 2023-06-20 00:42:59,110 INFO [train.py:996] (2/4) Epoch 3, batch 13800, loss[loss=0.2816, simple_loss=0.3823, pruned_loss=0.09048, over 21846.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3248, pruned_loss=0.09167, over 4253242.37 frames. ], batch size: 371, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:43:01,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0 2023-06-20 00:43:17,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=448734.0, ans=0.125 2023-06-20 00:43:25,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=448794.0, ans=0.0 2023-06-20 00:43:34,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=448854.0, ans=0.2 2023-06-20 00:44:19,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.95 vs. limit=10.0 2023-06-20 00:45:01,419 INFO [train.py:996] (2/4) Epoch 3, batch 13850, loss[loss=0.3388, simple_loss=0.4034, pruned_loss=0.1371, over 21726.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.332, pruned_loss=0.09328, over 4261483.27 frames. 
], batch size: 441, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:45:03,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=449034.0, ans=0.0 2023-06-20 00:45:07,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=449034.0, ans=0.0 2023-06-20 00:46:01,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=449154.0, ans=0.2 2023-06-20 00:46:03,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=449154.0, ans=0.0 2023-06-20 00:46:19,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=449154.0, ans=0.125 2023-06-20 00:46:39,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=449214.0, ans=0.125 2023-06-20 00:46:40,937 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.873e+02 3.393e+02 4.011e+02 6.669e+02, threshold=6.786e+02, percent-clipped=1.0 2023-06-20 00:47:04,960 INFO [train.py:996] (2/4) Epoch 3, batch 13900, loss[loss=0.2814, simple_loss=0.3681, pruned_loss=0.09733, over 20793.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3369, pruned_loss=0.09776, over 4266261.76 frames. ], batch size: 608, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:47:09,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=449334.0, ans=0.2 2023-06-20 00:47:30,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=449394.0, ans=0.05 2023-06-20 00:48:27,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=449514.0, ans=0.125 2023-06-20 00:48:56,885 INFO [train.py:996] (2/4) Epoch 3, batch 13950, loss[loss=0.2808, simple_loss=0.3755, pruned_loss=0.09305, over 20818.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3383, pruned_loss=0.1006, over 4277921.49 frames. ], batch size: 608, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:48:58,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449634.0, ans=0.1 2023-06-20 00:49:11,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=449634.0, ans=0.0 2023-06-20 00:50:21,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.94 vs. limit=12.0 2023-06-20 00:50:26,974 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.72 vs. 
limit=15.0 2023-06-20 00:50:39,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.609e+02 3.095e+02 3.648e+02 5.597e+02, threshold=6.190e+02, percent-clipped=0.0 2023-06-20 00:50:42,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449874.0, ans=0.1 2023-06-20 00:50:53,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449874.0, ans=0.1 2023-06-20 00:51:07,768 INFO [train.py:996] (2/4) Epoch 3, batch 14000, loss[loss=0.2175, simple_loss=0.2932, pruned_loss=0.07092, over 21815.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3361, pruned_loss=0.09804, over 4276753.21 frames. ], batch size: 351, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:51:33,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=449994.0, ans=0.05 2023-06-20 00:51:34,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=449994.0, ans=0.125 2023-06-20 00:52:52,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-06-20 00:53:00,097 INFO [train.py:996] (2/4) Epoch 3, batch 14050, loss[loss=0.227, simple_loss=0.2861, pruned_loss=0.08396, over 21614.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3286, pruned_loss=0.09256, over 4267253.77 frames. ], batch size: 263, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:53:55,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450354.0, ans=0.1 2023-06-20 00:54:08,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=450354.0, ans=0.125 2023-06-20 00:54:41,191 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 2.482e+02 3.063e+02 3.822e+02 8.036e+02, threshold=6.126e+02, percent-clipped=3.0 2023-06-20 00:55:01,138 INFO [train.py:996] (2/4) Epoch 3, batch 14100, loss[loss=0.2347, simple_loss=0.289, pruned_loss=0.09025, over 21664.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3237, pruned_loss=0.09237, over 4268105.49 frames. ], batch size: 282, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:55:43,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=450594.0, ans=0.0 2023-06-20 00:55:48,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=450654.0, ans=0.0 2023-06-20 00:56:00,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-20 00:56:21,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-20 00:56:27,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450714.0, ans=0.1 2023-06-20 00:56:50,637 INFO [train.py:996] (2/4) Epoch 3, batch 14150, loss[loss=0.2418, simple_loss=0.3231, pruned_loss=0.08027, over 21908.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3266, pruned_loss=0.09388, over 4254124.04 frames. 
], batch size: 107, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:56:58,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=450834.0, ans=0.1 2023-06-20 00:57:05,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=450894.0, ans=0.0 2023-06-20 00:57:58,480 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.371e+02 2.924e+02 3.889e+02 6.817e+02, threshold=5.848e+02, percent-clipped=2.0 2023-06-20 00:58:20,971 INFO [train.py:996] (2/4) Epoch 3, batch 14200, loss[loss=0.2398, simple_loss=0.3073, pruned_loss=0.08612, over 21462.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3236, pruned_loss=0.0916, over 4258342.36 frames. ], batch size: 131, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:58:24,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=451134.0, ans=0.0 2023-06-20 00:58:26,571 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.90 vs. limit=5.0 2023-06-20 00:59:46,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=451374.0, ans=0.0 2023-06-20 00:59:49,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.38 vs. limit=22.5 2023-06-20 00:59:50,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=451374.0, ans=0.125 2023-06-20 01:00:03,147 INFO [train.py:996] (2/4) Epoch 3, batch 14250, loss[loss=0.2212, simple_loss=0.3134, pruned_loss=0.06451, over 21496.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3195, pruned_loss=0.09236, over 4266678.92 frames. ], batch size: 211, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:00:10,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.98 vs. limit=10.0 2023-06-20 01:00:39,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.14 vs. limit=12.0 2023-06-20 01:01:33,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=451614.0, ans=0.125 2023-06-20 01:01:36,131 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.661e+02 3.144e+02 3.972e+02 7.330e+02, threshold=6.288e+02, percent-clipped=4.0 2023-06-20 01:01:36,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=451674.0, ans=0.0 2023-06-20 01:02:01,271 INFO [train.py:996] (2/4) Epoch 3, batch 14300, loss[loss=0.2437, simple_loss=0.2936, pruned_loss=0.09686, over 21549.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3202, pruned_loss=0.0915, over 4260183.92 frames. 
], batch size: 247, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:02:48,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=451794.0, ans=0.95 2023-06-20 01:03:00,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451854.0, ans=0.1 2023-06-20 01:03:20,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=451914.0, ans=0.0 2023-06-20 01:03:32,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=451914.0, ans=0.0 2023-06-20 01:03:37,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.76 vs. limit=6.0 2023-06-20 01:03:57,043 INFO [train.py:996] (2/4) Epoch 3, batch 14350, loss[loss=0.2386, simple_loss=0.315, pruned_loss=0.0811, over 19976.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3239, pruned_loss=0.09211, over 4240960.18 frames. ], batch size: 703, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:05:22,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=452214.0, ans=0.125 2023-06-20 01:05:27,937 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.766e+02 3.278e+02 4.344e+02 6.485e+02, threshold=6.555e+02, percent-clipped=1.0 2023-06-20 01:05:49,277 INFO [train.py:996] (2/4) Epoch 3, batch 14400, loss[loss=0.2647, simple_loss=0.3111, pruned_loss=0.1092, over 21612.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3234, pruned_loss=0.09314, over 4252504.00 frames. ], batch size: 414, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 01:06:11,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=15.0 2023-06-20 01:06:52,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=452454.0, ans=0.125 2023-06-20 01:07:15,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.03 vs. limit=22.5 2023-06-20 01:07:36,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=452574.0, ans=0.0 2023-06-20 01:07:48,687 INFO [train.py:996] (2/4) Epoch 3, batch 14450, loss[loss=0.2474, simple_loss=0.3053, pruned_loss=0.0948, over 21764.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3174, pruned_loss=0.09298, over 4254638.02 frames. ], batch size: 351, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 01:07:55,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=452634.0, ans=0.2 2023-06-20 01:09:03,001 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.591e+02 2.982e+02 3.373e+02 5.553e+02, threshold=5.963e+02, percent-clipped=0.0 2023-06-20 01:09:29,464 INFO [train.py:996] (2/4) Epoch 3, batch 14500, loss[loss=0.2564, simple_loss=0.3441, pruned_loss=0.08439, over 19802.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3159, pruned_loss=0.09283, over 4258610.72 frames. 
], batch size: 702, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 01:09:42,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=452934.0, ans=0.0 2023-06-20 01:10:44,685 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-20 01:11:21,638 INFO [train.py:996] (2/4) Epoch 3, batch 14550, loss[loss=0.3004, simple_loss=0.3623, pruned_loss=0.1193, over 21571.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3215, pruned_loss=0.09547, over 4267472.98 frames. ], batch size: 389, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:11:24,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=453234.0, ans=0.2 2023-06-20 01:11:44,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=453234.0, ans=0.0 2023-06-20 01:12:05,219 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. limit=10.0 2023-06-20 01:13:07,562 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 2.890e+02 3.229e+02 4.188e+02 6.870e+02, threshold=6.458e+02, percent-clipped=5.0 2023-06-20 01:13:39,671 INFO [train.py:996] (2/4) Epoch 3, batch 14600, loss[loss=0.2646, simple_loss=0.3462, pruned_loss=0.09153, over 21409.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3304, pruned_loss=0.09964, over 4268666.63 frames. ], batch size: 211, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:14:23,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.42 vs. limit=15.0 2023-06-20 01:14:44,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=453654.0, ans=0.125 2023-06-20 01:14:53,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=453714.0, ans=0.0 2023-06-20 01:14:56,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=453714.0, ans=0.2 2023-06-20 01:15:27,605 INFO [train.py:996] (2/4) Epoch 3, batch 14650, loss[loss=0.1789, simple_loss=0.2584, pruned_loss=0.0497, over 21388.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3301, pruned_loss=0.09784, over 4272252.09 frames. ], batch size: 211, lr: 1.09e-02, grad_scale: 16.0 2023-06-20 01:15:57,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=453894.0, ans=0.1 2023-06-20 01:16:43,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-20 01:16:47,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=453954.0, ans=0.0 2023-06-20 01:17:12,668 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 2.350e+02 2.833e+02 3.404e+02 5.520e+02, threshold=5.666e+02, percent-clipped=0.0 2023-06-20 01:17:27,823 INFO [train.py:996] (2/4) Epoch 3, batch 14700, loss[loss=0.2163, simple_loss=0.2996, pruned_loss=0.06647, over 21499.00 frames. 
], tot_loss[loss=0.256, simple_loss=0.3262, pruned_loss=0.09286, over 4264853.48 frames. ], batch size: 212, lr: 1.09e-02, grad_scale: 16.0 2023-06-20 01:17:52,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=454194.0, ans=0.125 2023-06-20 01:18:49,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=454314.0, ans=0.1 2023-06-20 01:19:29,826 INFO [train.py:996] (2/4) Epoch 3, batch 14750, loss[loss=0.2839, simple_loss=0.3459, pruned_loss=0.111, over 20748.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3307, pruned_loss=0.09556, over 4266448.62 frames. ], batch size: 608, lr: 1.09e-02, grad_scale: 16.0 2023-06-20 01:20:22,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-06-20 01:21:21,289 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.805e+02 3.277e+02 4.115e+02 8.120e+02, threshold=6.554e+02, percent-clipped=5.0 2023-06-20 01:21:52,218 INFO [train.py:996] (2/4) Epoch 3, batch 14800, loss[loss=0.2787, simple_loss=0.3456, pruned_loss=0.1059, over 20646.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3413, pruned_loss=0.1011, over 4265960.39 frames. ], batch size: 607, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:22:24,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=454794.0, ans=0.125 2023-06-20 01:22:26,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.55 vs. limit=15.0 2023-06-20 01:22:50,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=454854.0, ans=0.125 2023-06-20 01:23:12,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-20 01:23:48,738 INFO [train.py:996] (2/4) Epoch 3, batch 14850, loss[loss=0.2335, simple_loss=0.2928, pruned_loss=0.08707, over 21204.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3354, pruned_loss=0.1004, over 4264341.54 frames. ], batch size: 176, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:24:11,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=455094.0, ans=0.1 2023-06-20 01:24:22,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=455094.0, ans=0.2 2023-06-20 01:25:18,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=22.5 2023-06-20 01:25:33,480 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 2.933e+02 3.352e+02 4.211e+02 7.293e+02, threshold=6.704e+02, percent-clipped=2.0 2023-06-20 01:26:01,556 INFO [train.py:996] (2/4) Epoch 3, batch 14900, loss[loss=0.2872, simple_loss=0.3454, pruned_loss=0.1145, over 21291.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3376, pruned_loss=0.1009, over 4260862.07 frames. 
], batch size: 159, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:28:13,439 INFO [train.py:996] (2/4) Epoch 3, batch 14950, loss[loss=0.2343, simple_loss=0.289, pruned_loss=0.08978, over 21854.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3384, pruned_loss=0.1002, over 4270007.64 frames. ], batch size: 98, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:29:29,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-20 01:29:32,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=455814.0, ans=0.125 2023-06-20 01:29:34,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-20 01:29:42,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=455814.0, ans=0.125 2023-06-20 01:29:52,500 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.902e+02 3.334e+02 3.960e+02 6.522e+02, threshold=6.669e+02, percent-clipped=0.0 2023-06-20 01:30:10,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=455874.0, ans=0.2 2023-06-20 01:30:13,465 INFO [train.py:996] (2/4) Epoch 3, batch 15000, loss[loss=0.233, simple_loss=0.3084, pruned_loss=0.07877, over 21774.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.341, pruned_loss=0.1021, over 4262926.80 frames. ], batch size: 102, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:30:13,465 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 01:31:04,547 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2678, simple_loss=0.368, pruned_loss=0.08383, over 1796401.00 frames. 2023-06-20 01:31:04,548 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 01:31:06,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=455934.0, ans=0.125 2023-06-20 01:31:54,480 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:32:04,696 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:32:21,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-20 01:32:42,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=456174.0, ans=0.125 2023-06-20 01:32:46,002 INFO [train.py:996] (2/4) Epoch 3, batch 15050, loss[loss=0.2356, simple_loss=0.2907, pruned_loss=0.09031, over 21899.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3394, pruned_loss=0.1025, over 4241140.79 frames. ], batch size: 107, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:32:50,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=456234.0, ans=0.125 2023-06-20 01:33:13,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.44 vs. 
limit=22.5 2023-06-20 01:33:32,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=456354.0, ans=0.1 2023-06-20 01:34:05,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=456414.0, ans=0.125 2023-06-20 01:34:12,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.015e+02 3.596e+02 4.330e+02 8.348e+02, threshold=7.192e+02, percent-clipped=9.0 2023-06-20 01:34:48,995 INFO [train.py:996] (2/4) Epoch 3, batch 15100, loss[loss=0.2926, simple_loss=0.3647, pruned_loss=0.1103, over 21340.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3421, pruned_loss=0.1022, over 4254192.67 frames. ], batch size: 548, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:36:25,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=456774.0, ans=0.1 2023-06-20 01:36:34,689 INFO [train.py:996] (2/4) Epoch 3, batch 15150, loss[loss=0.2378, simple_loss=0.2956, pruned_loss=0.08995, over 21770.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3394, pruned_loss=0.1022, over 4253366.65 frames. ], batch size: 317, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:36:49,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456834.0, ans=0.1 2023-06-20 01:37:28,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-20 01:37:50,822 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.762e+02 3.255e+02 4.111e+02 7.201e+02, threshold=6.510e+02, percent-clipped=1.0 2023-06-20 01:38:05,366 INFO [train.py:996] (2/4) Epoch 3, batch 15200, loss[loss=0.2831, simple_loss=0.3531, pruned_loss=0.1065, over 20127.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3309, pruned_loss=0.09878, over 4255499.73 frames. ], batch size: 702, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:39:02,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-20 01:39:09,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-20 01:39:10,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=457254.0, ans=0.125 2023-06-20 01:40:23,928 INFO [train.py:996] (2/4) Epoch 3, batch 15250, loss[loss=0.2378, simple_loss=0.2957, pruned_loss=0.08996, over 21646.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.325, pruned_loss=0.097, over 4258002.00 frames. ], batch size: 282, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:41:50,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.712e+02 3.294e+02 3.861e+02 5.748e+02, threshold=6.588e+02, percent-clipped=0.0 2023-06-20 01:42:17,464 INFO [train.py:996] (2/4) Epoch 3, batch 15300, loss[loss=0.2816, simple_loss=0.3468, pruned_loss=0.1081, over 21956.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3281, pruned_loss=0.09993, over 4269835.42 frames. 
], batch size: 372, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:42:43,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-20 01:42:47,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=457734.0, ans=0.0 2023-06-20 01:43:22,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=457854.0, ans=0.0 2023-06-20 01:44:21,161 INFO [train.py:996] (2/4) Epoch 3, batch 15350, loss[loss=0.2866, simple_loss=0.364, pruned_loss=0.1046, over 21821.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3322, pruned_loss=0.1031, over 4275745.60 frames. ], batch size: 118, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:44:42,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-20 01:45:12,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=458154.0, ans=0.02 2023-06-20 01:45:27,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=458214.0, ans=0.0 2023-06-20 01:45:31,298 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.699e+02 3.240e+02 4.013e+02 6.691e+02, threshold=6.480e+02, percent-clipped=1.0 2023-06-20 01:45:33,907 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2023-06-20 01:45:46,122 INFO [train.py:996] (2/4) Epoch 3, batch 15400, loss[loss=0.2573, simple_loss=0.3313, pruned_loss=0.09163, over 21920.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3329, pruned_loss=0.1008, over 4275897.37 frames. ], batch size: 107, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:46:19,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=458394.0, ans=0.125 2023-06-20 01:46:19,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=458394.0, ans=0.2 2023-06-20 01:46:37,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=458454.0, ans=0.0 2023-06-20 01:46:57,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=458514.0, ans=0.95 2023-06-20 01:47:09,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-20 01:47:33,370 INFO [train.py:996] (2/4) Epoch 3, batch 15450, loss[loss=0.2517, simple_loss=0.3096, pruned_loss=0.09697, over 21625.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3305, pruned_loss=0.1001, over 4286104.35 frames. 
], batch size: 548, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:48:35,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458754.0, ans=0.1 2023-06-20 01:48:54,286 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.616e+02 3.368e+02 4.050e+02 6.110e+02, threshold=6.736e+02, percent-clipped=0.0 2023-06-20 01:49:30,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=458874.0, ans=0.125 2023-06-20 01:49:32,917 INFO [train.py:996] (2/4) Epoch 3, batch 15500, loss[loss=0.2985, simple_loss=0.3576, pruned_loss=0.1197, over 21671.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3339, pruned_loss=0.09878, over 4278481.83 frames. ], batch size: 351, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:49:46,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=458934.0, ans=0.0 2023-06-20 01:51:09,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=459114.0, ans=0.07 2023-06-20 01:51:48,341 INFO [train.py:996] (2/4) Epoch 3, batch 15550, loss[loss=0.2055, simple_loss=0.2678, pruned_loss=0.07162, over 21887.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.332, pruned_loss=0.09581, over 4275306.89 frames. ], batch size: 98, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:51:51,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=459234.0, ans=0.125 2023-06-20 01:52:23,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=459354.0, ans=0.125 2023-06-20 01:52:29,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=459354.0, ans=0.125 2023-06-20 01:52:53,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=459414.0, ans=0.0 2023-06-20 01:53:01,963 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:53:14,714 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.749e+02 2.425e+02 2.682e+02 3.354e+02 5.431e+02, threshold=5.364e+02, percent-clipped=0.0 2023-06-20 01:53:15,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=459474.0, ans=0.0 2023-06-20 01:53:32,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=459474.0, ans=0.125 2023-06-20 01:53:35,284 INFO [train.py:996] (2/4) Epoch 3, batch 15600, loss[loss=0.2428, simple_loss=0.2966, pruned_loss=0.09445, over 21270.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3242, pruned_loss=0.09364, over 4279243.62 frames. 
], batch size: 160, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:53:37,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=459534.0, ans=0.125 2023-06-20 01:53:38,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=459534.0, ans=0.0 2023-06-20 01:54:05,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=459594.0, ans=0.125 2023-06-20 01:54:05,101 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:54:05,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=459594.0, ans=0.0 2023-06-20 01:54:29,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=459654.0, ans=0.125 2023-06-20 01:54:31,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=459654.0, ans=0.125 2023-06-20 01:54:36,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-20 01:54:55,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=459714.0, ans=15.0 2023-06-20 01:55:34,235 INFO [train.py:996] (2/4) Epoch 3, batch 15650, loss[loss=0.2293, simple_loss=0.2861, pruned_loss=0.08627, over 21608.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3233, pruned_loss=0.09337, over 4279646.07 frames. ], batch size: 247, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:55:54,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=459894.0, ans=0.0 2023-06-20 01:56:03,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=459894.0, ans=0.125 2023-06-20 01:56:11,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2023-06-20 01:56:11,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=459954.0, ans=0.95 2023-06-20 01:56:51,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=460014.0, ans=0.1 2023-06-20 01:56:56,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=460074.0, ans=0.2 2023-06-20 01:56:57,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.476e+02 2.884e+02 3.400e+02 5.919e+02, threshold=5.768e+02, percent-clipped=1.0 2023-06-20 01:57:17,714 INFO [train.py:996] (2/4) Epoch 3, batch 15700, loss[loss=0.2848, simple_loss=0.3336, pruned_loss=0.118, over 21280.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.319, pruned_loss=0.09176, over 4272849.28 frames. ], batch size: 471, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:57:20,075 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. 
limit=10.0 2023-06-20 01:58:12,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=460254.0, ans=0.125 2023-06-20 01:58:15,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=460254.0, ans=0.125 2023-06-20 01:59:18,032 INFO [train.py:996] (2/4) Epoch 3, batch 15750, loss[loss=0.2524, simple_loss=0.3092, pruned_loss=0.09773, over 21565.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3147, pruned_loss=0.09178, over 4266929.02 frames. ], batch size: 414, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:59:57,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=460494.0, ans=0.0 2023-06-20 02:00:54,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.350e+02 2.605e+02 2.992e+02 4.357e+02, threshold=5.211e+02, percent-clipped=0.0 2023-06-20 02:01:21,482 INFO [train.py:996] (2/4) Epoch 3, batch 15800, loss[loss=0.2527, simple_loss=0.2993, pruned_loss=0.103, over 21518.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3111, pruned_loss=0.09205, over 4271951.22 frames. ], batch size: 442, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:01:22,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=460734.0, ans=0.125 2023-06-20 02:02:04,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=460854.0, ans=0.09899494936611666 2023-06-20 02:02:35,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=460914.0, ans=0.0 2023-06-20 02:03:13,428 INFO [train.py:996] (2/4) Epoch 3, batch 15850, loss[loss=0.2814, simple_loss=0.3472, pruned_loss=0.1077, over 21263.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3138, pruned_loss=0.09486, over 4275418.30 frames. ], batch size: 159, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:03:48,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=461154.0, ans=0.0 2023-06-20 02:03:51,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=461154.0, ans=0.125 2023-06-20 02:04:33,405 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.753e+02 3.333e+02 4.254e+02 6.443e+02, threshold=6.666e+02, percent-clipped=4.0 2023-06-20 02:04:49,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=461274.0, ans=0.125 2023-06-20 02:04:57,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=461334.0, ans=0.1 2023-06-20 02:04:59,059 INFO [train.py:996] (2/4) Epoch 3, batch 15900, loss[loss=0.2623, simple_loss=0.3083, pruned_loss=0.1082, over 20168.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.311, pruned_loss=0.09534, over 4274800.01 frames. ], batch size: 707, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:05:27,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=461454.0, ans=0.0 2023-06-20 02:06:45,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.77 vs. 
limit=22.5 2023-06-20 02:06:52,748 INFO [train.py:996] (2/4) Epoch 3, batch 15950, loss[loss=0.1825, simple_loss=0.2824, pruned_loss=0.0413, over 21772.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3131, pruned_loss=0.09268, over 4267935.80 frames. ], batch size: 332, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:07:21,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=461694.0, ans=0.125 2023-06-20 02:07:30,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=461694.0, ans=0.035 2023-06-20 02:08:20,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=461814.0, ans=0.125 2023-06-20 02:08:30,945 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.370e+02 2.785e+02 3.587e+02 6.147e+02, threshold=5.571e+02, percent-clipped=0.0 2023-06-20 02:08:32,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461874.0, ans=0.1 2023-06-20 02:08:47,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=461874.0, ans=0.0 2023-06-20 02:08:51,762 INFO [train.py:996] (2/4) Epoch 3, batch 16000, loss[loss=0.2415, simple_loss=0.3314, pruned_loss=0.07578, over 21676.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3142, pruned_loss=0.0914, over 4268641.85 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:09:05,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=461934.0, ans=0.125 2023-06-20 02:09:12,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=461994.0, ans=0.2 2023-06-20 02:09:17,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=461994.0, ans=0.125 2023-06-20 02:10:13,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-20 02:11:01,270 INFO [train.py:996] (2/4) Epoch 3, batch 16050, loss[loss=0.2038, simple_loss=0.2756, pruned_loss=0.06598, over 21410.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3197, pruned_loss=0.08949, over 4261762.06 frames. 
], batch size: 176, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:11:17,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=462234.0, ans=0.0 2023-06-20 02:11:36,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=462354.0, ans=0.125 2023-06-20 02:12:28,229 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.489e+02 3.002e+02 4.044e+02 8.008e+02, threshold=6.003e+02, percent-clipped=4.0 2023-06-20 02:12:49,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=462534.0, ans=0.125 2023-06-20 02:12:49,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=462534.0, ans=0.125 2023-06-20 02:12:50,216 INFO [train.py:996] (2/4) Epoch 3, batch 16100, loss[loss=0.2823, simple_loss=0.3405, pruned_loss=0.1121, over 21782.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3234, pruned_loss=0.09027, over 4266400.31 frames. ], batch size: 112, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:13:25,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=462594.0, ans=0.1 2023-06-20 02:14:22,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=462774.0, ans=0.07 2023-06-20 02:14:23,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=462774.0, ans=0.0 2023-06-20 02:14:49,968 INFO [train.py:996] (2/4) Epoch 3, batch 16150, loss[loss=0.2712, simple_loss=0.3372, pruned_loss=0.1026, over 21761.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3242, pruned_loss=0.09227, over 4280862.07 frames. ], batch size: 389, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:14:50,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=462834.0, ans=0.125 2023-06-20 02:15:03,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=462834.0, ans=0.125 2023-06-20 02:15:06,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=462834.0, ans=0.125 2023-06-20 02:15:46,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2023-06-20 02:16:16,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=463074.0, ans=0.125 2023-06-20 02:16:17,573 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.540e+02 2.925e+02 3.494e+02 4.930e+02, threshold=5.850e+02, percent-clipped=0.0 2023-06-20 02:16:18,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.05 vs. limit=15.0 2023-06-20 02:16:48,932 INFO [train.py:996] (2/4) Epoch 3, batch 16200, loss[loss=0.2352, simple_loss=0.332, pruned_loss=0.06922, over 21682.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3264, pruned_loss=0.09318, over 4287208.18 frames. 
], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:17:06,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=463134.0, ans=0.125 2023-06-20 02:17:20,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463194.0, ans=0.1 2023-06-20 02:17:31,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.51 vs. limit=6.0 2023-06-20 02:18:02,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463314.0, ans=0.1 2023-06-20 02:18:19,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=463374.0, ans=0.125 2023-06-20 02:18:31,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=463374.0, ans=0.0 2023-06-20 02:18:37,291 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:18:39,601 INFO [train.py:996] (2/4) Epoch 3, batch 16250, loss[loss=0.2239, simple_loss=0.3043, pruned_loss=0.07176, over 21761.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3265, pruned_loss=0.09387, over 4276209.13 frames. ], batch size: 371, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:18:53,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=463434.0, ans=0.125 2023-06-20 02:19:43,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=463614.0, ans=0.125 2023-06-20 02:20:10,411 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.534e+02 3.093e+02 3.878e+02 6.087e+02, threshold=6.186e+02, percent-clipped=1.0 2023-06-20 02:20:23,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=463674.0, ans=0.125 2023-06-20 02:20:30,344 INFO [train.py:996] (2/4) Epoch 3, batch 16300, loss[loss=0.2039, simple_loss=0.2782, pruned_loss=0.06477, over 21601.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3199, pruned_loss=0.08978, over 4271227.52 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:20:54,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=463794.0, ans=0.125 2023-06-20 02:21:06,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=463854.0, ans=0.125 2023-06-20 02:21:38,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=463914.0, ans=0.0 2023-06-20 02:22:06,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=463974.0, ans=0.125 2023-06-20 02:22:25,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=463974.0, ans=0.125 2023-06-20 02:22:27,464 INFO [train.py:996] (2/4) Epoch 3, batch 16350, loss[loss=0.3712, simple_loss=0.4082, pruned_loss=0.1671, over 21425.00 frames. 
], tot_loss[loss=0.2493, simple_loss=0.319, pruned_loss=0.08985, over 4269224.01 frames. ], batch size: 510, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:23:06,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=464094.0, ans=0.125 2023-06-20 02:23:34,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=464154.0, ans=0.125 2023-06-20 02:23:50,609 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.65 vs. limit=22.5 2023-06-20 02:24:13,371 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.570e+02 3.121e+02 3.955e+02 6.816e+02, threshold=6.242e+02, percent-clipped=2.0 2023-06-20 02:24:29,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.99 vs. limit=10.0 2023-06-20 02:24:32,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0 2023-06-20 02:24:40,016 INFO [train.py:996] (2/4) Epoch 3, batch 16400, loss[loss=0.2687, simple_loss=0.3284, pruned_loss=0.1045, over 21862.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3232, pruned_loss=0.09196, over 4278043.01 frames. ], batch size: 371, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:24:55,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=464394.0, ans=0.1 2023-06-20 02:25:47,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=464454.0, ans=0.0 2023-06-20 02:26:46,837 INFO [train.py:996] (2/4) Epoch 3, batch 16450, loss[loss=0.2571, simple_loss=0.3242, pruned_loss=0.09499, over 21852.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3236, pruned_loss=0.09331, over 4279951.96 frames. ], batch size: 107, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:27:32,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464754.0, ans=0.1 2023-06-20 02:27:35,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=464754.0, ans=0.125 2023-06-20 02:27:48,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=464754.0, ans=0.0 2023-06-20 02:28:14,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.34 vs. limit=12.0 2023-06-20 02:28:32,307 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.737e+02 3.058e+02 3.975e+02 7.376e+02, threshold=6.116e+02, percent-clipped=5.0 2023-06-20 02:28:52,849 INFO [train.py:996] (2/4) Epoch 3, batch 16500, loss[loss=0.3258, simple_loss=0.3845, pruned_loss=0.1336, over 21510.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3222, pruned_loss=0.09373, over 4280936.48 frames. 
], batch size: 508, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:29:21,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=464994.0, ans=0.125 2023-06-20 02:30:35,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=465174.0, ans=0.0 2023-06-20 02:30:51,346 INFO [train.py:996] (2/4) Epoch 3, batch 16550, loss[loss=0.2611, simple_loss=0.3259, pruned_loss=0.09815, over 21473.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3187, pruned_loss=0.0905, over 4283603.94 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 64.0 2023-06-20 02:31:30,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=465294.0, ans=0.0 2023-06-20 02:32:02,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=465354.0, ans=0.125 2023-06-20 02:32:38,357 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.553e+02 2.978e+02 3.771e+02 6.936e+02, threshold=5.956e+02, percent-clipped=3.0 2023-06-20 02:33:16,485 INFO [train.py:996] (2/4) Epoch 3, batch 16600, loss[loss=0.3046, simple_loss=0.3937, pruned_loss=0.1078, over 21839.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3284, pruned_loss=0.09438, over 4281979.53 frames. ], batch size: 371, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:34:06,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=465654.0, ans=0.0 2023-06-20 02:35:24,720 INFO [train.py:996] (2/4) Epoch 3, batch 16650, loss[loss=0.2804, simple_loss=0.3769, pruned_loss=0.09194, over 20758.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3405, pruned_loss=0.09757, over 4277463.47 frames. ], batch size: 607, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:36:02,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-20 02:36:11,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=465954.0, ans=0.05 2023-06-20 02:37:10,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=466014.0, ans=0.1 2023-06-20 02:37:14,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 2.861e+02 3.349e+02 3.962e+02 7.489e+02, threshold=6.698e+02, percent-clipped=1.0 2023-06-20 02:37:32,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=466134.0, ans=0.125 2023-06-20 02:37:33,542 INFO [train.py:996] (2/4) Epoch 3, batch 16700, loss[loss=0.2307, simple_loss=0.2934, pruned_loss=0.08401, over 21724.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3416, pruned_loss=0.09925, over 4280267.58 frames. 
], batch size: 247, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:37:34,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=466134.0, ans=0.125 2023-06-20 02:37:35,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=466134.0, ans=0.125 2023-06-20 02:37:41,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-20 02:37:52,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-06-20 02:38:18,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=466194.0, ans=0.125 2023-06-20 02:38:51,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=466254.0, ans=0.125 2023-06-20 02:38:54,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=466254.0, ans=0.2 2023-06-20 02:39:08,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=466314.0, ans=0.125 2023-06-20 02:39:23,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=466374.0, ans=0.0 2023-06-20 02:39:35,628 INFO [train.py:996] (2/4) Epoch 3, batch 16750, loss[loss=0.2805, simple_loss=0.3495, pruned_loss=0.1058, over 20721.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3436, pruned_loss=0.1015, over 4279291.65 frames. ], batch size: 607, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:40:47,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466554.0, ans=0.1 2023-06-20 02:41:14,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=466554.0, ans=0.125 2023-06-20 02:41:36,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=466614.0, ans=0.0 2023-06-20 02:41:41,629 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:41:44,390 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.879e+02 3.240e+02 3.697e+02 5.648e+02, threshold=6.479e+02, percent-clipped=0.0 2023-06-20 02:41:50,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=466674.0, ans=0.125 2023-06-20 02:41:57,468 INFO [train.py:996] (2/4) Epoch 3, batch 16800, loss[loss=0.3015, simple_loss=0.364, pruned_loss=0.1195, over 21782.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3486, pruned_loss=0.1013, over 4275133.10 frames. 
], batch size: 441, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:42:02,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=466734.0, ans=0.125 2023-06-20 02:43:39,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=466974.0, ans=0.125 2023-06-20 02:43:42,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=466974.0, ans=0.125 2023-06-20 02:44:01,336 INFO [train.py:996] (2/4) Epoch 3, batch 16850, loss[loss=0.2522, simple_loss=0.3139, pruned_loss=0.09524, over 21242.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3445, pruned_loss=0.1011, over 4278582.30 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:44:29,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-06-20 02:44:39,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-20 02:45:03,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-06-20 02:45:22,550 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.543e+02 2.978e+02 3.751e+02 8.288e+02, threshold=5.956e+02, percent-clipped=4.0 2023-06-20 02:45:37,906 INFO [train.py:996] (2/4) Epoch 3, batch 16900, loss[loss=0.2295, simple_loss=0.2869, pruned_loss=0.08605, over 21191.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3382, pruned_loss=0.09922, over 4281776.43 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:46:14,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=467394.0, ans=0.125 2023-06-20 02:46:17,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=467454.0, ans=0.125 2023-06-20 02:46:17,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=467454.0, ans=0.125 2023-06-20 02:46:21,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=467454.0, ans=0.0 2023-06-20 02:46:38,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=467454.0, ans=0.1 2023-06-20 02:47:13,025 INFO [train.py:996] (2/4) Epoch 3, batch 16950, loss[loss=0.2502, simple_loss=0.3139, pruned_loss=0.09322, over 21943.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.331, pruned_loss=0.09762, over 4290585.59 frames. ], batch size: 316, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:47:32,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=467634.0, ans=0.125 2023-06-20 02:47:56,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. 
limit=15.0 2023-06-20 02:48:35,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=467754.0, ans=0.0 2023-06-20 02:48:40,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=467814.0, ans=0.0 2023-06-20 02:48:54,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.520e+02 2.812e+02 3.324e+02 6.057e+02, threshold=5.623e+02, percent-clipped=0.0 2023-06-20 02:49:07,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=467934.0, ans=0.07 2023-06-20 02:49:08,233 INFO [train.py:996] (2/4) Epoch 3, batch 17000, loss[loss=0.2228, simple_loss=0.2523, pruned_loss=0.0967, over 20042.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3268, pruned_loss=0.09758, over 4292566.62 frames. ], batch size: 704, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:50:01,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-06-20 02:50:54,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=22.5 2023-06-20 02:50:55,619 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:50:59,625 INFO [train.py:996] (2/4) Epoch 3, batch 17050, loss[loss=0.2647, simple_loss=0.3351, pruned_loss=0.09714, over 21423.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.334, pruned_loss=0.09948, over 4294693.73 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:51:00,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=468234.0, ans=0.125 2023-06-20 02:51:24,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=468294.0, ans=0.125 2023-06-20 02:51:27,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=468294.0, ans=0.125 2023-06-20 02:52:20,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=468474.0, ans=0.125 2023-06-20 02:52:22,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.663e+02 3.089e+02 3.979e+02 5.737e+02, threshold=6.177e+02, percent-clipped=2.0 2023-06-20 02:52:24,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=468474.0, ans=0.125 2023-06-20 02:52:29,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=468474.0, ans=0.125 2023-06-20 02:52:35,777 INFO [train.py:996] (2/4) Epoch 3, batch 17100, loss[loss=0.2619, simple_loss=0.3251, pruned_loss=0.09934, over 21952.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3325, pruned_loss=0.09989, over 4297454.83 frames. 
], batch size: 333, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:52:47,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=468534.0, ans=0.0 2023-06-20 02:53:34,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=468654.0, ans=0.2 2023-06-20 02:53:49,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=468714.0, ans=0.125 2023-06-20 02:53:50,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=468774.0, ans=0.125 2023-06-20 02:53:58,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=468774.0, ans=0.1 2023-06-20 02:54:11,171 INFO [train.py:996] (2/4) Epoch 3, batch 17150, loss[loss=0.244, simple_loss=0.3051, pruned_loss=0.09149, over 21161.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3284, pruned_loss=0.09968, over 4301616.63 frames. ], batch size: 608, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:55:17,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468954.0, ans=0.1 2023-06-20 02:55:18,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=468954.0, ans=0.125 2023-06-20 02:55:59,113 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.624e+02 2.921e+02 3.624e+02 5.783e+02, threshold=5.842e+02, percent-clipped=0.0 2023-06-20 02:56:24,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=469074.0, ans=0.125 2023-06-20 02:56:28,762 INFO [train.py:996] (2/4) Epoch 3, batch 17200, loss[loss=0.2727, simple_loss=0.3387, pruned_loss=0.1033, over 21734.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.327, pruned_loss=0.09918, over 4296386.34 frames. ], batch size: 332, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:56:35,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-20 02:56:38,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=469134.0, ans=0.125 2023-06-20 02:57:12,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469254.0, ans=0.125 2023-06-20 02:57:23,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=469254.0, ans=0.1 2023-06-20 02:57:43,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469314.0, ans=0.1 2023-06-20 02:58:07,489 INFO [train.py:996] (2/4) Epoch 3, batch 17250, loss[loss=0.3024, simple_loss=0.3765, pruned_loss=0.1141, over 21843.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3318, pruned_loss=0.1014, over 4293521.84 frames. 
], batch size: 124, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:58:17,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=469434.0, ans=0.125 2023-06-20 02:58:30,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-20 02:59:09,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=469614.0, ans=0.0 2023-06-20 02:59:22,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469614.0, ans=0.125 2023-06-20 02:59:25,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=469614.0, ans=0.125 2023-06-20 02:59:28,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-20 02:59:32,663 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.790e+02 3.289e+02 4.103e+02 7.188e+02, threshold=6.577e+02, percent-clipped=6.0 2023-06-20 02:59:46,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=469674.0, ans=0.2 2023-06-20 02:59:56,842 INFO [train.py:996] (2/4) Epoch 3, batch 17300, loss[loss=0.3059, simple_loss=0.3627, pruned_loss=0.1246, over 21387.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3407, pruned_loss=0.1047, over 4286658.66 frames. ], batch size: 549, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 03:00:10,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=469794.0, ans=0.125 2023-06-20 03:00:24,061 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-20 03:00:39,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=469854.0, ans=0.0 2023-06-20 03:00:50,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=469854.0, ans=0.2 2023-06-20 03:01:27,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=469974.0, ans=0.0 2023-06-20 03:01:33,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-20 03:01:37,009 INFO [train.py:996] (2/4) Epoch 3, batch 17350, loss[loss=0.3462, simple_loss=0.3989, pruned_loss=0.1467, over 21383.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3414, pruned_loss=0.1043, over 4285583.76 frames. ], batch size: 508, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 03:02:32,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=470154.0, ans=0.125 2023-06-20 03:03:04,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. 
limit=12.0 2023-06-20 03:03:09,775 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.914e+02 3.379e+02 4.234e+02 7.402e+02, threshold=6.758e+02, percent-clipped=2.0 2023-06-20 03:03:23,321 INFO [train.py:996] (2/4) Epoch 3, batch 17400, loss[loss=0.2199, simple_loss=0.2639, pruned_loss=0.08797, over 21139.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3359, pruned_loss=0.09983, over 4269674.28 frames. ], batch size: 143, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 03:03:26,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=470334.0, ans=10.0 2023-06-20 03:04:08,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=470394.0, ans=0.125 2023-06-20 03:04:54,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=470514.0, ans=0.0 2023-06-20 03:05:25,724 INFO [train.py:996] (2/4) Epoch 3, batch 17450, loss[loss=0.2228, simple_loss=0.317, pruned_loss=0.06433, over 21154.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3313, pruned_loss=0.09651, over 4275299.42 frames. ], batch size: 548, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 03:06:07,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=470754.0, ans=0.0 2023-06-20 03:06:47,775 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.738e+02 2.479e+02 3.095e+02 3.749e+02 7.284e+02, threshold=6.191e+02, percent-clipped=2.0 2023-06-20 03:07:04,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=470874.0, ans=15.0 2023-06-20 03:07:06,290 INFO [train.py:996] (2/4) Epoch 3, batch 17500, loss[loss=0.2536, simple_loss=0.316, pruned_loss=0.09564, over 21265.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3263, pruned_loss=0.09413, over 4278889.63 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 03:07:27,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=470994.0, ans=0.0 2023-06-20 03:07:40,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471054.0, ans=0.1 2023-06-20 03:08:12,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-20 03:08:15,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-20 03:08:28,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=471174.0, ans=0.125 2023-06-20 03:08:41,876 INFO [train.py:996] (2/4) Epoch 3, batch 17550, loss[loss=0.2333, simple_loss=0.3242, pruned_loss=0.07123, over 21367.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3271, pruned_loss=0.09231, over 4282635.91 frames. 
], batch size: 131, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:09:18,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=471354.0, ans=0.0 2023-06-20 03:09:44,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=471414.0, ans=0.2 2023-06-20 03:09:54,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 2.408e+02 2.804e+02 3.227e+02 5.442e+02, threshold=5.608e+02, percent-clipped=0.0 2023-06-20 03:10:14,339 INFO [train.py:996] (2/4) Epoch 3, batch 17600, loss[loss=0.3011, simple_loss=0.4119, pruned_loss=0.09516, over 19813.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3299, pruned_loss=0.09282, over 4275136.35 frames. ], batch size: 703, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:10:37,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=471534.0, ans=0.2 2023-06-20 03:11:14,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=471654.0, ans=0.125 2023-06-20 03:11:24,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-20 03:11:31,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=471714.0, ans=0.0 2023-06-20 03:11:57,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=471774.0, ans=0.125 2023-06-20 03:12:00,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=471834.0, ans=0.125 2023-06-20 03:12:01,249 INFO [train.py:996] (2/4) Epoch 3, batch 17650, loss[loss=0.1928, simple_loss=0.2665, pruned_loss=0.05958, over 21717.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3256, pruned_loss=0.09196, over 4269080.74 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:12:46,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471894.0, ans=0.1 2023-06-20 03:13:08,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=471954.0, ans=10.0 2023-06-20 03:13:35,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.510e+02 2.877e+02 3.404e+02 6.188e+02, threshold=5.753e+02, percent-clipped=2.0 2023-06-20 03:13:36,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=472074.0, ans=0.125 2023-06-20 03:13:44,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=472074.0, ans=0.125 2023-06-20 03:13:50,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=472074.0, ans=0.0 2023-06-20 03:13:54,786 INFO [train.py:996] (2/4) Epoch 3, batch 17700, loss[loss=0.2587, simple_loss=0.3261, pruned_loss=0.09562, over 21374.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3177, pruned_loss=0.08858, over 4260568.09 frames. 
], batch size: 143, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:14:18,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472134.0, ans=0.1 2023-06-20 03:15:38,917 INFO [train.py:996] (2/4) Epoch 3, batch 17750, loss[loss=0.2834, simple_loss=0.3546, pruned_loss=0.106, over 21846.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3277, pruned_loss=0.09336, over 4266537.06 frames. ], batch size: 282, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:15:50,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=472434.0, ans=0.125 2023-06-20 03:16:19,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=472554.0, ans=0.0 2023-06-20 03:16:32,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-20 03:16:33,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472554.0, ans=0.1 2023-06-20 03:16:43,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=472614.0, ans=0.2 2023-06-20 03:16:47,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=472614.0, ans=0.125 2023-06-20 03:16:49,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=472614.0, ans=0.125 2023-06-20 03:17:03,690 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.594e+02 2.990e+02 3.478e+02 6.591e+02, threshold=5.980e+02, percent-clipped=3.0 2023-06-20 03:17:05,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472674.0, ans=0.1 2023-06-20 03:17:27,834 INFO [train.py:996] (2/4) Epoch 3, batch 17800, loss[loss=0.2219, simple_loss=0.2887, pruned_loss=0.07757, over 21304.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3281, pruned_loss=0.09277, over 4269032.67 frames. ], batch size: 159, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:18:17,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=472914.0, ans=0.125 2023-06-20 03:18:35,752 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:19:05,727 INFO [train.py:996] (2/4) Epoch 3, batch 17850, loss[loss=0.2912, simple_loss=0.3629, pruned_loss=0.1098, over 21435.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3275, pruned_loss=0.09294, over 4270661.05 frames. ], batch size: 131, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:19:46,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.41 vs. 
limit=22.5 2023-06-20 03:20:42,270 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.814e+02 3.255e+02 4.435e+02 8.070e+02, threshold=6.511e+02, percent-clipped=5.0 2023-06-20 03:20:42,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=473274.0, ans=0.0 2023-06-20 03:20:55,850 INFO [train.py:996] (2/4) Epoch 3, batch 17900, loss[loss=0.2342, simple_loss=0.3151, pruned_loss=0.07666, over 21272.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3318, pruned_loss=0.09466, over 4273806.26 frames. ], batch size: 159, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:21:27,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=473394.0, ans=0.125 2023-06-20 03:21:33,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=473394.0, ans=0.0 2023-06-20 03:21:57,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=473454.0, ans=0.125 2023-06-20 03:22:14,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473454.0, ans=0.1 2023-06-20 03:22:49,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473574.0, ans=0.1 2023-06-20 03:22:53,610 INFO [train.py:996] (2/4) Epoch 3, batch 17950, loss[loss=0.1964, simple_loss=0.27, pruned_loss=0.06138, over 21812.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3319, pruned_loss=0.09165, over 4269979.56 frames. ], batch size: 118, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:22:54,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=473634.0, ans=0.0 2023-06-20 03:23:34,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=473694.0, ans=0.0 2023-06-20 03:24:15,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-20 03:24:20,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-20 03:24:29,839 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 2.407e+02 3.130e+02 3.720e+02 6.419e+02, threshold=6.259e+02, percent-clipped=0.0 2023-06-20 03:24:43,119 INFO [train.py:996] (2/4) Epoch 3, batch 18000, loss[loss=0.2657, simple_loss=0.3086, pruned_loss=0.1114, over 21244.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3258, pruned_loss=0.09087, over 4275232.87 frames. ], batch size: 471, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:24:43,120 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 03:25:40,128 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.9552, 5.0870, 4.8868, 4.5880], device='cuda:2') 2023-06-20 03:25:44,241 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2752, simple_loss=0.3767, pruned_loss=0.08679, over 1796401.00 frames. 
2023-06-20 03:25:44,241 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 03:25:58,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=22.5 2023-06-20 03:26:25,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=473994.0, ans=0.0 2023-06-20 03:26:27,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=473994.0, ans=0.125 2023-06-20 03:26:47,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474054.0, ans=0.1 2023-06-20 03:26:57,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=474114.0, ans=0.125 2023-06-20 03:27:11,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=474174.0, ans=0.125 2023-06-20 03:27:23,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=474174.0, ans=0.125 2023-06-20 03:27:23,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474174.0, ans=0.1 2023-06-20 03:27:26,969 INFO [train.py:996] (2/4) Epoch 3, batch 18050, loss[loss=0.2573, simple_loss=0.3199, pruned_loss=0.09737, over 21639.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3207, pruned_loss=0.09003, over 4267605.97 frames. ], batch size: 332, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:27:28,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=474234.0, ans=0.125 2023-06-20 03:27:30,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=474234.0, ans=0.0 2023-06-20 03:27:34,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-20 03:28:26,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=474354.0, ans=0.0 2023-06-20 03:28:45,790 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.808e+02 3.337e+02 3.938e+02 6.336e+02, threshold=6.674e+02, percent-clipped=1.0 2023-06-20 03:29:05,296 INFO [train.py:996] (2/4) Epoch 3, batch 18100, loss[loss=0.2765, simple_loss=0.3609, pruned_loss=0.09602, over 21691.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3254, pruned_loss=0.09287, over 4268443.72 frames. ], batch size: 351, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:29:22,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.91 vs. 
limit=22.5 2023-06-20 03:29:41,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474594.0, ans=0.1 2023-06-20 03:29:54,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=474654.0, ans=0.125 2023-06-20 03:30:38,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474774.0, ans=0.1 2023-06-20 03:30:43,528 INFO [train.py:996] (2/4) Epoch 3, batch 18150, loss[loss=0.2379, simple_loss=0.2953, pruned_loss=0.09021, over 15638.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3265, pruned_loss=0.09269, over 4260257.63 frames. ], batch size: 61, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:30:44,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=474834.0, ans=0.0 2023-06-20 03:31:49,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=475014.0, ans=0.0 2023-06-20 03:32:04,234 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.533e+02 2.904e+02 3.319e+02 6.906e+02, threshold=5.807e+02, percent-clipped=1.0 2023-06-20 03:32:29,455 INFO [train.py:996] (2/4) Epoch 3, batch 18200, loss[loss=0.2242, simple_loss=0.2939, pruned_loss=0.07726, over 21532.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.322, pruned_loss=0.09293, over 4246585.54 frames. ], batch size: 195, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:32:31,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=475134.0, ans=0.2 2023-06-20 03:32:58,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=475194.0, ans=0.125 2023-06-20 03:32:59,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.98 vs. limit=22.5 2023-06-20 03:33:20,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=475254.0, ans=0.02 2023-06-20 03:33:37,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=475314.0, ans=0.125 2023-06-20 03:33:40,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=475314.0, ans=0.125 2023-06-20 03:33:48,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=475374.0, ans=0.0 2023-06-20 03:33:56,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=475374.0, ans=0.0 2023-06-20 03:34:01,010 INFO [train.py:996] (2/4) Epoch 3, batch 18250, loss[loss=0.1916, simple_loss=0.2558, pruned_loss=0.06376, over 21319.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3127, pruned_loss=0.08907, over 4247883.70 frames. ], batch size: 144, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:34:02,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-20 03:35:06,230 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.78 vs. 
limit=22.5 2023-06-20 03:35:23,026 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-20 03:35:24,974 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 2.363e+02 2.913e+02 3.587e+02 8.946e+02, threshold=5.827e+02, percent-clipped=7.0 2023-06-20 03:35:37,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=475734.0, ans=0.0 2023-06-20 03:35:37,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-20 03:35:38,295 INFO [train.py:996] (2/4) Epoch 3, batch 18300, loss[loss=0.2805, simple_loss=0.3611, pruned_loss=0.09994, over 21699.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3127, pruned_loss=0.08909, over 4258425.58 frames. ], batch size: 389, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:35:43,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=475734.0, ans=0.2 2023-06-20 03:36:10,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=475794.0, ans=0.125 2023-06-20 03:36:44,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=475914.0, ans=0.0 2023-06-20 03:36:49,186 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:37:02,618 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:37:14,113 INFO [train.py:996] (2/4) Epoch 3, batch 18350, loss[loss=0.2339, simple_loss=0.2972, pruned_loss=0.08528, over 21236.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3194, pruned_loss=0.08929, over 4251521.35 frames. ], batch size: 159, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:38:43,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=476274.0, ans=0.125 2023-06-20 03:38:44,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.546e+02 3.125e+02 3.968e+02 7.184e+02, threshold=6.249e+02, percent-clipped=4.0 2023-06-20 03:38:45,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=476274.0, ans=0.125 2023-06-20 03:38:54,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-20 03:38:58,456 INFO [train.py:996] (2/4) Epoch 3, batch 18400, loss[loss=0.2179, simple_loss=0.2963, pruned_loss=0.06972, over 21620.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3154, pruned_loss=0.08811, over 4246207.92 frames. 
], batch size: 391, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:39:10,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=476334.0, ans=0.125 2023-06-20 03:39:49,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476454.0, ans=0.1 2023-06-20 03:40:12,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=476514.0, ans=0.2 2023-06-20 03:40:31,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=476574.0, ans=0.125 2023-06-20 03:40:35,492 INFO [train.py:996] (2/4) Epoch 3, batch 18450, loss[loss=0.1742, simple_loss=0.245, pruned_loss=0.05166, over 21198.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3107, pruned_loss=0.08415, over 4232487.02 frames. ], batch size: 143, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:40:55,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=476694.0, ans=0.125 2023-06-20 03:41:23,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476754.0, ans=0.1 2023-06-20 03:41:57,540 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 2.233e+02 2.788e+02 3.654e+02 6.686e+02, threshold=5.575e+02, percent-clipped=3.0 2023-06-20 03:42:20,996 INFO [train.py:996] (2/4) Epoch 3, batch 18500, loss[loss=0.2568, simple_loss=0.301, pruned_loss=0.1063, over 21520.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3054, pruned_loss=0.08225, over 4231664.77 frames. ], batch size: 442, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:42:24,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=476934.0, ans=0.5 2023-06-20 03:42:42,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=476934.0, ans=0.125 2023-06-20 03:43:31,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-20 03:43:34,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=477114.0, ans=0.0 2023-06-20 03:44:04,051 INFO [train.py:996] (2/4) Epoch 3, batch 18550, loss[loss=0.2022, simple_loss=0.294, pruned_loss=0.0552, over 20773.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3033, pruned_loss=0.08126, over 4234540.79 frames. 
], batch size: 608, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:44:47,781 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:45:17,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=477354.0, ans=10.0 2023-06-20 03:45:33,178 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:45:38,893 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.443e+02 2.739e+02 3.357e+02 5.521e+02, threshold=5.479e+02, percent-clipped=0.0 2023-06-20 03:45:51,269 INFO [train.py:996] (2/4) Epoch 3, batch 18600, loss[loss=0.3373, simple_loss=0.3876, pruned_loss=0.1435, over 21434.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3015, pruned_loss=0.08232, over 4237876.03 frames. ], batch size: 508, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:45:55,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.73 vs. limit=15.0 2023-06-20 03:46:10,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=477594.0, ans=0.0 2023-06-20 03:46:21,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=477594.0, ans=0.125 2023-06-20 03:46:54,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=477654.0, ans=0.0 2023-06-20 03:47:06,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.05 vs. limit=15.0 2023-06-20 03:47:12,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=477774.0, ans=0.125 2023-06-20 03:47:20,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-20 03:47:32,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=477834.0, ans=0.2 2023-06-20 03:47:33,512 INFO [train.py:996] (2/4) Epoch 3, batch 18650, loss[loss=0.2384, simple_loss=0.3022, pruned_loss=0.08736, over 21720.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3029, pruned_loss=0.08334, over 4248282.75 frames. ], batch size: 333, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:47:53,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477894.0, ans=0.1 2023-06-20 03:48:51,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.448e+02 2.806e+02 3.441e+02 5.689e+02, threshold=5.611e+02, percent-clipped=1.0 2023-06-20 03:49:03,414 INFO [train.py:996] (2/4) Epoch 3, batch 18700, loss[loss=0.245, simple_loss=0.3165, pruned_loss=0.08678, over 21801.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3016, pruned_loss=0.08556, over 4263213.95 frames. 
], batch size: 124, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:49:35,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=478194.0, ans=0.125 2023-06-20 03:50:01,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=478254.0, ans=0.0 2023-06-20 03:50:07,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=478254.0, ans=0.125 2023-06-20 03:50:09,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-20 03:50:35,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=478374.0, ans=0.125 2023-06-20 03:50:35,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=478374.0, ans=0.2 2023-06-20 03:50:39,449 INFO [train.py:996] (2/4) Epoch 3, batch 18750, loss[loss=0.2701, simple_loss=0.3448, pruned_loss=0.09776, over 21596.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3026, pruned_loss=0.08743, over 4274428.17 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:51:31,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-20 03:52:11,938 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.724e+02 3.056e+02 3.518e+02 6.406e+02, threshold=6.113e+02, percent-clipped=0.0 2023-06-20 03:52:24,556 INFO [train.py:996] (2/4) Epoch 3, batch 18800, loss[loss=0.2083, simple_loss=0.2845, pruned_loss=0.06607, over 21289.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3107, pruned_loss=0.08941, over 4269745.82 frames. ], batch size: 176, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:52:25,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=12.0 2023-06-20 03:52:39,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=478734.0, ans=0.1 2023-06-20 03:53:18,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=478854.0, ans=0.1 2023-06-20 03:53:23,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=478854.0, ans=0.125 2023-06-20 03:54:08,162 INFO [train.py:996] (2/4) Epoch 3, batch 18850, loss[loss=0.2214, simple_loss=0.2874, pruned_loss=0.07771, over 21671.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.307, pruned_loss=0.08442, over 4268199.96 frames. 
], batch size: 298, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:54:12,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=479034.0, ans=0.125 2023-06-20 03:55:11,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=479154.0, ans=0.125 2023-06-20 03:55:41,114 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 2.197e+02 2.509e+02 3.227e+02 5.628e+02, threshold=5.018e+02, percent-clipped=1.0 2023-06-20 03:55:46,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-20 03:56:10,928 INFO [train.py:996] (2/4) Epoch 3, batch 18900, loss[loss=0.2552, simple_loss=0.3172, pruned_loss=0.09659, over 22026.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3024, pruned_loss=0.08373, over 4270269.10 frames. ], batch size: 119, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:56:32,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=479394.0, ans=0.125 2023-06-20 03:57:22,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=479514.0, ans=0.2 2023-06-20 03:57:26,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=479574.0, ans=0.2 2023-06-20 03:57:32,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=479574.0, ans=0.125 2023-06-20 03:57:54,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=479574.0, ans=0.125 2023-06-20 03:57:57,225 INFO [train.py:996] (2/4) Epoch 3, batch 18950, loss[loss=0.2854, simple_loss=0.3581, pruned_loss=0.1064, over 21304.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3056, pruned_loss=0.08731, over 4271291.66 frames. ], batch size: 176, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 03:58:24,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=479694.0, ans=0.125 2023-06-20 03:58:37,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=479754.0, ans=0.125 2023-06-20 03:59:36,600 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.641e+02 3.073e+02 3.851e+02 7.117e+02, threshold=6.146e+02, percent-clipped=12.0 2023-06-20 03:59:48,123 INFO [train.py:996] (2/4) Epoch 3, batch 19000, loss[loss=0.365, simple_loss=0.4055, pruned_loss=0.1622, over 21321.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3165, pruned_loss=0.09003, over 4273832.07 frames. 
], batch size: 507, lr: 1.07e-02, grad_scale: 32.0 2023-06-20 04:00:20,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=479994.0, ans=0.0 2023-06-20 04:00:27,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479994.0, ans=0.1 2023-06-20 04:00:29,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=479994.0, ans=0.125 2023-06-20 04:00:43,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=480054.0, ans=0.0 2023-06-20 04:00:50,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=480054.0, ans=0.125 2023-06-20 04:01:24,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-20 04:01:35,656 INFO [train.py:996] (2/4) Epoch 3, batch 19050, loss[loss=0.2503, simple_loss=0.3107, pruned_loss=0.09493, over 21777.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3217, pruned_loss=0.09424, over 4278749.68 frames. ], batch size: 247, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:02:26,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-20 04:03:08,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.699e+02 3.066e+02 3.577e+02 5.949e+02, threshold=6.132e+02, percent-clipped=0.0 2023-06-20 04:03:36,283 INFO [train.py:996] (2/4) Epoch 3, batch 19100, loss[loss=0.247, simple_loss=0.2991, pruned_loss=0.09746, over 21300.00 frames. ], tot_loss[loss=0.255, simple_loss=0.32, pruned_loss=0.09498, over 4284925.44 frames. ], batch size: 144, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:03:47,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=480534.0, ans=0.2 2023-06-20 04:03:48,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=480534.0, ans=0.125 2023-06-20 04:04:40,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-20 04:05:13,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480774.0, ans=0.1 2023-06-20 04:05:24,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-20 04:05:32,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=480774.0, ans=0.0 2023-06-20 04:05:40,003 INFO [train.py:996] (2/4) Epoch 3, batch 19150, loss[loss=0.3658, simple_loss=0.4406, pruned_loss=0.1455, over 21496.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3224, pruned_loss=0.09597, over 4281968.61 frames. 
], batch size: 471, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:05:58,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=480894.0, ans=0.125 2023-06-20 04:07:17,979 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.726e+02 3.068e+02 3.801e+02 8.063e+02, threshold=6.136e+02, percent-clipped=5.0 2023-06-20 04:07:35,518 INFO [train.py:996] (2/4) Epoch 3, batch 19200, loss[loss=0.2672, simple_loss=0.3689, pruned_loss=0.08279, over 21780.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3344, pruned_loss=0.09737, over 4284616.96 frames. ], batch size: 316, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:07:43,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=15.0 2023-06-20 04:08:20,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=481254.0, ans=0.0 2023-06-20 04:08:40,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481314.0, ans=0.1 2023-06-20 04:09:13,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=481434.0, ans=0.05 2023-06-20 04:09:14,903 INFO [train.py:996] (2/4) Epoch 3, batch 19250, loss[loss=0.1882, simple_loss=0.2714, pruned_loss=0.05244, over 21346.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3333, pruned_loss=0.09167, over 4277484.71 frames. ], batch size: 194, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:09:22,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481434.0, ans=0.1 2023-06-20 04:09:46,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=481554.0, ans=0.125 2023-06-20 04:10:00,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=481554.0, ans=0.05 2023-06-20 04:10:28,328 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 2.035e+02 2.594e+02 2.988e+02 5.303e+02, threshold=5.187e+02, percent-clipped=0.0 2023-06-20 04:10:33,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=481674.0, ans=0.125 2023-06-20 04:10:50,952 INFO [train.py:996] (2/4) Epoch 3, batch 19300, loss[loss=0.2262, simple_loss=0.2854, pruned_loss=0.08349, over 21291.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.329, pruned_loss=0.09025, over 4284363.54 frames. ], batch size: 159, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:12:29,533 INFO [train.py:996] (2/4) Epoch 3, batch 19350, loss[loss=0.1958, simple_loss=0.2668, pruned_loss=0.06237, over 21213.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3223, pruned_loss=0.08591, over 4284414.07 frames. ], batch size: 176, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:12:41,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. 
limit=12.0 2023-06-20 04:12:46,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=482094.0, ans=0.2 2023-06-20 04:13:15,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=482154.0, ans=0.2 2023-06-20 04:13:45,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=482274.0, ans=0.125 2023-06-20 04:13:46,941 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.454e+02 2.758e+02 3.131e+02 4.952e+02, threshold=5.516e+02, percent-clipped=0.0 2023-06-20 04:14:04,537 INFO [train.py:996] (2/4) Epoch 3, batch 19400, loss[loss=0.2336, simple_loss=0.2979, pruned_loss=0.08468, over 21242.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.321, pruned_loss=0.08557, over 4284131.20 frames. ], batch size: 143, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:15:52,988 INFO [train.py:996] (2/4) Epoch 3, batch 19450, loss[loss=0.2638, simple_loss=0.3327, pruned_loss=0.09748, over 14797.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3181, pruned_loss=0.08728, over 4280542.12 frames. ], batch size: 60, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:16:02,965 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:16:04,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=482634.0, ans=0.125 2023-06-20 04:17:12,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.711e+02 3.172e+02 3.955e+02 6.078e+02, threshold=6.345e+02, percent-clipped=2.0 2023-06-20 04:17:30,446 INFO [train.py:996] (2/4) Epoch 3, batch 19500, loss[loss=0.252, simple_loss=0.3296, pruned_loss=0.0872, over 21177.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.315, pruned_loss=0.08923, over 4268110.85 frames. ], batch size: 548, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:17:44,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=482994.0, ans=0.0 2023-06-20 04:18:30,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-20 04:18:47,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=483114.0, ans=0.125 2023-06-20 04:19:27,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=483234.0, ans=0.125 2023-06-20 04:19:27,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.28 vs. limit=15.0 2023-06-20 04:19:28,233 INFO [train.py:996] (2/4) Epoch 3, batch 19550, loss[loss=0.2689, simple_loss=0.3498, pruned_loss=0.09399, over 21505.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3097, pruned_loss=0.08714, over 4255324.24 frames. ], batch size: 471, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:19:37,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=483234.0, ans=0.125 2023-06-20 04:19:59,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.38 vs. 
limit=15.0 2023-06-20 04:20:45,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=483414.0, ans=0.0 2023-06-20 04:21:03,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.587e+02 3.203e+02 3.916e+02 8.201e+02, threshold=6.407e+02, percent-clipped=3.0 2023-06-20 04:21:15,868 INFO [train.py:996] (2/4) Epoch 3, batch 19600, loss[loss=0.2374, simple_loss=0.2956, pruned_loss=0.08958, over 21618.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.312, pruned_loss=0.08788, over 4269712.84 frames. ], batch size: 195, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:21:46,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=483594.0, ans=0.1 2023-06-20 04:22:09,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=483654.0, ans=0.125 2023-06-20 04:22:11,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=483714.0, ans=0.0 2023-06-20 04:22:12,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=483714.0, ans=0.0 2023-06-20 04:22:22,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=483714.0, ans=0.1 2023-06-20 04:22:53,022 INFO [train.py:996] (2/4) Epoch 3, batch 19650, loss[loss=0.279, simple_loss=0.3398, pruned_loss=0.1091, over 21784.00 frames. ], tot_loss[loss=0.252, simple_loss=0.318, pruned_loss=0.09304, over 4276923.17 frames. ], batch size: 414, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:23:05,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-06-20 04:23:29,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=483894.0, ans=0.0 2023-06-20 04:23:44,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=483954.0, ans=0.125 2023-06-20 04:24:35,402 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.259e+02 2.819e+02 3.213e+02 4.152e+02 6.048e+02, threshold=6.427e+02, percent-clipped=0.0 2023-06-20 04:24:38,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484074.0, ans=0.1 2023-06-20 04:25:06,166 INFO [train.py:996] (2/4) Epoch 3, batch 19700, loss[loss=0.2638, simple_loss=0.3232, pruned_loss=0.1023, over 20184.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.321, pruned_loss=0.09332, over 4265871.07 frames. ], batch size: 707, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:25:29,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=484194.0, ans=0.0 2023-06-20 04:26:13,078 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:26:41,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=484374.0, ans=0.0 2023-06-20 04:26:45,404 INFO [train.py:996] (2/4) Epoch 3, batch 19750, loss[loss=0.2817, simple_loss=0.3652, pruned_loss=0.09915, over 21847.00 frames. 
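A note on how the per-batch "loss[...]" fields fit together: a minimal sketch, assuming the logged loss is the steady-state weighted sum of the two transducer terms with the simple_loss_scale=0.5 from the configuration printed at the start of this log (the warm-up weighting before warm_step is not reproduced here). The arithmetic is checked against the batch 19750 entry just above.

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    """Total loss the way the per-batch log entries appear to report it."""
    return simple_loss_scale * simple_loss + pruned_loss

# Worked check against the batch 19750 entry above:
# 0.5 * 0.3652 + 0.09915 = 0.28175, matching the logged loss=0.2817.
assert abs(combined_loss(0.3652, 0.09915) - 0.2817) < 1e-3

The same relation holds for the running tot_loss fields, e.g. 0.5 * 0.3315 + 0.09544 = 0.2612 for the totals that follow this batch.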
], tot_loss[loss=0.2612, simple_loss=0.3315, pruned_loss=0.09544, over 4265768.71 frames. ], batch size: 351, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:28:14,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=484614.0, ans=0.125 2023-06-20 04:28:26,731 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.683e+02 3.196e+02 4.002e+02 6.809e+02, threshold=6.392e+02, percent-clipped=1.0 2023-06-20 04:28:38,973 INFO [train.py:996] (2/4) Epoch 3, batch 19800, loss[loss=0.2182, simple_loss=0.2955, pruned_loss=0.07048, over 21808.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3332, pruned_loss=0.09676, over 4272885.20 frames. ], batch size: 316, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:29:13,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=484794.0, ans=0.5 2023-06-20 04:29:14,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=484794.0, ans=0.0 2023-06-20 04:29:30,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=484854.0, ans=0.0 2023-06-20 04:29:33,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=484854.0, ans=0.0 2023-06-20 04:29:36,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=484854.0, ans=0.0 2023-06-20 04:29:38,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=484854.0, ans=0.2 2023-06-20 04:30:06,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=484974.0, ans=0.0 2023-06-20 04:30:21,936 INFO [train.py:996] (2/4) Epoch 3, batch 19850, loss[loss=0.1906, simple_loss=0.2666, pruned_loss=0.05734, over 21394.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3238, pruned_loss=0.09083, over 4281211.61 frames. ], batch size: 211, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:31:14,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=485154.0, ans=0.125 2023-06-20 04:31:28,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=485214.0, ans=0.125 2023-06-20 04:31:31,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-20 04:31:41,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.292e+02 2.688e+02 3.533e+02 4.969e+02, threshold=5.376e+02, percent-clipped=0.0 2023-06-20 04:31:45,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485274.0, ans=0.1 2023-06-20 04:31:59,296 INFO [train.py:996] (2/4) Epoch 3, batch 19900, loss[loss=0.2314, simple_loss=0.3415, pruned_loss=0.06065, over 19605.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3228, pruned_loss=0.08707, over 4275293.52 frames. ], batch size: 702, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:32:00,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.39 vs. 
limit=15.0 2023-06-20 04:32:51,069 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:33:01,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=485514.0, ans=0.125 2023-06-20 04:33:07,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=485514.0, ans=0.0 2023-06-20 04:33:11,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=485514.0, ans=0.0 2023-06-20 04:33:27,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=485574.0, ans=0.2 2023-06-20 04:33:42,960 INFO [train.py:996] (2/4) Epoch 3, batch 19950, loss[loss=0.2624, simple_loss=0.3131, pruned_loss=0.1058, over 21855.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3165, pruned_loss=0.08606, over 4270361.74 frames. ], batch size: 98, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:34:24,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.93 vs. limit=10.0 2023-06-20 04:34:26,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485754.0, ans=0.1 2023-06-20 04:34:36,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=485754.0, ans=0.0 2023-06-20 04:34:43,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485814.0, ans=0.1 2023-06-20 04:35:02,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.533e+02 3.149e+02 3.836e+02 5.627e+02, threshold=6.299e+02, percent-clipped=1.0 2023-06-20 04:35:19,347 INFO [train.py:996] (2/4) Epoch 3, batch 20000, loss[loss=0.2714, simple_loss=0.3647, pruned_loss=0.08903, over 20756.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3168, pruned_loss=0.08656, over 4268646.85 frames. ], batch size: 607, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:35:28,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485934.0, ans=0.1 2023-06-20 04:36:10,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=486054.0, ans=0.07 2023-06-20 04:36:21,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=486114.0, ans=0.125 2023-06-20 04:36:27,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=486174.0, ans=0.125 2023-06-20 04:36:37,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=486174.0, ans=0.125 2023-06-20 04:36:53,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=486234.0, ans=0.125 2023-06-20 04:36:54,199 INFO [train.py:996] (2/4) Epoch 3, batch 20050, loss[loss=0.2601, simple_loss=0.322, pruned_loss=0.09907, over 21278.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3187, pruned_loss=0.08936, over 4276078.32 frames. 
], batch size: 143, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:37:00,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=486234.0, ans=0.0 2023-06-20 04:37:28,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=486354.0, ans=0.0 2023-06-20 04:37:29,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=486354.0, ans=0.125 2023-06-20 04:37:29,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=486354.0, ans=0.0 2023-06-20 04:37:31,922 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-20 04:37:32,812 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:38:39,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.657e+02 3.135e+02 3.679e+02 6.652e+02, threshold=6.270e+02, percent-clipped=1.0 2023-06-20 04:38:41,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486474.0, ans=0.1 2023-06-20 04:38:49,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=486474.0, ans=0.0 2023-06-20 04:38:52,020 INFO [train.py:996] (2/4) Epoch 3, batch 20100, loss[loss=0.2658, simple_loss=0.3295, pruned_loss=0.101, over 21333.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3218, pruned_loss=0.09275, over 4284542.73 frames. ], batch size: 176, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:38:59,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=486534.0, ans=0.2 2023-06-20 04:39:32,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=486654.0, ans=0.05 2023-06-20 04:39:42,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=486654.0, ans=0.09899494936611666 2023-06-20 04:40:08,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=486714.0, ans=0.125 2023-06-20 04:40:31,379 INFO [train.py:996] (2/4) Epoch 3, batch 20150, loss[loss=0.3023, simple_loss=0.3708, pruned_loss=0.1169, over 21952.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3334, pruned_loss=0.09705, over 4287769.10 frames. 
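A hedged reading of the [optim.py:471] lines: the five numbers are the min / 25% / median / 75% / max of recently observed gradient norms, and the printed threshold equals Clipping_scale times the median (2.0 * 3.135e+02 = 6.270e+02 in the 04:38:39 entry just above), with percent-clipped the share of batches above that threshold. The window length and what exactly gets clipped are assumptions; only the threshold arithmetic is checked against the log.

import numpy as np

def clipping_summary(grad_norms, clipping_scale: float = 2.0):
    """Reproduce the quartiles / threshold / percent-clipped fields."""
    norms = np.asarray(grad_norms, dtype=float)
    quartiles = np.quantile(norms, [0.0, 0.25, 0.5, 0.75, 1.0])
    threshold = clipping_scale * quartiles[2]          # scale * median
    percent_clipped = 100.0 * float((norms > threshold).mean())
    return quartiles, threshold, percent_clipped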
], batch size: 372, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:40:40,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=486834.0, ans=0.125 2023-06-20 04:41:09,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=486894.0, ans=0.125 2023-06-20 04:41:27,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=486954.0, ans=0.125 2023-06-20 04:41:42,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=487014.0, ans=0.025 2023-06-20 04:41:45,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487014.0, ans=0.1 2023-06-20 04:41:46,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487014.0, ans=0.1 2023-06-20 04:42:13,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.13 vs. limit=15.0 2023-06-20 04:42:17,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.160e+02 3.625e+02 4.244e+02 7.181e+02, threshold=7.250e+02, percent-clipped=1.0 2023-06-20 04:42:29,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=487074.0, ans=0.0 2023-06-20 04:42:37,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=487074.0, ans=0.0 2023-06-20 04:42:40,030 INFO [train.py:996] (2/4) Epoch 3, batch 20200, loss[loss=0.2714, simple_loss=0.3645, pruned_loss=0.08914, over 21716.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3372, pruned_loss=0.09988, over 4279565.83 frames. ], batch size: 298, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:42:52,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=487134.0, ans=0.0 2023-06-20 04:42:53,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.77 vs. limit=15.0 2023-06-20 04:42:54,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=487194.0, ans=0.125 2023-06-20 04:44:02,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=487374.0, ans=0.05 2023-06-20 04:44:29,122 INFO [train.py:996] (2/4) Epoch 3, batch 20250, loss[loss=0.2526, simple_loss=0.3216, pruned_loss=0.09178, over 21901.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3369, pruned_loss=0.09757, over 4280682.83 frames. 
], batch size: 124, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:45:08,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487554.0, ans=0.1 2023-06-20 04:45:55,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=487614.0, ans=0.0 2023-06-20 04:46:07,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.337e+02 2.707e+02 3.464e+02 5.836e+02, threshold=5.414e+02, percent-clipped=0.0 2023-06-20 04:46:19,805 INFO [train.py:996] (2/4) Epoch 3, batch 20300, loss[loss=0.2239, simple_loss=0.2991, pruned_loss=0.07438, over 21470.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3342, pruned_loss=0.09425, over 4267283.12 frames. ], batch size: 195, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:46:48,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=487794.0, ans=0.125 2023-06-20 04:47:09,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=487854.0, ans=0.2 2023-06-20 04:47:56,605 INFO [train.py:996] (2/4) Epoch 3, batch 20350, loss[loss=0.2242, simple_loss=0.2739, pruned_loss=0.08726, over 19959.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3342, pruned_loss=0.09466, over 4253661.01 frames. ], batch size: 703, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:48:07,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=488034.0, ans=0.2 2023-06-20 04:48:11,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=488094.0, ans=0.125 2023-06-20 04:48:13,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=488094.0, ans=0.2 2023-06-20 04:49:08,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=488214.0, ans=0.125 2023-06-20 04:49:28,272 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.591e+02 2.979e+02 3.665e+02 6.122e+02, threshold=5.958e+02, percent-clipped=2.0 2023-06-20 04:49:40,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=488334.0, ans=0.0 2023-06-20 04:49:41,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0 2023-06-20 04:49:41,478 INFO [train.py:996] (2/4) Epoch 3, batch 20400, loss[loss=0.3731, simple_loss=0.412, pruned_loss=0.1671, over 21439.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3378, pruned_loss=0.0982, over 4257052.57 frames. ], batch size: 508, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:49:42,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=488334.0, ans=0.125 2023-06-20 04:50:08,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-20 04:51:02,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. 
limit=15.0 2023-06-20 04:51:07,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=488514.0, ans=0.2 2023-06-20 04:51:46,586 INFO [train.py:996] (2/4) Epoch 3, batch 20450, loss[loss=0.304, simple_loss=0.3521, pruned_loss=0.1279, over 21538.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3394, pruned_loss=0.101, over 4242192.19 frames. ], batch size: 471, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:51:51,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=488634.0, ans=0.2 2023-06-20 04:52:00,912 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-20 04:52:49,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=488814.0, ans=0.125 2023-06-20 04:52:54,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=488814.0, ans=0.0 2023-06-20 04:53:14,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.891e+02 3.490e+02 4.118e+02 8.011e+02, threshold=6.980e+02, percent-clipped=6.0 2023-06-20 04:53:21,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=488874.0, ans=0.125 2023-06-20 04:53:25,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=488934.0, ans=0.125 2023-06-20 04:53:26,652 INFO [train.py:996] (2/4) Epoch 3, batch 20500, loss[loss=0.2721, simple_loss=0.3404, pruned_loss=0.1019, over 21455.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3345, pruned_loss=0.1009, over 4253494.46 frames. ], batch size: 131, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:53:41,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488994.0, ans=0.1 2023-06-20 04:54:06,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-20 04:54:09,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=489054.0, ans=0.125 2023-06-20 04:55:15,470 INFO [train.py:996] (2/4) Epoch 3, batch 20550, loss[loss=0.2809, simple_loss=0.3124, pruned_loss=0.1247, over 21406.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3283, pruned_loss=0.09942, over 4252549.22 frames. ], batch size: 508, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 04:55:55,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. 
limit=15.0 2023-06-20 04:56:10,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=489354.0, ans=0.125 2023-06-20 04:56:30,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=489414.0, ans=0.125 2023-06-20 04:56:48,731 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.383e+02 2.780e+02 3.165e+02 5.319e+02, threshold=5.560e+02, percent-clipped=0.0 2023-06-20 04:56:59,053 INFO [train.py:996] (2/4) Epoch 3, batch 20600, loss[loss=0.2605, simple_loss=0.3242, pruned_loss=0.09834, over 21844.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3291, pruned_loss=0.09635, over 4242488.96 frames. ], batch size: 332, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 04:57:14,141 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:58:27,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=489774.0, ans=0.2 2023-06-20 04:58:44,296 INFO [train.py:996] (2/4) Epoch 3, batch 20650, loss[loss=0.2395, simple_loss=0.3024, pruned_loss=0.08831, over 21683.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3244, pruned_loss=0.0966, over 4234254.26 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 04:58:47,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=489834.0, ans=0.125 2023-06-20 04:59:25,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=489954.0, ans=0.125 2023-06-20 04:59:35,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=489954.0, ans=0.0 2023-06-20 04:59:50,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=490014.0, ans=0.125 2023-06-20 05:00:10,410 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.534e+02 3.079e+02 3.590e+02 6.191e+02, threshold=6.158e+02, percent-clipped=1.0 2023-06-20 05:00:21,343 INFO [train.py:996] (2/4) Epoch 3, batch 20700, loss[loss=0.2389, simple_loss=0.3111, pruned_loss=0.08337, over 21763.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.316, pruned_loss=0.09229, over 4248673.60 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:01:18,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=490254.0, ans=0.025 2023-06-20 05:01:30,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.70 vs. limit=15.0 2023-06-20 05:01:37,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=490314.0, ans=0.1 2023-06-20 05:01:52,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=490374.0, ans=0.125 2023-06-20 05:02:17,376 INFO [train.py:996] (2/4) Epoch 3, batch 20750, loss[loss=0.2667, simple_loss=0.3563, pruned_loss=0.08852, over 21222.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3203, pruned_loss=0.09195, over 4251050.26 frames. 
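The many [scaling.py:182] "ScheduledFloat ... batch_count=..., ans=..." entries report float hyper-parameters (dropout probabilities around ans=0.1, skip rates at ans=0.0, balancer probabilities at ans=0.125 by this point in training) that are scheduled as a function of batch_count. A minimal sketch of such a piecewise-linear schedule; the breakpoints below are illustrative, not the ones used in this run.

import bisect

class ScheduledFloatSketch:
    """Piecewise-linear float schedule keyed on batch_count."""
    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.x = [p[0] for p in points]
        self.y = [p[1] for p in points]

    def value(self, batch_count: float) -> float:
        if batch_count <= self.x[0]:
            return self.y[0]
        if batch_count >= self.x[-1]:
            return self.y[-1]
        i = bisect.bisect_right(self.x, batch_count)
        x0, x1 = self.x[i - 1], self.x[i]
        y0, y1 = self.y[i - 1], self.y[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a skip rate that decays to zero early in training and stays there,
# consistent with the "ans=0.0" skip-rate entries at batch_count ~ 490000
# above (breakpoints here are made up for illustration):
attention_skip_rate = ScheduledFloatSketch((0.0, 0.2), (4000.0, 0.05), (16000.0, 0.0))
print(attention_skip_rate.value(490000.0))   # -> 0.0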
], batch size: 548, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:02:54,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=490494.0, ans=0.2 2023-06-20 05:03:00,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-20 05:03:16,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=490554.0, ans=0.125 2023-06-20 05:03:31,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=490554.0, ans=0.2 2023-06-20 05:03:51,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=490674.0, ans=0.125 2023-06-20 05:03:55,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=490674.0, ans=0.125 2023-06-20 05:03:56,441 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.880e+02 3.382e+02 4.040e+02 6.281e+02, threshold=6.763e+02, percent-clipped=1.0 2023-06-20 05:04:10,634 INFO [train.py:996] (2/4) Epoch 3, batch 20800, loss[loss=0.2285, simple_loss=0.2891, pruned_loss=0.08394, over 21546.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3235, pruned_loss=0.09323, over 4242968.30 frames. ], batch size: 132, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:04:20,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=490734.0, ans=0.0 2023-06-20 05:04:24,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=490734.0, ans=0.125 2023-06-20 05:04:33,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=490794.0, ans=0.125 2023-06-20 05:05:08,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=490854.0, ans=0.0 2023-06-20 05:05:25,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=490914.0, ans=0.125 2023-06-20 05:05:31,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=490974.0, ans=0.125 2023-06-20 05:05:43,072 INFO [train.py:996] (2/4) Epoch 3, batch 20850, loss[loss=0.2108, simple_loss=0.2765, pruned_loss=0.0726, over 21639.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3159, pruned_loss=0.0908, over 4241433.18 frames. 
], batch size: 230, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:06:03,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=491034.0, ans=0.125 2023-06-20 05:06:33,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=491154.0, ans=0.0 2023-06-20 05:06:47,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=491154.0, ans=0.125 2023-06-20 05:06:56,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=491214.0, ans=0.125 2023-06-20 05:06:58,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-20 05:06:59,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=491214.0, ans=0.05 2023-06-20 05:07:09,894 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.577e+02 3.141e+02 3.837e+02 5.556e+02, threshold=6.283e+02, percent-clipped=0.0 2023-06-20 05:07:25,945 INFO [train.py:996] (2/4) Epoch 3, batch 20900, loss[loss=0.2543, simple_loss=0.3218, pruned_loss=0.09337, over 21889.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3158, pruned_loss=0.09166, over 4253389.31 frames. ], batch size: 118, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:07:26,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=491334.0, ans=0.1 2023-06-20 05:07:57,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=491394.0, ans=0.125 2023-06-20 05:07:59,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=491394.0, ans=0.0 2023-06-20 05:08:16,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=491454.0, ans=0.0 2023-06-20 05:08:36,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=491514.0, ans=0.125 2023-06-20 05:08:55,354 INFO [train.py:996] (2/4) Epoch 3, batch 20950, loss[loss=0.1993, simple_loss=0.2775, pruned_loss=0.0606, over 21766.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3123, pruned_loss=0.08828, over 4254935.33 frames. ], batch size: 316, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:09:44,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-20 05:10:05,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.42 vs. limit=15.0 2023-06-20 05:10:16,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=491874.0, ans=0.125 2023-06-20 05:10:19,390 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.460e+02 2.871e+02 3.245e+02 5.204e+02, threshold=5.741e+02, percent-clipped=0.0 2023-06-20 05:10:29,540 INFO [train.py:996] (2/4) Epoch 3, batch 21000, loss[loss=0.2113, simple_loss=0.3134, pruned_loss=0.05466, over 19799.00 frames. 
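The [scaling.py:962] "Whitening: ... metric=X vs. limit=Y" entries just above report a statistic of a module's output covariance that equals 1.0 when the channels are perfectly whitened (all covariance eigenvalues equal) and grows as the covariance becomes ill-conditioned; the module only logs it when the metric exceeds the configured limit. The sketch below is a reconstruction for intuition, not necessarily the exact formula used by this code.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels) activations for one named module."""
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    xg = x.reshape(num_frames, num_groups, num_channels // num_groups)
    xg = xg - xg.mean(dim=0, keepdim=True)
    per_group = []
    for g in range(num_groups):
        cov = xg[:, g, :].T @ xg[:, g, :] / num_frames
        eigs = torch.linalg.eigvalsh(cov).clamp(min=0.0)
        # mean(eig^2) / mean(eig)^2 == 1.0 iff all eigenvalues are equal
        per_group.append(
            (eigs.pow(2).mean() / eigs.mean().pow(2).clamp(min=1e-20)).item()
        )
    return float(sum(per_group) / len(per_group))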
], tot_loss[loss=0.243, simple_loss=0.3108, pruned_loss=0.08758, over 4254544.58 frames. ], batch size: 703, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:10:29,540 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 05:11:18,513 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.3388, 5.4924, 5.2486, 5.0649], device='cuda:2') 2023-06-20 05:11:23,174 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2766, simple_loss=0.3765, pruned_loss=0.08831, over 1796401.00 frames. 2023-06-20 05:11:23,175 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 05:12:24,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=492114.0, ans=0.0 2023-06-20 05:12:28,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-20 05:12:59,901 INFO [train.py:996] (2/4) Epoch 3, batch 21050, loss[loss=0.2764, simple_loss=0.314, pruned_loss=0.1194, over 21297.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.31, pruned_loss=0.0889, over 4262922.98 frames. ], batch size: 471, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:13:00,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=492234.0, ans=0.125 2023-06-20 05:13:06,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=492234.0, ans=0.05 2023-06-20 05:13:31,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=492294.0, ans=0.125 2023-06-20 05:13:57,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=492354.0, ans=0.125 2023-06-20 05:14:21,309 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.359e+02 2.713e+02 3.302e+02 4.914e+02, threshold=5.427e+02, percent-clipped=0.0 2023-06-20 05:14:30,221 INFO [train.py:996] (2/4) Epoch 3, batch 21100, loss[loss=0.2565, simple_loss=0.3022, pruned_loss=0.1054, over 21627.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.306, pruned_loss=0.0883, over 4258670.10 frames. ], batch size: 416, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:14:41,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=492534.0, ans=0.125 2023-06-20 05:15:06,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=492594.0, ans=0.125 2023-06-20 05:16:26,482 INFO [train.py:996] (2/4) Epoch 3, batch 21150, loss[loss=0.2181, simple_loss=0.2717, pruned_loss=0.08224, over 21444.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3019, pruned_loss=0.08807, over 4260639.26 frames. ], batch size: 212, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:17:39,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. 
limit=15.0 2023-06-20 05:17:41,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=493074.0, ans=0.2 2023-06-20 05:17:46,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493074.0, ans=0.1 2023-06-20 05:17:48,819 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.584e+02 3.052e+02 3.730e+02 6.126e+02, threshold=6.104e+02, percent-clipped=6.0 2023-06-20 05:18:03,239 INFO [train.py:996] (2/4) Epoch 3, batch 21200, loss[loss=0.1977, simple_loss=0.2705, pruned_loss=0.06243, over 21698.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.2975, pruned_loss=0.08672, over 4262155.88 frames. ], batch size: 298, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:18:39,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=493254.0, ans=0.0 2023-06-20 05:19:07,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=493314.0, ans=0.125 2023-06-20 05:19:31,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=493374.0, ans=0.125 2023-06-20 05:19:31,565 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-20 05:19:39,363 INFO [train.py:996] (2/4) Epoch 3, batch 21250, loss[loss=0.2738, simple_loss=0.3416, pruned_loss=0.103, over 21668.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.2964, pruned_loss=0.08706, over 4261838.19 frames. ], batch size: 247, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:21:07,057 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.549e+02 3.045e+02 3.638e+02 6.248e+02, threshold=6.090e+02, percent-clipped=1.0 2023-06-20 05:21:11,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5 2023-06-20 05:21:21,840 INFO [train.py:996] (2/4) Epoch 3, batch 21300, loss[loss=0.248, simple_loss=0.3142, pruned_loss=0.09088, over 21323.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3046, pruned_loss=0.08996, over 4256437.24 frames. ], batch size: 159, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:21:55,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-20 05:22:04,436 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:23:04,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=494034.0, ans=0.07 2023-06-20 05:23:05,084 INFO [train.py:996] (2/4) Epoch 3, batch 21350, loss[loss=0.2429, simple_loss=0.3238, pruned_loss=0.08099, over 21352.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3107, pruned_loss=0.09128, over 4258138.12 frames. 
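The validation block above also prints attn_weights_entropy, one value per attention head (four values, matching num_heads=4 for the first encoder stack). A hedged sketch of that diagnostic: the Shannon entropy of each head's attention distribution, averaged over query positions; values near log(num_keys) indicate a head that attends almost uniformly. Exactly which dimensions the real diagnostic averages over is an assumption.

import torch

def attn_weights_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
    """attn_weights: (num_heads, num_queries, num_keys), rows sum to 1."""
    p = attn_weights.clamp(min=1e-20)
    entropy = -(p * p.log()).sum(dim=-1)   # (num_heads, num_queries)
    return entropy.mean(dim=-1)            # one entropy value per head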
], batch size: 548, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:23:39,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=494094.0, ans=0.125 2023-06-20 05:23:55,643 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=15.0 2023-06-20 05:24:21,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=15.0 2023-06-20 05:24:40,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=494214.0, ans=0.0 2023-06-20 05:24:57,874 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.675e+02 2.570e+02 2.839e+02 3.402e+02 4.557e+02, threshold=5.677e+02, percent-clipped=0.0 2023-06-20 05:25:16,188 INFO [train.py:996] (2/4) Epoch 3, batch 21400, loss[loss=0.2079, simple_loss=0.2897, pruned_loss=0.06302, over 21370.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.314, pruned_loss=0.09071, over 4264364.03 frames. ], batch size: 211, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:25:26,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=494334.0, ans=0.125 2023-06-20 05:26:40,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.58 vs. limit=22.5 2023-06-20 05:27:20,261 INFO [train.py:996] (2/4) Epoch 3, batch 21450, loss[loss=0.2426, simple_loss=0.3128, pruned_loss=0.08624, over 21265.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3175, pruned_loss=0.0927, over 4272935.25 frames. ], batch size: 143, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:27:37,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=494694.0, ans=0.0 2023-06-20 05:27:55,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.14 vs. limit=12.0 2023-06-20 05:28:02,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=494754.0, ans=0.1 2023-06-20 05:28:05,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=494754.0, ans=0.125 2023-06-20 05:28:37,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=494874.0, ans=0.2 2023-06-20 05:28:38,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.516e+02 2.861e+02 3.352e+02 5.412e+02, threshold=5.722e+02, percent-clipped=0.0 2023-06-20 05:28:52,136 INFO [train.py:996] (2/4) Epoch 3, batch 21500, loss[loss=0.2358, simple_loss=0.2873, pruned_loss=0.09215, over 21453.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3162, pruned_loss=0.09411, over 4270743.64 frames. 
], batch size: 194, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:29:20,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=494994.0, ans=0.0 2023-06-20 05:30:30,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=495174.0, ans=0.125 2023-06-20 05:30:42,577 INFO [train.py:996] (2/4) Epoch 3, batch 21550, loss[loss=0.1949, simple_loss=0.2544, pruned_loss=0.06771, over 21330.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3074, pruned_loss=0.09056, over 4263751.69 frames. ], batch size: 160, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:31:00,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=495294.0, ans=0.125 2023-06-20 05:31:11,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=495294.0, ans=0.0 2023-06-20 05:31:28,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=495354.0, ans=0.1 2023-06-20 05:32:24,794 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.424e+02 3.051e+02 3.633e+02 8.477e+02, threshold=6.101e+02, percent-clipped=5.0 2023-06-20 05:32:38,621 INFO [train.py:996] (2/4) Epoch 3, batch 21600, loss[loss=0.2176, simple_loss=0.3066, pruned_loss=0.06429, over 20768.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3017, pruned_loss=0.08831, over 4254019.03 frames. ], batch size: 607, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:33:45,662 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:34:27,996 INFO [train.py:996] (2/4) Epoch 3, batch 21650, loss[loss=0.243, simple_loss=0.3328, pruned_loss=0.07661, over 21625.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3067, pruned_loss=0.08652, over 4252704.32 frames. ], batch size: 230, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:35:24,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=496014.0, ans=0.0 2023-06-20 05:35:50,978 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.412e+02 2.832e+02 3.352e+02 7.209e+02, threshold=5.664e+02, percent-clipped=3.0 2023-06-20 05:35:54,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=496074.0, ans=0.125 2023-06-20 05:35:58,261 INFO [train.py:996] (2/4) Epoch 3, batch 21700, loss[loss=0.2546, simple_loss=0.3093, pruned_loss=0.09992, over 21779.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3081, pruned_loss=0.08478, over 4254331.57 frames. ], batch size: 351, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:36:35,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=496254.0, ans=0.125 2023-06-20 05:36:45,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=496254.0, ans=0.0 2023-06-20 05:36:48,980 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.50 vs. 
limit=15.0 2023-06-20 05:36:58,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=496314.0, ans=0.125 2023-06-20 05:37:00,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=496314.0, ans=0.125 2023-06-20 05:37:33,331 INFO [train.py:996] (2/4) Epoch 3, batch 21750, loss[loss=0.1991, simple_loss=0.2605, pruned_loss=0.06888, over 21265.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.303, pruned_loss=0.0841, over 4246823.35 frames. ], batch size: 144, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:37:37,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-20 05:37:43,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=496434.0, ans=0.05 2023-06-20 05:37:46,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=496494.0, ans=0.125 2023-06-20 05:37:51,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=496494.0, ans=0.0 2023-06-20 05:38:15,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-20 05:38:22,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=496614.0, ans=0.125 2023-06-20 05:38:44,901 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:39:02,546 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.434e+02 2.778e+02 3.312e+02 5.034e+02, threshold=5.556e+02, percent-clipped=0.0 2023-06-20 05:39:10,164 INFO [train.py:996] (2/4) Epoch 3, batch 21800, loss[loss=0.2738, simple_loss=0.3097, pruned_loss=0.1189, over 21328.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3014, pruned_loss=0.08539, over 4256873.59 frames. ], batch size: 473, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:39:31,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=496794.0, ans=0.125 2023-06-20 05:39:50,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=496854.0, ans=0.0 2023-06-20 05:39:51,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=22.5 2023-06-20 05:40:49,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=496974.0, ans=0.0 2023-06-20 05:40:51,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=496974.0, ans=0.0 2023-06-20 05:40:56,341 INFO [train.py:996] (2/4) Epoch 3, batch 21850, loss[loss=0.2199, simple_loss=0.3078, pruned_loss=0.06597, over 19835.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.307, pruned_loss=0.08622, over 4263917.48 frames. 
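The "grad_scale:" field is the dynamic fp16 loss-scaling factor for this run (use_fp16=True in the config); it backs off from 32.0 to 16.0 around batch 21100 above and is back at 32.0 by batch 21600, which is the usual behaviour of a dynamic scaler after an overflow. A generic torch.cuda.amp sketch of that mechanism; the initial scale and growth interval used by this recipe are assumptions.

import torch

scaler = torch.cuda.amp.GradScaler()

def fp16_step(model, optimizer, compute_loss, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if the gradients overflowed
    scaler.update()          # backs the scale off on overflow, regrows it periodically
    return loss.detach(), scaler.get_scale()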
], batch size: 702, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:41:00,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=497034.0, ans=0.5 2023-06-20 05:41:47,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=497154.0, ans=0.0 2023-06-20 05:42:29,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.62 vs. limit=15.0 2023-06-20 05:42:38,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=497274.0, ans=0.1 2023-06-20 05:42:40,775 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.614e+02 3.041e+02 3.805e+02 7.327e+02, threshold=6.083e+02, percent-clipped=2.0 2023-06-20 05:42:43,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=497274.0, ans=0.0 2023-06-20 05:42:48,471 INFO [train.py:996] (2/4) Epoch 3, batch 21900, loss[loss=0.2413, simple_loss=0.2968, pruned_loss=0.09293, over 21707.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3089, pruned_loss=0.08786, over 4270407.12 frames. ], batch size: 112, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:44:27,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=497574.0, ans=0.0 2023-06-20 05:44:33,684 INFO [train.py:996] (2/4) Epoch 3, batch 21950, loss[loss=0.1736, simple_loss=0.2616, pruned_loss=0.04281, over 21719.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3037, pruned_loss=0.08632, over 4267090.48 frames. ], batch size: 333, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:44:45,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=497634.0, ans=0.125 2023-06-20 05:44:52,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=497694.0, ans=0.0 2023-06-20 05:44:55,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=497694.0, ans=0.125 2023-06-20 05:46:12,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-20 05:46:13,308 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 2.252e+02 2.673e+02 3.306e+02 5.194e+02, threshold=5.347e+02, percent-clipped=0.0 2023-06-20 05:46:21,085 INFO [train.py:996] (2/4) Epoch 3, batch 22000, loss[loss=0.2331, simple_loss=0.2949, pruned_loss=0.08568, over 21725.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.2981, pruned_loss=0.084, over 4276412.46 frames. ], batch size: 112, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:46:38,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=497934.0, ans=0.125 2023-06-20 05:46:41,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=497994.0, ans=0.2 2023-06-20 05:46:42,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.68 vs. 
limit=6.0 2023-06-20 05:47:11,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=498054.0, ans=0.0 2023-06-20 05:48:03,702 INFO [train.py:996] (2/4) Epoch 3, batch 22050, loss[loss=0.2631, simple_loss=0.3334, pruned_loss=0.09641, over 21166.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3039, pruned_loss=0.086, over 4263332.70 frames. ], batch size: 143, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:48:47,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=498294.0, ans=0.125 2023-06-20 05:49:14,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=498354.0, ans=0.125 2023-06-20 05:49:38,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-20 05:49:41,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=498414.0, ans=0.125 2023-06-20 05:49:58,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=498474.0, ans=0.125 2023-06-20 05:50:00,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.63 vs. limit=15.0 2023-06-20 05:50:01,054 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 3.110e+02 3.918e+02 4.887e+02 9.595e+02, threshold=7.836e+02, percent-clipped=17.0 2023-06-20 05:50:07,366 INFO [train.py:996] (2/4) Epoch 3, batch 22100, loss[loss=0.2478, simple_loss=0.3073, pruned_loss=0.09416, over 21787.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3138, pruned_loss=0.09086, over 4265317.99 frames. ], batch size: 247, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:50:27,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=498594.0, ans=0.125 2023-06-20 05:51:37,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=498774.0, ans=0.0 2023-06-20 05:51:50,231 INFO [train.py:996] (2/4) Epoch 3, batch 22150, loss[loss=0.2551, simple_loss=0.3308, pruned_loss=0.08965, over 21313.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3157, pruned_loss=0.09188, over 4272794.74 frames. ], batch size: 159, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 05:52:15,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=498894.0, ans=0.025 2023-06-20 05:52:28,457 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0 2023-06-20 05:52:49,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=498954.0, ans=0.125 2023-06-20 05:52:51,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. 
limit=15.0 2023-06-20 05:53:10,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=499014.0, ans=0.04949747468305833 2023-06-20 05:53:30,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.905e+02 3.355e+02 4.260e+02 6.840e+02, threshold=6.709e+02, percent-clipped=0.0 2023-06-20 05:53:32,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.00 vs. limit=15.0 2023-06-20 05:53:41,751 INFO [train.py:996] (2/4) Epoch 3, batch 22200, loss[loss=0.2447, simple_loss=0.3346, pruned_loss=0.07743, over 21322.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3192, pruned_loss=0.09417, over 4277584.87 frames. ], batch size: 176, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 05:53:49,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=499134.0, ans=0.0 2023-06-20 05:53:51,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=499134.0, ans=0.125 2023-06-20 05:53:59,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=499194.0, ans=0.2 2023-06-20 05:54:03,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=499194.0, ans=0.2 2023-06-20 05:54:22,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499254.0, ans=0.1 2023-06-20 05:54:42,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=499314.0, ans=0.125 2023-06-20 05:54:45,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=499314.0, ans=0.2 2023-06-20 05:54:50,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=499314.0, ans=0.125 2023-06-20 05:55:18,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=499374.0, ans=0.2 2023-06-20 05:55:37,061 INFO [train.py:996] (2/4) Epoch 3, batch 22250, loss[loss=0.3067, simple_loss=0.3713, pruned_loss=0.121, over 21591.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3267, pruned_loss=0.09616, over 4272882.50 frames. ], batch size: 414, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 05:56:15,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=22.5 2023-06-20 05:56:52,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=499674.0, ans=0.2 2023-06-20 05:56:56,408 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.809e+02 3.391e+02 3.899e+02 6.756e+02, threshold=6.782e+02, percent-clipped=1.0 2023-06-20 05:57:08,134 INFO [train.py:996] (2/4) Epoch 3, batch 22300, loss[loss=0.2493, simple_loss=0.3378, pruned_loss=0.08035, over 19945.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3287, pruned_loss=0.09862, over 4267722.61 frames. 
], batch size: 702, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 05:57:12,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=499734.0, ans=0.2 2023-06-20 05:57:34,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=499794.0, ans=0.0 2023-06-20 05:58:19,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=499914.0, ans=0.125 2023-06-20 05:58:42,884 INFO [train.py:996] (2/4) Epoch 3, batch 22350, loss[loss=0.3013, simple_loss=0.3482, pruned_loss=0.1272, over 21692.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3274, pruned_loss=0.09957, over 4280626.93 frames. ], batch size: 473, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 05:58:49,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=500034.0, ans=0.125 2023-06-20 05:58:57,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=500094.0, ans=0.125 2023-06-20 06:00:13,795 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.473e+02 2.755e+02 3.372e+02 7.896e+02, threshold=5.510e+02, percent-clipped=3.0 2023-06-20 06:00:19,818 INFO [train.py:996] (2/4) Epoch 3, batch 22400, loss[loss=0.268, simple_loss=0.3419, pruned_loss=0.09701, over 21632.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.323, pruned_loss=0.09538, over 4282696.01 frames. ], batch size: 414, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:00:49,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=500394.0, ans=0.2 2023-06-20 06:00:56,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=500394.0, ans=0.0 2023-06-20 06:01:21,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-20 06:02:01,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=500574.0, ans=0.125 2023-06-20 06:02:04,299 INFO [train.py:996] (2/4) Epoch 3, batch 22450, loss[loss=0.2381, simple_loss=0.2845, pruned_loss=0.09589, over 21241.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3167, pruned_loss=0.09401, over 4282899.53 frames. 
], batch size: 144, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:02:33,366 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:03:11,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=500754.0, ans=0.0 2023-06-20 06:03:38,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=500814.0, ans=0.125 2023-06-20 06:03:38,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=500814.0, ans=0.0 2023-06-20 06:03:46,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=500874.0, ans=0.05 2023-06-20 06:03:50,256 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.596e+02 2.879e+02 3.313e+02 5.071e+02, threshold=5.757e+02, percent-clipped=0.0 2023-06-20 06:03:52,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=500874.0, ans=0.09899494936611666 2023-06-20 06:03:56,320 INFO [train.py:996] (2/4) Epoch 3, batch 22500, loss[loss=0.2295, simple_loss=0.3023, pruned_loss=0.07836, over 21711.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3126, pruned_loss=0.09351, over 4275268.21 frames. ], batch size: 124, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:03:59,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=500934.0, ans=0.125 2023-06-20 06:04:07,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=500934.0, ans=0.125 2023-06-20 06:04:43,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=501054.0, ans=0.0 2023-06-20 06:05:37,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=501174.0, ans=0.125 2023-06-20 06:05:39,778 INFO [train.py:996] (2/4) Epoch 3, batch 22550, loss[loss=0.2323, simple_loss=0.2992, pruned_loss=0.0827, over 21519.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3154, pruned_loss=0.09298, over 4279811.79 frames. ], batch size: 194, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 06:06:24,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=501294.0, ans=0.125 2023-06-20 06:06:28,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=501294.0, ans=0.125 2023-06-20 06:07:02,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=501414.0, ans=0.125 2023-06-20 06:07:23,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=501474.0, ans=0.1 2023-06-20 06:07:33,203 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.773e+02 3.379e+02 4.240e+02 8.103e+02, threshold=6.757e+02, percent-clipped=8.0 2023-06-20 06:07:43,153 INFO [train.py:996] (2/4) Epoch 3, batch 22600, loss[loss=0.3168, simple_loss=0.3871, pruned_loss=0.1233, over 21547.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3185, pruned_loss=0.09336, over 4276961.79 frames. 
], batch size: 471, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 06:07:45,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=501534.0, ans=0.125 2023-06-20 06:07:56,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=501534.0, ans=0.0 2023-06-20 06:08:49,574 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:08:55,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=501774.0, ans=0.05 2023-06-20 06:09:16,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=501774.0, ans=0.125 2023-06-20 06:09:25,247 INFO [train.py:996] (2/4) Epoch 3, batch 22650, loss[loss=0.2286, simple_loss=0.2895, pruned_loss=0.08382, over 21842.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3154, pruned_loss=0.09251, over 4270435.81 frames. ], batch size: 98, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 06:09:39,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=501834.0, ans=0.035 2023-06-20 06:10:17,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=501954.0, ans=0.2 2023-06-20 06:11:00,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=502074.0, ans=0.0 2023-06-20 06:11:08,602 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.541e+02 2.941e+02 3.391e+02 5.583e+02, threshold=5.883e+02, percent-clipped=0.0 2023-06-20 06:11:18,222 INFO [train.py:996] (2/4) Epoch 3, batch 22700, loss[loss=0.2079, simple_loss=0.2745, pruned_loss=0.0706, over 21807.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3092, pruned_loss=0.09174, over 4267324.75 frames. ], batch size: 112, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 06:11:46,656 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:12:11,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502254.0, ans=0.1 2023-06-20 06:12:56,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=502374.0, ans=0.0 2023-06-20 06:13:09,028 INFO [train.py:996] (2/4) Epoch 3, batch 22750, loss[loss=0.2816, simple_loss=0.3439, pruned_loss=0.1096, over 21751.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.311, pruned_loss=0.09399, over 4268816.69 frames. 
], batch size: 351, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 06:14:14,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=502614.0, ans=0.125 2023-06-20 06:14:17,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=502614.0, ans=0.2 2023-06-20 06:14:20,159 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:14:32,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502674.0, ans=0.1 2023-06-20 06:14:51,656 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.900e+02 3.287e+02 3.902e+02 7.614e+02, threshold=6.575e+02, percent-clipped=5.0 2023-06-20 06:15:01,368 INFO [train.py:996] (2/4) Epoch 3, batch 22800, loss[loss=0.2435, simple_loss=0.3135, pruned_loss=0.0867, over 21997.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3153, pruned_loss=0.09655, over 4276067.85 frames. ], batch size: 103, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:15:13,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-20 06:15:48,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=502854.0, ans=0.125 2023-06-20 06:16:00,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=502914.0, ans=0.125 2023-06-20 06:16:03,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=502914.0, ans=0.125 2023-06-20 06:16:20,475 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:16:32,951 INFO [train.py:996] (2/4) Epoch 3, batch 22850, loss[loss=0.2035, simple_loss=0.2676, pruned_loss=0.06969, over 21777.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3106, pruned_loss=0.09519, over 4266325.15 frames. ], batch size: 124, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:17:14,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=503154.0, ans=0.0 2023-06-20 06:17:23,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=503154.0, ans=0.2 2023-06-20 06:17:47,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503274.0, ans=0.1 2023-06-20 06:17:59,983 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.770e+02 3.456e+02 4.106e+02 7.202e+02, threshold=6.912e+02, percent-clipped=4.0 2023-06-20 06:18:00,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503274.0, ans=0.1 2023-06-20 06:18:10,138 INFO [train.py:996] (2/4) Epoch 3, batch 22900, loss[loss=0.2787, simple_loss=0.3687, pruned_loss=0.09433, over 20790.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3137, pruned_loss=0.09465, over 4266603.73 frames. 
], batch size: 608, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:18:50,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=503394.0, ans=15.0 2023-06-20 06:18:53,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=503454.0, ans=0.2 2023-06-20 06:18:54,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=503454.0, ans=0.04949747468305833 2023-06-20 06:20:01,147 INFO [train.py:996] (2/4) Epoch 3, batch 22950, loss[loss=0.2591, simple_loss=0.3775, pruned_loss=0.07036, over 21315.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3286, pruned_loss=0.09257, over 4273256.08 frames. ], batch size: 548, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:20:19,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=503634.0, ans=0.0 2023-06-20 06:20:21,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.48 vs. limit=5.0 2023-06-20 06:20:44,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-20 06:21:57,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.454e+02 2.874e+02 3.736e+02 7.174e+02, threshold=5.748e+02, percent-clipped=1.0 2023-06-20 06:22:01,662 INFO [train.py:996] (2/4) Epoch 3, batch 23000, loss[loss=0.2409, simple_loss=0.303, pruned_loss=0.08938, over 21250.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3256, pruned_loss=0.0897, over 4273284.53 frames. ], batch size: 159, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:22:03,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=503934.0, ans=0.125 2023-06-20 06:22:49,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-20 06:23:48,137 INFO [train.py:996] (2/4) Epoch 3, batch 23050, loss[loss=0.3401, simple_loss=0.3847, pruned_loss=0.1478, over 21416.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3281, pruned_loss=0.09293, over 4275752.43 frames. ], batch size: 471, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:23:48,725 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:24:19,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504354.0, ans=0.1 2023-06-20 06:24:24,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504354.0, ans=0.1 2023-06-20 06:25:03,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-20 06:25:23,655 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.688e+02 2.976e+02 3.405e+02 5.912e+02, threshold=5.952e+02, percent-clipped=1.0 2023-06-20 06:25:28,345 INFO [train.py:996] (2/4) Epoch 3, batch 23100, loss[loss=0.228, simple_loss=0.2827, pruned_loss=0.08664, over 21539.00 frames. 
], tot_loss[loss=0.2548, simple_loss=0.3235, pruned_loss=0.09306, over 4271920.81 frames. ], batch size: 391, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:27:07,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=504774.0, ans=0.125 2023-06-20 06:27:11,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=504774.0, ans=0.04949747468305833 2023-06-20 06:27:26,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.95 vs. limit=22.5 2023-06-20 06:27:30,100 INFO [train.py:996] (2/4) Epoch 3, batch 23150, loss[loss=0.2519, simple_loss=0.3086, pruned_loss=0.09765, over 21303.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3169, pruned_loss=0.09171, over 4269922.92 frames. ], batch size: 176, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:27:31,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504834.0, ans=0.1 2023-06-20 06:27:39,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=504834.0, ans=0.0 2023-06-20 06:27:57,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=504894.0, ans=0.0 2023-06-20 06:28:06,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504894.0, ans=0.1 2023-06-20 06:28:24,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=504954.0, ans=0.125 2023-06-20 06:28:43,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=505014.0, ans=0.0 2023-06-20 06:29:20,474 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.598e+02 2.935e+02 3.793e+02 5.216e+02, threshold=5.870e+02, percent-clipped=0.0 2023-06-20 06:29:31,070 INFO [train.py:996] (2/4) Epoch 3, batch 23200, loss[loss=0.2422, simple_loss=0.3038, pruned_loss=0.09023, over 21726.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3159, pruned_loss=0.09275, over 4278810.38 frames. ], batch size: 230, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:29:37,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=505134.0, ans=0.2 2023-06-20 06:30:15,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-20 06:31:27,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=505374.0, ans=0.2 2023-06-20 06:31:31,759 INFO [train.py:996] (2/4) Epoch 3, batch 23250, loss[loss=0.2385, simple_loss=0.2994, pruned_loss=0.08884, over 21876.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3163, pruned_loss=0.09436, over 4288997.93 frames. ], batch size: 298, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:32:01,470 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. 
limit=15.0 2023-06-20 06:32:24,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-20 06:33:25,527 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.860e+02 3.314e+02 4.094e+02 6.959e+02, threshold=6.628e+02, percent-clipped=4.0 2023-06-20 06:33:30,043 INFO [train.py:996] (2/4) Epoch 3, batch 23300, loss[loss=0.4042, simple_loss=0.4711, pruned_loss=0.1686, over 21457.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3245, pruned_loss=0.0965, over 4289741.34 frames. ], batch size: 507, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:33:53,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-20 06:34:14,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-20 06:34:34,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505854.0, ans=0.1 2023-06-20 06:35:13,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=505914.0, ans=0.07 2023-06-20 06:35:36,333 INFO [train.py:996] (2/4) Epoch 3, batch 23350, loss[loss=0.2899, simple_loss=0.3728, pruned_loss=0.1035, over 21624.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3275, pruned_loss=0.09466, over 4278933.04 frames. ], batch size: 414, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:35:39,738 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:35:41,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=506034.0, ans=0.125 2023-06-20 06:36:08,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=506034.0, ans=0.0 2023-06-20 06:36:28,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-20 06:36:29,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=506094.0, ans=0.025 2023-06-20 06:37:22,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=506274.0, ans=0.0 2023-06-20 06:37:27,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-20 06:37:28,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506274.0, ans=0.1 2023-06-20 06:37:32,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.468e+02 2.769e+02 3.183e+02 5.356e+02, threshold=5.538e+02, percent-clipped=0.0 2023-06-20 06:37:36,528 INFO [train.py:996] (2/4) Epoch 3, batch 23400, loss[loss=0.2372, simple_loss=0.297, pruned_loss=0.08874, over 21148.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3194, pruned_loss=0.08996, over 4274356.28 frames. 
], batch size: 607, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:38:06,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=506334.0, ans=0.125 2023-06-20 06:38:24,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=506394.0, ans=0.04949747468305833 2023-06-20 06:38:56,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=506514.0, ans=0.125 2023-06-20 06:39:03,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-06-20 06:39:05,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506514.0, ans=0.1 2023-06-20 06:39:17,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=506574.0, ans=0.0 2023-06-20 06:39:24,467 INFO [train.py:996] (2/4) Epoch 3, batch 23450, loss[loss=0.2706, simple_loss=0.3377, pruned_loss=0.1017, over 21897.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3209, pruned_loss=0.09276, over 4279664.37 frames. ], batch size: 334, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:39:27,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=506634.0, ans=0.2 2023-06-20 06:39:35,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-06-20 06:39:35,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.80 vs. limit=8.0 2023-06-20 06:40:06,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=506694.0, ans=0.125 2023-06-20 06:40:44,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506754.0, ans=0.1 2023-06-20 06:40:46,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.83 vs. limit=10.0 2023-06-20 06:41:05,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=506874.0, ans=0.0 2023-06-20 06:41:16,375 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.915e+02 3.484e+02 4.358e+02 6.885e+02, threshold=6.969e+02, percent-clipped=11.0 2023-06-20 06:41:20,583 INFO [train.py:996] (2/4) Epoch 3, batch 23500, loss[loss=0.2418, simple_loss=0.3069, pruned_loss=0.0883, over 21566.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.321, pruned_loss=0.09477, over 4276757.36 frames. ], batch size: 212, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:41:41,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=506934.0, ans=0.2 2023-06-20 06:42:32,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=507114.0, ans=0.2 2023-06-20 06:43:16,474 INFO [train.py:996] (2/4) Epoch 3, batch 23550, loss[loss=0.2302, simple_loss=0.285, pruned_loss=0.08771, over 21625.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3167, pruned_loss=0.0936, over 4264288.01 frames. 
], batch size: 247, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:43:43,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507294.0, ans=0.1 2023-06-20 06:44:25,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=507354.0, ans=0.125 2023-06-20 06:44:45,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=507474.0, ans=0.125 2023-06-20 06:44:52,572 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.559e+02 3.076e+02 4.076e+02 7.256e+02, threshold=6.152e+02, percent-clipped=1.0 2023-06-20 06:44:53,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.44 vs. limit=15.0 2023-06-20 06:45:02,172 INFO [train.py:996] (2/4) Epoch 3, batch 23600, loss[loss=0.2763, simple_loss=0.3518, pruned_loss=0.1004, over 21784.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3186, pruned_loss=0.0944, over 4263577.86 frames. ], batch size: 118, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:45:43,960 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2023-06-20 06:46:35,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=507714.0, ans=0.07 2023-06-20 06:47:23,359 INFO [train.py:996] (2/4) Epoch 3, batch 23650, loss[loss=0.3596, simple_loss=0.4061, pruned_loss=0.1565, over 21351.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3176, pruned_loss=0.0924, over 4267957.80 frames. ], batch size: 507, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:47:51,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=507894.0, ans=0.0 2023-06-20 06:47:59,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=507894.0, ans=0.95 2023-06-20 06:48:15,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-20 06:48:55,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=508014.0, ans=0.125 2023-06-20 06:49:20,962 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.816e+02 3.255e+02 3.884e+02 5.582e+02, threshold=6.510e+02, percent-clipped=0.0 2023-06-20 06:49:25,358 INFO [train.py:996] (2/4) Epoch 3, batch 23700, loss[loss=0.22, simple_loss=0.288, pruned_loss=0.07597, over 21404.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3205, pruned_loss=0.09157, over 4265909.64 frames. 
], batch size: 211, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:49:25,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=508134.0, ans=0.04949747468305833 2023-06-20 06:49:27,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=508134.0, ans=0.07 2023-06-20 06:49:55,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=508134.0, ans=0.125 2023-06-20 06:50:08,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=508194.0, ans=0.04949747468305833 2023-06-20 06:50:49,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=508314.0, ans=0.0 2023-06-20 06:51:12,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-06-20 06:51:16,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=508374.0, ans=0.125 2023-06-20 06:51:38,351 INFO [train.py:996] (2/4) Epoch 3, batch 23750, loss[loss=0.2245, simple_loss=0.3216, pruned_loss=0.06372, over 21701.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3232, pruned_loss=0.09256, over 4274522.56 frames. ], batch size: 351, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 06:51:58,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508494.0, ans=0.1 2023-06-20 06:52:05,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-20 06:52:59,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=508614.0, ans=0.0 2023-06-20 06:53:26,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.98 vs. limit=10.0 2023-06-20 06:53:28,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=508674.0, ans=0.0 2023-06-20 06:53:32,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=508674.0, ans=0.125 2023-06-20 06:53:41,941 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.562e+02 2.929e+02 3.446e+02 6.270e+02, threshold=5.857e+02, percent-clipped=0.0 2023-06-20 06:53:42,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=508674.0, ans=0.125 2023-06-20 06:53:44,620 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-20 06:53:46,592 INFO [train.py:996] (2/4) Epoch 3, batch 23800, loss[loss=0.2332, simple_loss=0.2967, pruned_loss=0.08481, over 21746.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3197, pruned_loss=0.08959, over 4265811.01 frames. 
], batch size: 112, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 06:55:11,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=508914.0, ans=0.2 2023-06-20 06:55:13,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=508914.0, ans=0.125 2023-06-20 06:55:16,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=508914.0, ans=0.2 2023-06-20 06:55:43,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-20 06:55:47,553 INFO [train.py:996] (2/4) Epoch 3, batch 23850, loss[loss=0.288, simple_loss=0.3529, pruned_loss=0.1116, over 21860.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3323, pruned_loss=0.09368, over 4265736.11 frames. ], batch size: 371, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 06:57:01,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=509214.0, ans=0.0 2023-06-20 06:57:18,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=509214.0, ans=0.0 2023-06-20 06:57:31,020 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.767e+02 3.308e+02 4.321e+02 7.651e+02, threshold=6.615e+02, percent-clipped=11.0 2023-06-20 06:57:35,368 INFO [train.py:996] (2/4) Epoch 3, batch 23900, loss[loss=0.2707, simple_loss=0.3419, pruned_loss=0.09975, over 21858.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3396, pruned_loss=0.09579, over 4271075.67 frames. ], batch size: 98, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 06:58:03,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=509394.0, ans=0.125 2023-06-20 06:58:44,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509514.0, ans=0.1 2023-06-20 06:58:45,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=509514.0, ans=0.0 2023-06-20 06:59:14,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=509574.0, ans=15.0 2023-06-20 06:59:14,398 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.63 vs. limit=15.0 2023-06-20 06:59:19,417 INFO [train.py:996] (2/4) Epoch 3, batch 23950, loss[loss=0.2686, simple_loss=0.3259, pruned_loss=0.1056, over 21246.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3316, pruned_loss=0.09501, over 4268998.86 frames. 
], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:00:18,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=509754.0, ans=0.125 2023-06-20 07:00:24,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=509814.0, ans=0.0 2023-06-20 07:00:34,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=509814.0, ans=0.0 2023-06-20 07:00:40,935 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.46 vs. limit=22.5 2023-06-20 07:00:44,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=509874.0, ans=0.125 2023-06-20 07:00:51,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.718e+02 3.082e+02 3.773e+02 7.054e+02, threshold=6.164e+02, percent-clipped=1.0 2023-06-20 07:00:55,597 INFO [train.py:996] (2/4) Epoch 3, batch 24000, loss[loss=0.285, simple_loss=0.3508, pruned_loss=0.1096, over 21286.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3338, pruned_loss=0.09851, over 4269634.71 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:00:55,597 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 07:01:48,510 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3207, 1.9081, 3.5454, 3.4921], device='cuda:2') 2023-06-20 07:01:50,149 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.3569, 4.8364, 5.1259, 4.6174], device='cuda:2') 2023-06-20 07:02:00,876 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2795, simple_loss=0.3782, pruned_loss=0.09043, over 1796401.00 frames. 2023-06-20 07:02:00,877 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 07:02:29,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=509994.0, ans=0.125 2023-06-20 07:02:58,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-20 07:03:01,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=510114.0, ans=0.1 2023-06-20 07:04:02,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=510174.0, ans=0.0 2023-06-20 07:04:06,968 INFO [train.py:996] (2/4) Epoch 3, batch 24050, loss[loss=0.2498, simple_loss=0.3304, pruned_loss=0.08458, over 21638.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3345, pruned_loss=0.09897, over 4272239.72 frames. 
], batch size: 414, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:04:26,258 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:05:15,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=510414.0, ans=0.125 2023-06-20 07:05:23,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-20 07:05:28,470 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2023-06-20 07:05:29,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=510414.0, ans=0.125 2023-06-20 07:05:55,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=510474.0, ans=0.125 2023-06-20 07:06:02,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=510474.0, ans=0.0 2023-06-20 07:06:07,956 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.566e+02 3.148e+02 3.843e+02 5.773e+02, threshold=6.296e+02, percent-clipped=0.0 2023-06-20 07:06:17,655 INFO [train.py:996] (2/4) Epoch 3, batch 24100, loss[loss=0.2731, simple_loss=0.3163, pruned_loss=0.115, over 20111.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3334, pruned_loss=0.09656, over 4269352.14 frames. ], batch size: 702, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:06:50,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=510594.0, ans=0.0 2023-06-20 07:07:39,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=510774.0, ans=0.0 2023-06-20 07:08:05,129 INFO [train.py:996] (2/4) Epoch 3, batch 24150, loss[loss=0.3146, simple_loss=0.3623, pruned_loss=0.1334, over 21696.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3342, pruned_loss=0.09931, over 4275553.36 frames. ], batch size: 389, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:08:08,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=510834.0, ans=0.0 2023-06-20 07:08:43,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=510894.0, ans=0.025 2023-06-20 07:09:53,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=511074.0, ans=0.125 2023-06-20 07:10:00,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.945e+02 3.371e+02 4.147e+02 7.109e+02, threshold=6.741e+02, percent-clipped=1.0 2023-06-20 07:10:05,089 INFO [train.py:996] (2/4) Epoch 3, batch 24200, loss[loss=0.394, simple_loss=0.4374, pruned_loss=0.1753, over 21502.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3352, pruned_loss=0.1002, over 4271509.44 frames. 
], batch size: 508, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:10:19,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=511134.0, ans=0.2 2023-06-20 07:12:05,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=511374.0, ans=0.1 2023-06-20 07:12:08,268 INFO [train.py:996] (2/4) Epoch 3, batch 24250, loss[loss=0.1983, simple_loss=0.2994, pruned_loss=0.04867, over 21671.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3307, pruned_loss=0.09253, over 4275267.44 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:12:59,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=511554.0, ans=0.2 2023-06-20 07:13:07,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=511554.0, ans=0.0 2023-06-20 07:13:20,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=511614.0, ans=0.0 2023-06-20 07:13:51,207 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 2.243e+02 2.741e+02 3.193e+02 5.760e+02, threshold=5.481e+02, percent-clipped=0.0 2023-06-20 07:14:00,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-20 07:14:00,783 INFO [train.py:996] (2/4) Epoch 3, batch 24300, loss[loss=0.1582, simple_loss=0.2503, pruned_loss=0.03305, over 21739.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3212, pruned_loss=0.08578, over 4274904.30 frames. ], batch size: 316, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:14:10,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=15.0 2023-06-20 07:14:56,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=511854.0, ans=0.0 2023-06-20 07:16:04,843 INFO [train.py:996] (2/4) Epoch 3, batch 24350, loss[loss=0.2564, simple_loss=0.3291, pruned_loss=0.09188, over 21844.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3178, pruned_loss=0.08621, over 4274771.52 frames. ], batch size: 332, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:17:10,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=512154.0, ans=0.07 2023-06-20 07:17:32,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=512214.0, ans=0.125 2023-06-20 07:17:33,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=512214.0, ans=0.0 2023-06-20 07:18:15,539 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.742e+02 3.210e+02 4.189e+02 6.509e+02, threshold=6.419e+02, percent-clipped=5.0 2023-06-20 07:18:20,252 INFO [train.py:996] (2/4) Epoch 3, batch 24400, loss[loss=0.2484, simple_loss=0.3302, pruned_loss=0.08332, over 21752.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3243, pruned_loss=0.09056, over 4274942.50 frames. ], batch size: 247, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:20:11,822 INFO [train.py:996] (2/4) Epoch 3, batch 24450, loss[loss=0.2502, simple_loss=0.3112, pruned_loss=0.09461, over 21155.00 frames. 
], tot_loss[loss=0.2575, simple_loss=0.3295, pruned_loss=0.0928, over 4268594.43 frames. ], batch size: 143, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:20:37,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=512634.0, ans=0.125 2023-06-20 07:20:45,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-20 07:21:08,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=512754.0, ans=0.035 2023-06-20 07:22:18,357 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.557e+02 2.863e+02 3.394e+02 4.374e+02, threshold=5.725e+02, percent-clipped=0.0 2023-06-20 07:22:28,443 INFO [train.py:996] (2/4) Epoch 3, batch 24500, loss[loss=0.274, simple_loss=0.3316, pruned_loss=0.1082, over 21896.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3287, pruned_loss=0.09265, over 4273362.10 frames. ], batch size: 107, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:22:46,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-20 07:23:09,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=513054.0, ans=0.04949747468305833 2023-06-20 07:23:32,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-06-20 07:23:47,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513174.0, ans=0.1 2023-06-20 07:24:10,078 INFO [train.py:996] (2/4) Epoch 3, batch 24550, loss[loss=0.3043, simple_loss=0.3734, pruned_loss=0.1176, over 21855.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3307, pruned_loss=0.09522, over 4271436.24 frames. ], batch size: 124, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:24:15,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=12.0 2023-06-20 07:24:32,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513294.0, ans=0.1 2023-06-20 07:24:45,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=513294.0, ans=0.125 2023-06-20 07:24:54,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=513354.0, ans=0.125 2023-06-20 07:25:11,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=513354.0, ans=0.125 2023-06-20 07:25:26,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=513414.0, ans=0.2 2023-06-20 07:25:29,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=513414.0, ans=0.125 2023-06-20 07:25:56,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. 
limit=15.0 2023-06-20 07:25:57,166 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.731e+02 3.290e+02 3.893e+02 7.449e+02, threshold=6.579e+02, percent-clipped=2.0 2023-06-20 07:26:00,007 INFO [train.py:996] (2/4) Epoch 3, batch 24600, loss[loss=0.2193, simple_loss=0.2857, pruned_loss=0.07643, over 21743.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3254, pruned_loss=0.09526, over 4272694.90 frames. ], batch size: 282, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:26:35,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-20 07:27:11,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=513654.0, ans=0.125 2023-06-20 07:27:46,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=513774.0, ans=0.125 2023-06-20 07:28:06,818 INFO [train.py:996] (2/4) Epoch 3, batch 24650, loss[loss=0.2415, simple_loss=0.288, pruned_loss=0.09751, over 21560.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3174, pruned_loss=0.09379, over 4275804.50 frames. ], batch size: 442, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:28:46,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=513894.0, ans=0.125 2023-06-20 07:28:46,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-20 07:29:25,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=514014.0, ans=0.125 2023-06-20 07:29:55,436 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.706e+02 3.131e+02 3.721e+02 9.290e+02, threshold=6.262e+02, percent-clipped=1.0 2023-06-20 07:29:58,418 INFO [train.py:996] (2/4) Epoch 3, batch 24700, loss[loss=0.2372, simple_loss=0.3022, pruned_loss=0.08611, over 21485.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3171, pruned_loss=0.09206, over 4268066.03 frames. ], batch size: 389, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:31:30,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=514314.0, ans=0.0 2023-06-20 07:31:41,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-20 07:31:44,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-20 07:31:48,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=514374.0, ans=0.0 2023-06-20 07:31:53,323 INFO [train.py:996] (2/4) Epoch 3, batch 24750, loss[loss=0.2744, simple_loss=0.3162, pruned_loss=0.1163, over 21510.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3097, pruned_loss=0.08904, over 4261747.25 frames. 
], batch size: 441, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:32:02,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=514434.0, ans=0.2 2023-06-20 07:32:08,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=514434.0, ans=0.0 2023-06-20 07:33:00,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=514554.0, ans=0.0 2023-06-20 07:33:52,501 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.345e+02 2.611e+02 2.913e+02 4.887e+02, threshold=5.223e+02, percent-clipped=0.0 2023-06-20 07:34:06,595 INFO [train.py:996] (2/4) Epoch 3, batch 24800, loss[loss=0.2783, simple_loss=0.339, pruned_loss=0.1088, over 21863.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3038, pruned_loss=0.08839, over 4262329.66 frames. ], batch size: 118, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:34:11,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=514734.0, ans=0.07 2023-06-20 07:34:40,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-20 07:35:17,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=514914.0, ans=0.125 2023-06-20 07:35:48,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-20 07:35:52,708 INFO [train.py:996] (2/4) Epoch 3, batch 24850, loss[loss=0.215, simple_loss=0.2646, pruned_loss=0.08266, over 21321.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3055, pruned_loss=0.09044, over 4267502.78 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:36:23,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=515034.0, ans=0.125 2023-06-20 07:36:25,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=515094.0, ans=0.125 2023-06-20 07:37:46,544 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.912e+02 3.452e+02 3.888e+02 6.528e+02, threshold=6.903e+02, percent-clipped=3.0 2023-06-20 07:37:49,498 INFO [train.py:996] (2/4) Epoch 3, batch 24900, loss[loss=0.2755, simple_loss=0.3375, pruned_loss=0.1067, over 21937.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3091, pruned_loss=0.09132, over 4272755.19 frames. ], batch size: 316, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:37:50,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=515334.0, ans=0.015 2023-06-20 07:38:54,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-20 07:38:54,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-20 07:39:39,334 INFO [train.py:996] (2/4) Epoch 3, batch 24950, loss[loss=0.2804, simple_loss=0.3518, pruned_loss=0.1045, over 21224.00 frames. 
], tot_loss[loss=0.254, simple_loss=0.3169, pruned_loss=0.09558, over 4269086.93 frames. ], batch size: 143, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:39:39,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=515634.0, ans=0.125 2023-06-20 07:39:43,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2023-06-20 07:39:45,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=515634.0, ans=0.05 2023-06-20 07:40:08,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=515694.0, ans=0.2 2023-06-20 07:40:56,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=515874.0, ans=0.125 2023-06-20 07:40:58,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=515874.0, ans=0.2 2023-06-20 07:41:31,270 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.00 vs. limit=10.0 2023-06-20 07:41:33,067 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.916e+02 3.619e+02 4.634e+02 7.027e+02, threshold=7.237e+02, percent-clipped=1.0 2023-06-20 07:41:36,104 INFO [train.py:996] (2/4) Epoch 3, batch 25000, loss[loss=0.2524, simple_loss=0.3162, pruned_loss=0.09433, over 21863.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3239, pruned_loss=0.09762, over 4271228.54 frames. ], batch size: 118, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:41:52,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=515934.0, ans=0.0 2023-06-20 07:41:56,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=515994.0, ans=0.1 2023-06-20 07:43:05,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=516114.0, ans=0.1 2023-06-20 07:43:25,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=516174.0, ans=0.2 2023-06-20 07:43:28,176 INFO [train.py:996] (2/4) Epoch 3, batch 25050, loss[loss=0.2266, simple_loss=0.2835, pruned_loss=0.08487, over 21317.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3166, pruned_loss=0.0957, over 4271411.23 frames. ], batch size: 160, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:44:46,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=516414.0, ans=0.125 2023-06-20 07:44:46,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=516414.0, ans=0.0 2023-06-20 07:45:30,414 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.526e+02 2.799e+02 3.395e+02 4.701e+02, threshold=5.598e+02, percent-clipped=0.0 2023-06-20 07:45:33,105 INFO [train.py:996] (2/4) Epoch 3, batch 25100, loss[loss=0.22, simple_loss=0.2795, pruned_loss=0.08029, over 21655.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3099, pruned_loss=0.09408, over 4270046.82 frames. 
], batch size: 282, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:45:33,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=516534.0, ans=0.0 2023-06-20 07:46:03,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=516594.0, ans=0.125 2023-06-20 07:46:14,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=516654.0, ans=0.0 2023-06-20 07:47:26,339 INFO [train.py:996] (2/4) Epoch 3, batch 25150, loss[loss=0.2176, simple_loss=0.3085, pruned_loss=0.06336, over 21449.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3147, pruned_loss=0.09193, over 4274767.31 frames. ], batch size: 211, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:48:02,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=516954.0, ans=0.125 2023-06-20 07:48:05,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=516954.0, ans=0.125 2023-06-20 07:48:28,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=517014.0, ans=0.125 2023-06-20 07:48:29,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=517014.0, ans=0.125 2023-06-20 07:49:12,220 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.389e+02 2.624e+02 3.346e+02 4.774e+02, threshold=5.249e+02, percent-clipped=0.0 2023-06-20 07:49:14,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-06-20 07:49:15,105 INFO [train.py:996] (2/4) Epoch 3, batch 25200, loss[loss=0.2052, simple_loss=0.2949, pruned_loss=0.05777, over 21562.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3149, pruned_loss=0.08978, over 4267580.79 frames. ], batch size: 230, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:49:28,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=517134.0, ans=0.07 2023-06-20 07:49:28,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=517134.0, ans=0.125 2023-06-20 07:49:36,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-20 07:49:54,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=517254.0, ans=0.125 2023-06-20 07:50:02,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=517254.0, ans=0.0 2023-06-20 07:51:12,226 INFO [train.py:996] (2/4) Epoch 3, batch 25250, loss[loss=0.2091, simple_loss=0.2732, pruned_loss=0.07255, over 21655.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3129, pruned_loss=0.0877, over 4273339.20 frames. 
], batch size: 282, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:51:24,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=517434.0, ans=0.0 2023-06-20 07:51:45,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517554.0, ans=0.1 2023-06-20 07:51:50,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=517554.0, ans=0.125 2023-06-20 07:51:58,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=517554.0, ans=0.125 2023-06-20 07:52:14,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=517614.0, ans=0.125 2023-06-20 07:53:05,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=517674.0, ans=0.125 2023-06-20 07:53:09,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.525e+02 2.860e+02 3.493e+02 8.717e+02, threshold=5.720e+02, percent-clipped=4.0 2023-06-20 07:53:12,429 INFO [train.py:996] (2/4) Epoch 3, batch 25300, loss[loss=0.2385, simple_loss=0.3173, pruned_loss=0.07991, over 21608.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3107, pruned_loss=0.08709, over 4275255.48 frames. ], batch size: 414, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:53:20,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=517734.0, ans=0.125 2023-06-20 07:53:39,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=517794.0, ans=0.125 2023-06-20 07:53:44,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=517794.0, ans=0.125 2023-06-20 07:54:14,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-20 07:54:50,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-20 07:54:58,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=518034.0, ans=0.125 2023-06-20 07:54:59,684 INFO [train.py:996] (2/4) Epoch 3, batch 25350, loss[loss=0.2131, simple_loss=0.2874, pruned_loss=0.06937, over 21382.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3136, pruned_loss=0.08716, over 4265352.38 frames. ], batch size: 131, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:56:36,802 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:56:48,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.16 vs. limit=10.0 2023-06-20 07:56:51,068 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.620e+02 3.050e+02 3.855e+02 6.289e+02, threshold=6.099e+02, percent-clipped=1.0 2023-06-20 07:56:53,738 INFO [train.py:996] (2/4) Epoch 3, batch 25400, loss[loss=0.248, simple_loss=0.3044, pruned_loss=0.09574, over 21226.00 frames. 
], tot_loss[loss=0.2401, simple_loss=0.308, pruned_loss=0.08605, over 4271212.38 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:57:18,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-06-20 07:57:22,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=518394.0, ans=0.125 2023-06-20 07:58:15,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.76 vs. limit=10.0 2023-06-20 07:58:22,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=518574.0, ans=0.125 2023-06-20 07:58:31,273 INFO [train.py:996] (2/4) Epoch 3, batch 25450, loss[loss=0.2796, simple_loss=0.3591, pruned_loss=0.1001, over 21492.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3092, pruned_loss=0.08897, over 4259531.90 frames. ], batch size: 471, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 07:58:53,037 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.64 vs. limit=10.0 2023-06-20 07:59:23,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=518754.0, ans=0.2 2023-06-20 08:00:13,845 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.261e+02 2.543e+02 3.250e+02 4.751e+02, threshold=5.087e+02, percent-clipped=0.0 2023-06-20 08:00:16,653 INFO [train.py:996] (2/4) Epoch 3, batch 25500, loss[loss=0.2576, simple_loss=0.331, pruned_loss=0.09214, over 16559.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3097, pruned_loss=0.08565, over 4241412.16 frames. ], batch size: 60, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:00:32,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=518934.0, ans=0.0 2023-06-20 08:00:48,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=518934.0, ans=0.09899494936611666 2023-06-20 08:01:32,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=519054.0, ans=0.0 2023-06-20 08:01:32,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=519054.0, ans=0.2 2023-06-20 08:01:44,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=519114.0, ans=0.2 2023-06-20 08:01:53,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519114.0, ans=0.1 2023-06-20 08:02:07,935 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-20 08:02:39,014 INFO [train.py:996] (2/4) Epoch 3, batch 25550, loss[loss=0.2394, simple_loss=0.3323, pruned_loss=0.07326, over 21719.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3171, pruned_loss=0.08618, over 4251979.58 frames. 
], batch size: 332, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:03:00,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=519294.0, ans=0.0 2023-06-20 08:03:17,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=519294.0, ans=0.125 2023-06-20 08:04:40,728 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.576e+02 2.895e+02 3.439e+02 5.948e+02, threshold=5.790e+02, percent-clipped=2.0 2023-06-20 08:04:43,682 INFO [train.py:996] (2/4) Epoch 3, batch 25600, loss[loss=0.2803, simple_loss=0.3323, pruned_loss=0.1141, over 20133.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3208, pruned_loss=0.08691, over 4256078.09 frames. ], batch size: 707, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:05:04,169 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:05:21,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=519594.0, ans=0.125 2023-06-20 08:05:26,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=519594.0, ans=0.2 2023-06-20 08:05:56,114 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:05:56,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=519654.0, ans=0.125 2023-06-20 08:06:03,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=519714.0, ans=0.0 2023-06-20 08:06:22,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=519774.0, ans=0.0 2023-06-20 08:06:33,949 INFO [train.py:996] (2/4) Epoch 3, batch 25650, loss[loss=0.2364, simple_loss=0.2936, pruned_loss=0.0896, over 21345.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3226, pruned_loss=0.09144, over 4257454.19 frames. ], batch size: 211, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:07:46,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=15.0 2023-06-20 08:08:07,370 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.861e+02 3.374e+02 3.830e+02 5.312e+02, threshold=6.747e+02, percent-clipped=0.0 2023-06-20 08:08:10,528 INFO [train.py:996] (2/4) Epoch 3, batch 25700, loss[loss=0.2383, simple_loss=0.3178, pruned_loss=0.07943, over 21885.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3197, pruned_loss=0.09222, over 4253600.43 frames. ], batch size: 316, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:08:18,568 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:08:39,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=520194.0, ans=0.0 2023-06-20 08:08:41,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. 
limit=15.0 2023-06-20 08:08:45,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=520254.0, ans=0.2 2023-06-20 08:08:59,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=520254.0, ans=0.125 2023-06-20 08:09:47,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=520374.0, ans=0.0 2023-06-20 08:10:17,143 INFO [train.py:996] (2/4) Epoch 3, batch 25750, loss[loss=0.2431, simple_loss=0.3105, pruned_loss=0.08787, over 20735.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3248, pruned_loss=0.09479, over 4259489.45 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:10:48,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=520494.0, ans=0.0 2023-06-20 08:10:49,093 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-20 08:11:45,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=520614.0, ans=0.1 2023-06-20 08:11:54,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=520614.0, ans=0.1 2023-06-20 08:12:04,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=520674.0, ans=0.0 2023-06-20 08:12:19,561 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.970e+02 3.422e+02 4.154e+02 6.514e+02, threshold=6.844e+02, percent-clipped=0.0 2023-06-20 08:12:22,612 INFO [train.py:996] (2/4) Epoch 3, batch 25800, loss[loss=0.3736, simple_loss=0.4182, pruned_loss=0.1644, over 21407.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3371, pruned_loss=0.09957, over 4261327.11 frames. ], batch size: 471, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:12:58,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-20 08:12:58,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.19 vs. limit=22.5 2023-06-20 08:13:15,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=520794.0, ans=0.125 2023-06-20 08:13:18,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=520794.0, ans=10.0 2023-06-20 08:13:48,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=520854.0, ans=15.0 2023-06-20 08:14:42,707 INFO [train.py:996] (2/4) Epoch 3, batch 25850, loss[loss=0.2751, simple_loss=0.3407, pruned_loss=0.1048, over 21866.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3383, pruned_loss=0.09952, over 4265237.50 frames. ], batch size: 414, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:14:43,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. 
limit=10.0 2023-06-20 08:15:31,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=521094.0, ans=0.0 2023-06-20 08:16:01,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=521214.0, ans=0.125 2023-06-20 08:16:38,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=521274.0, ans=0.025 2023-06-20 08:16:42,373 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.635e+02 3.168e+02 4.552e+02 6.616e+02, threshold=6.336e+02, percent-clipped=0.0 2023-06-20 08:16:45,274 INFO [train.py:996] (2/4) Epoch 3, batch 25900, loss[loss=0.2529, simple_loss=0.3318, pruned_loss=0.08702, over 21194.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3387, pruned_loss=0.0993, over 4269365.37 frames. ], batch size: 143, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:16:45,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=521334.0, ans=0.2 2023-06-20 08:17:35,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=521394.0, ans=0.125 2023-06-20 08:18:08,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=521514.0, ans=15.0 2023-06-20 08:18:14,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=521514.0, ans=0.025 2023-06-20 08:18:41,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=521574.0, ans=0.0 2023-06-20 08:18:54,002 INFO [train.py:996] (2/4) Epoch 3, batch 25950, loss[loss=0.2742, simple_loss=0.3567, pruned_loss=0.09585, over 21516.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3445, pruned_loss=0.1024, over 4265686.75 frames. ], batch size: 131, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:20:00,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-20 08:20:12,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0 2023-06-20 08:20:46,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.642e+02 3.151e+02 3.673e+02 6.319e+02, threshold=6.303e+02, percent-clipped=0.0 2023-06-20 08:20:49,273 INFO [train.py:996] (2/4) Epoch 3, batch 26000, loss[loss=0.3015, simple_loss=0.3682, pruned_loss=0.1173, over 21750.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3466, pruned_loss=0.1014, over 4258942.53 frames. ], batch size: 441, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:21:24,006 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.85 vs. 
limit=15.0 2023-06-20 08:21:33,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=522054.0, ans=0.035 2023-06-20 08:21:38,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=522054.0, ans=0.0 2023-06-20 08:22:01,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=522114.0, ans=0.0 2023-06-20 08:22:09,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-20 08:22:35,217 INFO [train.py:996] (2/4) Epoch 3, batch 26050, loss[loss=0.2805, simple_loss=0.3327, pruned_loss=0.1142, over 21366.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3459, pruned_loss=0.1018, over 4260699.81 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:23:00,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=522234.0, ans=0.2 2023-06-20 08:23:19,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=522294.0, ans=0.125 2023-06-20 08:24:03,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=522414.0, ans=0.2 2023-06-20 08:24:32,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=522474.0, ans=0.2 2023-06-20 08:24:39,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.804e+02 3.203e+02 3.918e+02 6.790e+02, threshold=6.407e+02, percent-clipped=4.0 2023-06-20 08:24:42,771 INFO [train.py:996] (2/4) Epoch 3, batch 26100, loss[loss=0.2484, simple_loss=0.2984, pruned_loss=0.09924, over 21591.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3392, pruned_loss=0.1008, over 4272359.85 frames. ], batch size: 548, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:24:51,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522534.0, ans=0.1 2023-06-20 08:25:07,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=522594.0, ans=0.2 2023-06-20 08:25:21,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=522654.0, ans=0.0 2023-06-20 08:25:43,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=522654.0, ans=0.09899494936611666 2023-06-20 08:25:52,270 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:26:45,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=522834.0, ans=0.0 2023-06-20 08:26:46,812 INFO [train.py:996] (2/4) Epoch 3, batch 26150, loss[loss=0.2688, simple_loss=0.3324, pruned_loss=0.1026, over 21666.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3354, pruned_loss=0.1008, over 4278440.84 frames. 
], batch size: 230, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:27:09,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=522894.0, ans=0.125 2023-06-20 08:27:54,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522954.0, ans=0.1 2023-06-20 08:28:15,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=523014.0, ans=0.05 2023-06-20 08:28:33,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=523074.0, ans=0.125 2023-06-20 08:28:39,693 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.726e+02 3.009e+02 3.723e+02 5.538e+02, threshold=6.017e+02, percent-clipped=0.0 2023-06-20 08:28:49,609 INFO [train.py:996] (2/4) Epoch 3, batch 26200, loss[loss=0.2399, simple_loss=0.3274, pruned_loss=0.07621, over 21286.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3368, pruned_loss=0.09891, over 4279226.66 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:29:09,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=523134.0, ans=0.1 2023-06-20 08:30:05,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.73 vs. limit=6.0 2023-06-20 08:30:22,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=523314.0, ans=0.05 2023-06-20 08:30:59,893 INFO [train.py:996] (2/4) Epoch 3, batch 26250, loss[loss=0.2505, simple_loss=0.3196, pruned_loss=0.0907, over 21431.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3408, pruned_loss=0.09785, over 4282178.07 frames. ], batch size: 211, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:31:00,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=523434.0, ans=0.0 2023-06-20 08:31:53,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=523554.0, ans=0.0 2023-06-20 08:31:55,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=523554.0, ans=0.1 2023-06-20 08:32:12,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=523554.0, ans=0.125 2023-06-20 08:32:44,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=523674.0, ans=0.0 2023-06-20 08:33:04,808 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.312e+02 2.693e+02 3.337e+02 4.034e+02 6.745e+02, threshold=6.673e+02, percent-clipped=1.0 2023-06-20 08:33:07,764 INFO [train.py:996] (2/4) Epoch 3, batch 26300, loss[loss=0.2423, simple_loss=0.297, pruned_loss=0.09374, over 20170.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3362, pruned_loss=0.09754, over 4289530.48 frames. 
], batch size: 703, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:33:24,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=523734.0, ans=0.0 2023-06-20 08:34:14,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=15.0 2023-06-20 08:34:17,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=523854.0, ans=0.2 2023-06-20 08:34:47,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=523914.0, ans=0.0 2023-06-20 08:34:47,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=523914.0, ans=0.0 2023-06-20 08:35:13,487 INFO [train.py:996] (2/4) Epoch 3, batch 26350, loss[loss=0.2683, simple_loss=0.3323, pruned_loss=0.1022, over 21788.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3334, pruned_loss=0.09756, over 4287741.31 frames. ], batch size: 332, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:37:02,443 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.700e+02 3.038e+02 3.604e+02 6.055e+02, threshold=6.077e+02, percent-clipped=0.0 2023-06-20 08:37:05,307 INFO [train.py:996] (2/4) Epoch 3, batch 26400, loss[loss=0.2231, simple_loss=0.272, pruned_loss=0.08708, over 21159.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3275, pruned_loss=0.0976, over 4283356.37 frames. ], batch size: 143, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:37:26,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=12.0 2023-06-20 08:37:35,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=524334.0, ans=0.125 2023-06-20 08:37:58,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=524454.0, ans=0.125 2023-06-20 08:39:02,890 INFO [train.py:996] (2/4) Epoch 3, batch 26450, loss[loss=0.2576, simple_loss=0.3541, pruned_loss=0.08057, over 19776.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3281, pruned_loss=0.09849, over 4273577.72 frames. ], batch size: 702, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:39:20,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=524634.0, ans=0.0 2023-06-20 08:40:27,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=524814.0, ans=0.125 2023-06-20 08:41:08,735 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.874e+02 3.456e+02 4.321e+02 8.810e+02, threshold=6.911e+02, percent-clipped=7.0 2023-06-20 08:41:11,680 INFO [train.py:996] (2/4) Epoch 3, batch 26500, loss[loss=0.1843, simple_loss=0.2391, pruned_loss=0.06474, over 21816.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3276, pruned_loss=0.09634, over 4273559.47 frames. 
], batch size: 107, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:41:31,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=524994.0, ans=0.2 2023-06-20 08:43:26,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=525174.0, ans=0.09899494936611666 2023-06-20 08:43:28,416 INFO [train.py:996] (2/4) Epoch 3, batch 26550, loss[loss=0.2099, simple_loss=0.2964, pruned_loss=0.06173, over 21667.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3247, pruned_loss=0.09302, over 4273182.00 frames. ], batch size: 298, lr: 1.02e-02, grad_scale: 64.0 2023-06-20 08:43:31,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=525234.0, ans=0.0 2023-06-20 08:44:23,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=525354.0, ans=0.125 2023-06-20 08:45:32,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-06-20 08:45:37,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=525474.0, ans=0.025 2023-06-20 08:45:39,078 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.519e+02 3.120e+02 3.991e+02 8.354e+02, threshold=6.239e+02, percent-clipped=2.0 2023-06-20 08:45:40,623 INFO [train.py:996] (2/4) Epoch 3, batch 26600, loss[loss=0.2575, simple_loss=0.3139, pruned_loss=0.1006, over 21736.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3238, pruned_loss=0.09112, over 4265396.90 frames. ], batch size: 112, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:45:50,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=525534.0, ans=0.125 2023-06-20 08:46:18,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=525594.0, ans=0.125 2023-06-20 08:46:36,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=525654.0, ans=0.125 2023-06-20 08:47:41,375 INFO [train.py:996] (2/4) Epoch 3, batch 26650, loss[loss=0.2075, simple_loss=0.2888, pruned_loss=0.06315, over 21621.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3182, pruned_loss=0.09007, over 4263575.18 frames. ], batch size: 391, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:47:59,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-20 08:48:25,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=525954.0, ans=0.125 2023-06-20 08:49:16,479 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.669e+02 2.277e+02 2.544e+02 2.972e+02 4.413e+02, threshold=5.088e+02, percent-clipped=0.0 2023-06-20 08:49:23,174 INFO [train.py:996] (2/4) Epoch 3, batch 26700, loss[loss=0.2479, simple_loss=0.3122, pruned_loss=0.09182, over 21428.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3119, pruned_loss=0.08689, over 4267083.41 frames. 
], batch size: 131, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:50:55,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=526314.0, ans=0.0 2023-06-20 08:51:36,257 INFO [train.py:996] (2/4) Epoch 3, batch 26750, loss[loss=0.2124, simple_loss=0.2916, pruned_loss=0.06663, over 21465.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3111, pruned_loss=0.08474, over 4278647.78 frames. ], batch size: 194, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:52:46,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.21 vs. limit=22.5 2023-06-20 08:52:50,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=526614.0, ans=0.09899494936611666 2023-06-20 08:52:53,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=526614.0, ans=0.125 2023-06-20 08:53:10,457 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=22.5 2023-06-20 08:53:12,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=526614.0, ans=0.0 2023-06-20 08:53:55,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.827e+02 3.440e+02 3.863e+02 5.872e+02, threshold=6.879e+02, percent-clipped=7.0 2023-06-20 08:54:02,656 INFO [train.py:996] (2/4) Epoch 3, batch 26800, loss[loss=0.3609, simple_loss=0.3981, pruned_loss=0.1619, over 21327.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3202, pruned_loss=0.0901, over 4275522.55 frames. ], batch size: 507, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:54:38,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=526854.0, ans=0.125 2023-06-20 08:55:15,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=526914.0, ans=0.125 2023-06-20 08:55:17,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=526914.0, ans=0.2 2023-06-20 08:55:22,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=526914.0, ans=0.125 2023-06-20 08:56:05,475 INFO [train.py:996] (2/4) Epoch 3, batch 26850, loss[loss=0.2354, simple_loss=0.295, pruned_loss=0.08794, over 21718.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3216, pruned_loss=0.09296, over 4279782.01 frames. 
], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:56:35,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=527094.0, ans=0.2 2023-06-20 08:56:39,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=527154.0, ans=0.2 2023-06-20 08:57:36,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=527274.0, ans=0.125 2023-06-20 08:57:39,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.442e+02 3.000e+02 3.641e+02 8.761e+02, threshold=6.000e+02, percent-clipped=1.0 2023-06-20 08:57:40,680 INFO [train.py:996] (2/4) Epoch 3, batch 26900, loss[loss=0.2441, simple_loss=0.2977, pruned_loss=0.09525, over 21713.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3134, pruned_loss=0.09226, over 4267677.49 frames. ], batch size: 124, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:58:26,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=527394.0, ans=0.0 2023-06-20 08:58:29,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=527394.0, ans=0.025 2023-06-20 08:58:34,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=527454.0, ans=0.125 2023-06-20 08:59:31,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-06-20 08:59:45,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-20 08:59:47,429 INFO [train.py:996] (2/4) Epoch 3, batch 26950, loss[loss=0.2728, simple_loss=0.3609, pruned_loss=0.0924, over 21773.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3138, pruned_loss=0.09264, over 4272324.96 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 09:00:15,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=527694.0, ans=0.0 2023-06-20 09:01:34,619 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-20 09:01:39,290 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.583e+02 3.256e+02 3.814e+02 7.772e+02, threshold=6.512e+02, percent-clipped=3.0 2023-06-20 09:01:52,871 INFO [train.py:996] (2/4) Epoch 3, batch 27000, loss[loss=0.2008, simple_loss=0.2883, pruned_loss=0.05665, over 21779.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3128, pruned_loss=0.08973, over 4260851.33 frames. ], batch size: 282, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 09:01:52,872 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 09:02:49,148 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2585, simple_loss=0.355, pruned_loss=0.081, over 1796401.00 frames. 2023-06-20 09:02:49,151 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 09:02:56,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.76 vs. 
limit=15.0 2023-06-20 09:03:09,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=527994.0, ans=0.125 2023-06-20 09:03:35,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=528054.0, ans=0.0 2023-06-20 09:03:47,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=528114.0, ans=0.125 2023-06-20 09:04:15,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=528174.0, ans=0.125 2023-06-20 09:04:29,069 INFO [train.py:996] (2/4) Epoch 3, batch 27050, loss[loss=0.2546, simple_loss=0.3225, pruned_loss=0.09333, over 21336.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3154, pruned_loss=0.08641, over 4272102.78 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 16.0 2023-06-20 09:04:31,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=528234.0, ans=0.0 2023-06-20 09:04:42,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=528294.0, ans=0.0 2023-06-20 09:06:17,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=528474.0, ans=0.0 2023-06-20 09:06:31,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.379e+02 2.798e+02 3.211e+02 4.498e+02, threshold=5.597e+02, percent-clipped=0.0 2023-06-20 09:06:31,062 INFO [train.py:996] (2/4) Epoch 3, batch 27100, loss[loss=0.2683, simple_loss=0.3765, pruned_loss=0.08003, over 19817.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3181, pruned_loss=0.08824, over 4283252.22 frames. ], batch size: 702, lr: 1.02e-02, grad_scale: 16.0 2023-06-20 09:06:31,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=528534.0, ans=0.125 2023-06-20 09:07:49,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=528654.0, ans=0.0 2023-06-20 09:08:32,573 INFO [train.py:996] (2/4) Epoch 3, batch 27150, loss[loss=0.34, simple_loss=0.4146, pruned_loss=0.1327, over 21694.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3298, pruned_loss=0.0916, over 4283939.15 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 09:09:54,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=529014.0, ans=0.125 2023-06-20 09:10:00,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=529014.0, ans=0.04949747468305833 2023-06-20 09:10:34,631 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:10:50,518 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.751e+02 3.147e+02 3.770e+02 6.870e+02, threshold=6.294e+02, percent-clipped=5.0 2023-06-20 09:10:50,541 INFO [train.py:996] (2/4) Epoch 3, batch 27200, loss[loss=0.283, simple_loss=0.348, pruned_loss=0.1091, over 21742.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3382, pruned_loss=0.09459, over 4277704.49 frames. 
], batch size: 247, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:10:51,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-20 09:11:13,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-20 09:12:13,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=529314.0, ans=0.0 2023-06-20 09:12:44,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=529374.0, ans=0.125 2023-06-20 09:12:57,438 INFO [train.py:996] (2/4) Epoch 3, batch 27250, loss[loss=0.2744, simple_loss=0.3477, pruned_loss=0.1006, over 21566.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.341, pruned_loss=0.09918, over 4275823.36 frames. ], batch size: 389, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:13:38,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=529554.0, ans=0.2 2023-06-20 09:14:59,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=529674.0, ans=0.95 2023-06-20 09:15:03,696 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 2.889e+02 3.267e+02 4.171e+02 5.993e+02, threshold=6.535e+02, percent-clipped=0.0 2023-06-20 09:15:03,719 INFO [train.py:996] (2/4) Epoch 3, batch 27300, loss[loss=0.3136, simple_loss=0.3841, pruned_loss=0.1216, over 21740.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3427, pruned_loss=0.1007, over 4276717.44 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:15:30,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=529794.0, ans=0.035 2023-06-20 09:16:09,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=529854.0, ans=0.125 2023-06-20 09:16:16,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529854.0, ans=0.1 2023-06-20 09:17:22,795 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-06-20 09:17:26,158 INFO [train.py:996] (2/4) Epoch 3, batch 27350, loss[loss=0.2471, simple_loss=0.324, pruned_loss=0.08509, over 21796.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3447, pruned_loss=0.1018, over 4279399.60 frames. ], batch size: 298, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:17:51,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530094.0, ans=0.1 2023-06-20 09:18:39,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. 
limit=6.0 2023-06-20 09:18:54,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=530214.0, ans=0.125 2023-06-20 09:19:17,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530274.0, ans=0.1 2023-06-20 09:19:31,132 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.578e+02 2.892e+02 3.255e+02 4.289e+02, threshold=5.785e+02, percent-clipped=0.0 2023-06-20 09:19:31,155 INFO [train.py:996] (2/4) Epoch 3, batch 27400, loss[loss=0.2372, simple_loss=0.2955, pruned_loss=0.08947, over 21186.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3393, pruned_loss=0.1005, over 4284864.79 frames. ], batch size: 608, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:20:18,387 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:21:09,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=530574.0, ans=0.0 2023-06-20 09:21:20,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=530574.0, ans=0.04949747468305833 2023-06-20 09:21:32,887 INFO [train.py:996] (2/4) Epoch 3, batch 27450, loss[loss=0.2439, simple_loss=0.3257, pruned_loss=0.08104, over 21670.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3329, pruned_loss=0.09781, over 4283200.63 frames. ], batch size: 247, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:21:40,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530634.0, ans=0.1 2023-06-20 09:21:51,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=530634.0, ans=0.125 2023-06-20 09:23:15,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=530874.0, ans=0.0 2023-06-20 09:23:33,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.511e+02 2.901e+02 3.384e+02 5.453e+02, threshold=5.802e+02, percent-clipped=0.0 2023-06-20 09:23:33,265 INFO [train.py:996] (2/4) Epoch 3, batch 27500, loss[loss=0.2655, simple_loss=0.3259, pruned_loss=0.1025, over 21886.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3327, pruned_loss=0.09888, over 4283614.62 frames. ], batch size: 351, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:24:07,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=530994.0, ans=0.125 2023-06-20 09:24:17,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=531054.0, ans=0.95 2023-06-20 09:24:20,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=531054.0, ans=0.125 2023-06-20 09:25:24,436 INFO [train.py:996] (2/4) Epoch 3, batch 27550, loss[loss=0.2222, simple_loss=0.2832, pruned_loss=0.0806, over 21659.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3267, pruned_loss=0.09448, over 4293866.86 frames. 
], batch size: 298, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:25:24,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=531234.0, ans=0.125 2023-06-20 09:26:02,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531294.0, ans=0.1 2023-06-20 09:26:19,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531354.0, ans=0.1 2023-06-20 09:26:28,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=531354.0, ans=0.125 2023-06-20 09:27:16,436 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 2.453e+02 3.210e+02 4.120e+02 6.200e+02, threshold=6.421e+02, percent-clipped=4.0 2023-06-20 09:27:16,464 INFO [train.py:996] (2/4) Epoch 3, batch 27600, loss[loss=0.2555, simple_loss=0.3112, pruned_loss=0.09988, over 22005.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3206, pruned_loss=0.09339, over 4289151.66 frames. ], batch size: 103, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:27:16,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=531534.0, ans=0.125 2023-06-20 09:27:30,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=531534.0, ans=0.125 2023-06-20 09:27:31,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.45 vs. limit=22.5 2023-06-20 09:27:51,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=15.0 2023-06-20 09:29:00,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-20 09:29:01,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=12.0 2023-06-20 09:29:12,395 INFO [train.py:996] (2/4) Epoch 3, batch 27650, loss[loss=0.2567, simple_loss=0.3106, pruned_loss=0.1014, over 21605.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3161, pruned_loss=0.09339, over 4277979.37 frames. ], batch size: 389, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:29:17,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.19 vs. limit=8.0 2023-06-20 09:29:21,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-20 09:30:13,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-20 09:30:55,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. 
limit=22.5 2023-06-20 09:31:10,932 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.456e+02 3.066e+02 3.957e+02 5.583e+02, threshold=6.132e+02, percent-clipped=0.0 2023-06-20 09:31:10,955 INFO [train.py:996] (2/4) Epoch 3, batch 27700, loss[loss=0.2044, simple_loss=0.2815, pruned_loss=0.06362, over 21440.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3164, pruned_loss=0.09149, over 4277432.28 frames. ], batch size: 194, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:31:38,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=532194.0, ans=0.125 2023-06-20 09:31:41,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=532194.0, ans=0.2 2023-06-20 09:31:44,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532194.0, ans=0.1 2023-06-20 09:31:45,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=532194.0, ans=0.0 2023-06-20 09:31:56,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=532194.0, ans=0.2 2023-06-20 09:33:04,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=532374.0, ans=0.035 2023-06-20 09:33:05,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=532374.0, ans=0.125 2023-06-20 09:33:20,141 INFO [train.py:996] (2/4) Epoch 3, batch 27750, loss[loss=0.2124, simple_loss=0.3143, pruned_loss=0.05531, over 20809.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3191, pruned_loss=0.09102, over 4272065.37 frames. ], batch size: 608, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:33:47,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=532494.0, ans=0.125 2023-06-20 09:33:52,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=532494.0, ans=0.125 2023-06-20 09:33:58,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=532494.0, ans=0.125 2023-06-20 09:34:08,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=532494.0, ans=0.125 2023-06-20 09:35:06,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=532674.0, ans=0.0 2023-06-20 09:35:16,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.578e+02 3.101e+02 3.636e+02 6.452e+02, threshold=6.201e+02, percent-clipped=3.0 2023-06-20 09:35:17,004 INFO [train.py:996] (2/4) Epoch 3, batch 27800, loss[loss=0.2435, simple_loss=0.3013, pruned_loss=0.09286, over 21465.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3174, pruned_loss=0.0912, over 4278522.40 frames. 
], batch size: 159, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:36:09,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=532854.0, ans=0.2 2023-06-20 09:36:46,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=532914.0, ans=0.0 2023-06-20 09:36:49,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-20 09:37:27,378 INFO [train.py:996] (2/4) Epoch 3, batch 27850, loss[loss=0.2358, simple_loss=0.3024, pruned_loss=0.0846, over 21965.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3162, pruned_loss=0.09216, over 4290354.10 frames. ], batch size: 333, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:38:12,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-20 09:38:54,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=533154.0, ans=0.125 2023-06-20 09:39:20,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=533274.0, ans=0.2 2023-06-20 09:39:41,746 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.579e+02 2.899e+02 3.562e+02 7.537e+02, threshold=5.798e+02, percent-clipped=1.0 2023-06-20 09:39:41,769 INFO [train.py:996] (2/4) Epoch 3, batch 27900, loss[loss=0.2963, simple_loss=0.3832, pruned_loss=0.1047, over 21806.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.327, pruned_loss=0.09383, over 4288878.80 frames. ], batch size: 316, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:40:17,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=533394.0, ans=0.035 2023-06-20 09:41:10,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=533514.0, ans=0.125 2023-06-20 09:41:14,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=533514.0, ans=0.025 2023-06-20 09:41:47,950 INFO [train.py:996] (2/4) Epoch 3, batch 27950, loss[loss=0.3212, simple_loss=0.3882, pruned_loss=0.1271, over 21488.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3252, pruned_loss=0.08915, over 4289371.16 frames. 
], batch size: 471, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:42:01,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=533634.0, ans=0.125 2023-06-20 09:42:28,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=533694.0, ans=10.0 2023-06-20 09:43:00,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=533814.0, ans=0.0 2023-06-20 09:43:45,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=533874.0, ans=0.125 2023-06-20 09:43:54,334 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.301e+02 2.628e+02 3.233e+02 4.917e+02, threshold=5.255e+02, percent-clipped=0.0 2023-06-20 09:43:54,358 INFO [train.py:996] (2/4) Epoch 3, batch 28000, loss[loss=0.2822, simple_loss=0.3375, pruned_loss=0.1134, over 21882.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3226, pruned_loss=0.08716, over 4286811.86 frames. ], batch size: 107, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:43:56,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=533934.0, ans=0.0 2023-06-20 09:44:07,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-20 09:44:34,769 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:45:05,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=534114.0, ans=0.125 2023-06-20 09:45:09,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=534114.0, ans=0.125 2023-06-20 09:46:01,518 INFO [train.py:996] (2/4) Epoch 3, batch 28050, loss[loss=0.2419, simple_loss=0.2941, pruned_loss=0.09492, over 21181.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3203, pruned_loss=0.08889, over 4286305.88 frames. ], batch size: 607, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:46:39,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=534294.0, ans=0.125 2023-06-20 09:46:39,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=534294.0, ans=0.125 2023-06-20 09:46:42,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534294.0, ans=0.0 2023-06-20 09:47:46,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=534474.0, ans=0.0 2023-06-20 09:48:07,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.725e+02 3.043e+02 3.696e+02 6.736e+02, threshold=6.086e+02, percent-clipped=4.0 2023-06-20 09:48:07,130 INFO [train.py:996] (2/4) Epoch 3, batch 28100, loss[loss=0.2126, simple_loss=0.2655, pruned_loss=0.07987, over 21424.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3173, pruned_loss=0.08852, over 4284466.75 frames. 
], batch size: 212, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:48:07,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=534534.0, ans=0.125 2023-06-20 09:48:10,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=534534.0, ans=0.035 2023-06-20 09:48:32,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=534594.0, ans=0.1 2023-06-20 09:48:38,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534594.0, ans=0.0 2023-06-20 09:48:41,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=534594.0, ans=0.1 2023-06-20 09:48:50,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=534654.0, ans=0.125 2023-06-20 09:49:11,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=534714.0, ans=0.0 2023-06-20 09:49:34,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534774.0, ans=0.0 2023-06-20 09:49:51,904 INFO [train.py:996] (2/4) Epoch 3, batch 28150, loss[loss=0.2169, simple_loss=0.2807, pruned_loss=0.07654, over 21674.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3112, pruned_loss=0.08911, over 4281599.04 frames. ], batch size: 333, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:50:36,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=534894.0, ans=0.2 2023-06-20 09:51:52,816 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.680e+02 3.048e+02 3.566e+02 6.007e+02, threshold=6.096e+02, percent-clipped=0.0 2023-06-20 09:51:52,852 INFO [train.py:996] (2/4) Epoch 3, batch 28200, loss[loss=0.2934, simple_loss=0.3466, pruned_loss=0.1202, over 21573.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3102, pruned_loss=0.09076, over 4273342.36 frames. ], batch size: 389, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:52:04,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535134.0, ans=0.1 2023-06-20 09:52:28,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=535194.0, ans=0.125 2023-06-20 09:52:48,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-20 09:53:37,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=535374.0, ans=0.125 2023-06-20 09:54:03,160 INFO [train.py:996] (2/4) Epoch 3, batch 28250, loss[loss=0.2227, simple_loss=0.2828, pruned_loss=0.0813, over 21392.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3135, pruned_loss=0.09361, over 4278491.03 frames. ], batch size: 211, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:55:07,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. 
limit=6.0 2023-06-20 09:55:48,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-20 09:55:52,595 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.642e+02 2.920e+02 3.400e+02 5.282e+02, threshold=5.841e+02, percent-clipped=0.0 2023-06-20 09:55:52,617 INFO [train.py:996] (2/4) Epoch 3, batch 28300, loss[loss=0.18, simple_loss=0.2637, pruned_loss=0.04818, over 21387.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3108, pruned_loss=0.09084, over 4276423.59 frames. ], batch size: 211, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:57:45,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-20 09:57:55,377 INFO [train.py:996] (2/4) Epoch 3, batch 28350, loss[loss=0.2507, simple_loss=0.307, pruned_loss=0.09722, over 21304.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3073, pruned_loss=0.08443, over 4267145.93 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:59:18,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.46 vs. limit=15.0 2023-06-20 09:59:57,217 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.442e+02 2.906e+02 3.756e+02 6.474e+02, threshold=5.811e+02, percent-clipped=1.0 2023-06-20 09:59:57,242 INFO [train.py:996] (2/4) Epoch 3, batch 28400, loss[loss=0.243, simple_loss=0.3028, pruned_loss=0.09156, over 21458.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3044, pruned_loss=0.0839, over 4258312.20 frames. ], batch size: 389, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:00:03,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=536334.0, ans=0.125 2023-06-20 10:01:26,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5 2023-06-20 10:01:54,089 INFO [train.py:996] (2/4) Epoch 3, batch 28450, loss[loss=0.2475, simple_loss=0.309, pruned_loss=0.09306, over 21612.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3103, pruned_loss=0.08908, over 4265036.88 frames. ], batch size: 263, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:03:22,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=536814.0, ans=0.2 2023-06-20 10:03:22,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=536814.0, ans=0.125 2023-06-20 10:03:27,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=536814.0, ans=0.0 2023-06-20 10:04:20,982 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.862e+02 3.602e+02 4.332e+02 6.960e+02, threshold=7.204e+02, percent-clipped=5.0 2023-06-20 10:04:21,006 INFO [train.py:996] (2/4) Epoch 3, batch 28500, loss[loss=0.2889, simple_loss=0.3502, pruned_loss=0.1138, over 21893.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3135, pruned_loss=0.09132, over 4274185.86 frames. 
], batch size: 371, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:04:37,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=536994.0, ans=0.0 2023-06-20 10:04:50,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=536994.0, ans=0.2 2023-06-20 10:05:07,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=537054.0, ans=0.0 2023-06-20 10:05:26,233 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.61 vs. limit=10.0 2023-06-20 10:06:02,415 INFO [train.py:996] (2/4) Epoch 3, batch 28550, loss[loss=0.2559, simple_loss=0.318, pruned_loss=0.09692, over 19954.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3209, pruned_loss=0.09347, over 4275469.43 frames. ], batch size: 702, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:06:04,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=537234.0, ans=0.125 2023-06-20 10:06:40,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.63 vs. limit=10.0 2023-06-20 10:07:57,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=537474.0, ans=0.125 2023-06-20 10:08:03,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=537474.0, ans=0.125 2023-06-20 10:08:13,359 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.757e+02 3.378e+02 4.291e+02 7.271e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-20 10:08:13,384 INFO [train.py:996] (2/4) Epoch 3, batch 28600, loss[loss=0.257, simple_loss=0.328, pruned_loss=0.09307, over 21614.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3295, pruned_loss=0.09698, over 4281556.06 frames. ], batch size: 230, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:08:41,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=537594.0, ans=0.1 2023-06-20 10:08:55,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=537594.0, ans=0.0 2023-06-20 10:09:42,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=537714.0, ans=0.0 2023-06-20 10:10:06,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=537834.0, ans=0.125 2023-06-20 10:10:12,635 INFO [train.py:996] (2/4) Epoch 3, batch 28650, loss[loss=0.2327, simple_loss=0.2874, pruned_loss=0.08898, over 21654.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3238, pruned_loss=0.09599, over 4280728.61 frames. ], batch size: 333, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 10:10:15,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.19 vs. 
limit=15.0 2023-06-20 10:11:03,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=537954.0, ans=0.0 2023-06-20 10:12:09,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=538134.0, ans=0.125 2023-06-20 10:12:15,810 INFO [train.py:996] (2/4) Epoch 3, batch 28700, loss[loss=0.2588, simple_loss=0.3236, pruned_loss=0.09698, over 21495.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3238, pruned_loss=0.09706, over 4273253.40 frames. ], batch size: 194, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 10:12:17,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.709e+02 3.341e+02 4.150e+02 7.060e+02, threshold=6.681e+02, percent-clipped=1.0 2023-06-20 10:12:55,540 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-20 10:13:07,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-20 10:13:13,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-20 10:13:26,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=538314.0, ans=0.125 2023-06-20 10:14:06,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=538374.0, ans=0.0 2023-06-20 10:14:19,921 INFO [train.py:996] (2/4) Epoch 3, batch 28750, loss[loss=0.256, simple_loss=0.3202, pruned_loss=0.09592, over 21915.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3238, pruned_loss=0.09705, over 4271863.24 frames. ], batch size: 333, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 10:14:23,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538434.0, ans=0.1 2023-06-20 10:14:33,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=538434.0, ans=0.5 2023-06-20 10:14:38,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538434.0, ans=0.1 2023-06-20 10:14:44,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=538494.0, ans=0.2 2023-06-20 10:15:04,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538554.0, ans=0.1 2023-06-20 10:15:20,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=538554.0, ans=0.125 2023-06-20 10:15:41,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=538614.0, ans=0.2 2023-06-20 10:16:11,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=538674.0, ans=0.125 2023-06-20 10:16:18,039 INFO [train.py:996] (2/4) Epoch 3, batch 28800, loss[loss=0.29, simple_loss=0.3609, pruned_loss=0.1096, over 21763.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3273, pruned_loss=0.09777, over 4276795.33 frames. 
], batch size: 124, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:16:25,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.544e+02 3.077e+02 3.520e+02 7.771e+02, threshold=6.153e+02, percent-clipped=2.0 2023-06-20 10:16:27,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=538734.0, ans=0.07 2023-06-20 10:16:27,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=538734.0, ans=0.0 2023-06-20 10:16:30,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=538734.0, ans=0.05 2023-06-20 10:17:01,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=538794.0, ans=0.2 2023-06-20 10:17:02,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=538794.0, ans=0.125 2023-06-20 10:17:32,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=538854.0, ans=0.125 2023-06-20 10:17:44,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-20 10:17:45,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=538914.0, ans=0.0 2023-06-20 10:18:25,602 INFO [train.py:996] (2/4) Epoch 3, batch 28850, loss[loss=0.3112, simple_loss=0.3478, pruned_loss=0.1373, over 21738.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3294, pruned_loss=0.09954, over 4283681.33 frames. ], batch size: 508, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:18:35,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=539034.0, ans=0.5 2023-06-20 10:20:20,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=539274.0, ans=0.0 2023-06-20 10:20:23,107 INFO [train.py:996] (2/4) Epoch 3, batch 28900, loss[loss=0.2771, simple_loss=0.3504, pruned_loss=0.1019, over 21868.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3328, pruned_loss=0.1014, over 4283831.41 frames. ], batch size: 371, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:20:24,597 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.799e+02 3.210e+02 3.937e+02 8.118e+02, threshold=6.420e+02, percent-clipped=2.0 2023-06-20 10:20:37,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=539394.0, ans=0.125 2023-06-20 10:20:50,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=539394.0, ans=0.1 2023-06-20 10:21:57,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=539574.0, ans=0.2 2023-06-20 10:22:08,534 INFO [train.py:996] (2/4) Epoch 3, batch 28950, loss[loss=0.2796, simple_loss=0.3716, pruned_loss=0.09383, over 21673.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3321, pruned_loss=0.1006, over 4277260.80 frames. 
], batch size: 414, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:22:36,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=539634.0, ans=0.2 2023-06-20 10:22:38,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-20 10:22:40,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5 2023-06-20 10:23:29,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=539754.0, ans=0.125 2023-06-20 10:24:29,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=539934.0, ans=0.125 2023-06-20 10:24:30,734 INFO [train.py:996] (2/4) Epoch 3, batch 29000, loss[loss=0.2687, simple_loss=0.3282, pruned_loss=0.1046, over 21474.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3329, pruned_loss=0.09883, over 4271320.56 frames. ], batch size: 211, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:24:32,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.768e+02 3.396e+02 4.275e+02 6.208e+02, threshold=6.793e+02, percent-clipped=0.0 2023-06-20 10:24:35,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=539934.0, ans=0.0 2023-06-20 10:24:36,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=539934.0, ans=0.0 2023-06-20 10:25:20,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=540054.0, ans=0.125 2023-06-20 10:25:22,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=22.5 2023-06-20 10:25:40,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-20 10:25:49,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=540114.0, ans=0.0 2023-06-20 10:26:16,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=540174.0, ans=0.125 2023-06-20 10:26:38,120 INFO [train.py:996] (2/4) Epoch 3, batch 29050, loss[loss=0.2579, simple_loss=0.3137, pruned_loss=0.1011, over 21334.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3325, pruned_loss=0.09857, over 4268434.12 frames. ], batch size: 159, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:26:55,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-20 10:27:14,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=540294.0, ans=0.05 2023-06-20 10:28:27,054 INFO [train.py:996] (2/4) Epoch 3, batch 29100, loss[loss=0.2262, simple_loss=0.2854, pruned_loss=0.08351, over 21835.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3249, pruned_loss=0.09718, over 4273623.20 frames. 
], batch size: 372, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:28:34,010 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.763e+02 3.093e+02 3.779e+02 6.198e+02, threshold=6.186e+02, percent-clipped=0.0 2023-06-20 10:29:42,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=540774.0, ans=0.1 2023-06-20 10:30:09,099 INFO [train.py:996] (2/4) Epoch 3, batch 29150, loss[loss=0.2391, simple_loss=0.2996, pruned_loss=0.08926, over 21966.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.324, pruned_loss=0.09553, over 4273356.24 frames. ], batch size: 103, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:31:05,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=540954.0, ans=0.125 2023-06-20 10:31:11,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=541014.0, ans=0.125 2023-06-20 10:31:26,171 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:31:49,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=541074.0, ans=0.125 2023-06-20 10:31:51,688 INFO [train.py:996] (2/4) Epoch 3, batch 29200, loss[loss=0.2306, simple_loss=0.3056, pruned_loss=0.07785, over 21731.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3193, pruned_loss=0.09406, over 4267646.91 frames. ], batch size: 333, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:31:52,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=541134.0, ans=0.0 2023-06-20 10:31:53,131 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.609e+02 3.171e+02 4.055e+02 6.216e+02, threshold=6.341e+02, percent-clipped=1.0 2023-06-20 10:32:34,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=541194.0, ans=0.0 2023-06-20 10:33:33,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=541374.0, ans=0.2 2023-06-20 10:33:42,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=22.5 2023-06-20 10:34:03,918 INFO [train.py:996] (2/4) Epoch 3, batch 29250, loss[loss=0.2076, simple_loss=0.2858, pruned_loss=0.06475, over 21466.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3152, pruned_loss=0.0906, over 4255559.88 frames. ], batch size: 212, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:35:21,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=541614.0, ans=0.2 2023-06-20 10:35:47,774 INFO [train.py:996] (2/4) Epoch 3, batch 29300, loss[loss=0.2219, simple_loss=0.2899, pruned_loss=0.077, over 21787.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3153, pruned_loss=0.08933, over 4257087.38 frames. 
], batch size: 118, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:35:48,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=541734.0, ans=0.125 2023-06-20 10:36:05,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.416e+02 2.645e+02 3.207e+02 5.648e+02, threshold=5.289e+02, percent-clipped=0.0 2023-06-20 10:36:35,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-20 10:36:45,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=541854.0, ans=10.0 2023-06-20 10:37:20,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.76 vs. limit=15.0 2023-06-20 10:37:50,507 INFO [train.py:996] (2/4) Epoch 3, batch 29350, loss[loss=0.2228, simple_loss=0.3017, pruned_loss=0.07195, over 21230.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.313, pruned_loss=0.08921, over 4259959.06 frames. ], batch size: 176, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:38:09,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=542034.0, ans=0.0 2023-06-20 10:38:09,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=542034.0, ans=0.125 2023-06-20 10:38:33,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=542094.0, ans=0.2 2023-06-20 10:38:37,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=542154.0, ans=0.125 2023-06-20 10:39:12,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=542214.0, ans=0.125 2023-06-20 10:39:20,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=542214.0, ans=0.125 2023-06-20 10:40:03,518 INFO [train.py:996] (2/4) Epoch 3, batch 29400, loss[loss=0.2945, simple_loss=0.3601, pruned_loss=0.1144, over 21484.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3118, pruned_loss=0.08698, over 4259070.97 frames. ], batch size: 509, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:40:04,938 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.619e+02 2.949e+02 3.526e+02 5.601e+02, threshold=5.897e+02, percent-clipped=1.0 2023-06-20 10:40:37,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.95 vs. limit=6.0 2023-06-20 10:40:52,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=542454.0, ans=0.125 2023-06-20 10:40:56,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. 
limit=15.0 2023-06-20 10:40:58,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=542514.0, ans=0.125 2023-06-20 10:41:46,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=542574.0, ans=0.0 2023-06-20 10:42:05,237 INFO [train.py:996] (2/4) Epoch 3, batch 29450, loss[loss=0.2633, simple_loss=0.332, pruned_loss=0.09727, over 21608.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3126, pruned_loss=0.08683, over 4258128.80 frames. ], batch size: 263, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:43:08,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542754.0, ans=0.1 2023-06-20 10:43:21,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-20 10:43:35,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-20 10:43:36,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=542874.0, ans=0.04949747468305833 2023-06-20 10:43:55,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=22.5 2023-06-20 10:43:55,593 INFO [train.py:996] (2/4) Epoch 3, batch 29500, loss[loss=0.2715, simple_loss=0.3306, pruned_loss=0.1062, over 21362.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3166, pruned_loss=0.09085, over 4263470.71 frames. ], batch size: 143, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:43:57,013 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.681e+02 3.088e+02 3.658e+02 6.266e+02, threshold=6.176e+02, percent-clipped=1.0 2023-06-20 10:44:04,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=542934.0, ans=0.0 2023-06-20 10:44:20,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=10.0 2023-06-20 10:44:31,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.66 vs. limit=8.0 2023-06-20 10:45:23,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=543114.0, ans=0.0 2023-06-20 10:46:06,819 INFO [train.py:996] (2/4) Epoch 3, batch 29550, loss[loss=0.246, simple_loss=0.3069, pruned_loss=0.09257, over 21676.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3173, pruned_loss=0.0928, over 4271078.17 frames. ], batch size: 263, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:46:21,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=543234.0, ans=0.04949747468305833 2023-06-20 10:46:47,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.75 vs. 
limit=15.0 2023-06-20 10:47:55,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=543414.0, ans=0.125 2023-06-20 10:48:26,407 INFO [train.py:996] (2/4) Epoch 3, batch 29600, loss[loss=0.2674, simple_loss=0.344, pruned_loss=0.09535, over 21423.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3249, pruned_loss=0.09614, over 4281760.81 frames. ], batch size: 211, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:48:27,807 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 3.045e+02 3.811e+02 4.571e+02 9.006e+02, threshold=7.623e+02, percent-clipped=4.0 2023-06-20 10:48:51,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-20 10:49:06,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=543594.0, ans=0.2 2023-06-20 10:49:20,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=543654.0, ans=0.0 2023-06-20 10:49:23,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=543654.0, ans=0.0 2023-06-20 10:49:25,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=543654.0, ans=0.1 2023-06-20 10:49:47,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=543714.0, ans=0.125 2023-06-20 10:49:57,116 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:50:34,289 INFO [train.py:996] (2/4) Epoch 3, batch 29650, loss[loss=0.2577, simple_loss=0.3337, pruned_loss=0.09084, over 20041.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.319, pruned_loss=0.09108, over 4279074.17 frames. ], batch size: 702, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:51:48,908 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:52:09,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=544074.0, ans=0.125 2023-06-20 10:52:11,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-20 10:52:17,992 INFO [train.py:996] (2/4) Epoch 3, batch 29700, loss[loss=0.2534, simple_loss=0.322, pruned_loss=0.09242, over 21776.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.32, pruned_loss=0.09145, over 4284797.33 frames. ], batch size: 112, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:52:19,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.285e+02 2.545e+02 2.906e+02 5.391e+02, threshold=5.090e+02, percent-clipped=0.0 2023-06-20 10:53:02,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=544254.0, ans=0.125 2023-06-20 10:53:36,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=544314.0, ans=0.0 2023-06-20 10:54:13,212 INFO [train.py:996] (2/4) Epoch 3, batch 29750, loss[loss=0.2462, simple_loss=0.3306, pruned_loss=0.08091, over 21319.00 frames. 
], tot_loss[loss=0.2526, simple_loss=0.3236, pruned_loss=0.09081, over 4288544.60 frames. ], batch size: 176, lr: 1.00e-02, grad_scale: 16.0 2023-06-20 10:55:18,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=544554.0, ans=0.125 2023-06-20 10:55:49,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=544614.0, ans=0.125 2023-06-20 10:56:02,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=544674.0, ans=0.2 2023-06-20 10:56:13,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544674.0, ans=0.1 2023-06-20 10:56:17,061 INFO [train.py:996] (2/4) Epoch 3, batch 29800, loss[loss=0.2262, simple_loss=0.2949, pruned_loss=0.07872, over 21682.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3269, pruned_loss=0.09204, over 4292361.92 frames. ], batch size: 263, lr: 1.00e-02, grad_scale: 16.0 2023-06-20 10:56:28,610 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.744e+02 3.274e+02 3.878e+02 6.407e+02, threshold=6.548e+02, percent-clipped=7.0 2023-06-20 10:56:29,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=544734.0, ans=0.125 2023-06-20 10:56:41,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=544734.0, ans=0.0 2023-06-20 10:56:44,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=544794.0, ans=0.125 2023-06-20 10:56:53,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=544794.0, ans=0.0 2023-06-20 10:56:57,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=544794.0, ans=0.2 2023-06-20 10:57:02,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-20 10:57:19,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544854.0, ans=0.1 2023-06-20 10:57:19,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=544854.0, ans=0.0 2023-06-20 10:57:41,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-20 10:57:42,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=544914.0, ans=0.125 2023-06-20 10:57:55,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=544974.0, ans=0.125 2023-06-20 10:58:16,255 INFO [train.py:996] (2/4) Epoch 3, batch 29850, loss[loss=0.1955, simple_loss=0.2793, pruned_loss=0.05584, over 21675.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3241, pruned_loss=0.08995, over 4287604.23 frames. 
], batch size: 247, lr: 1.00e-02, grad_scale: 16.0 2023-06-20 10:58:41,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=545034.0, ans=0.1 2023-06-20 10:59:22,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-20 11:00:15,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=545274.0, ans=0.2 2023-06-20 11:00:18,349 INFO [train.py:996] (2/4) Epoch 3, batch 29900, loss[loss=0.2525, simple_loss=0.3172, pruned_loss=0.09391, over 21361.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3224, pruned_loss=0.09169, over 4294090.55 frames. ], batch size: 131, lr: 1.00e-02, grad_scale: 16.0 2023-06-20 11:00:21,258 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.554e+02 2.921e+02 3.179e+02 4.891e+02, threshold=5.842e+02, percent-clipped=0.0 2023-06-20 11:01:55,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=545514.0, ans=0.125 2023-06-20 11:02:07,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=545574.0, ans=0.035 2023-06-20 11:02:27,195 INFO [train.py:996] (2/4) Epoch 3, batch 29950, loss[loss=0.2497, simple_loss=0.3025, pruned_loss=0.09848, over 20949.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3259, pruned_loss=0.09518, over 4287595.90 frames. ], batch size: 607, lr: 9.99e-03, grad_scale: 16.0 2023-06-20 11:04:02,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=545814.0, ans=0.0 2023-06-20 11:04:32,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=545874.0, ans=0.125 2023-06-20 11:04:34,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=545874.0, ans=0.125 2023-06-20 11:04:43,783 INFO [train.py:996] (2/4) Epoch 3, batch 30000, loss[loss=0.2144, simple_loss=0.3054, pruned_loss=0.0617, over 21660.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.328, pruned_loss=0.09493, over 4285615.30 frames. ], batch size: 263, lr: 9.99e-03, grad_scale: 32.0 2023-06-20 11:04:43,783 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 11:05:44,596 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2515, simple_loss=0.3537, pruned_loss=0.07464, over 1796401.00 frames. 
2023-06-20 11:05:44,597 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 11:05:47,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.773e+02 3.132e+02 3.473e+02 5.556e+02, threshold=6.264e+02, percent-clipped=0.0 2023-06-20 11:06:25,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=546054.0, ans=0.1 2023-06-20 11:06:27,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=546054.0, ans=0.0 2023-06-20 11:06:32,188 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:06:48,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=546114.0, ans=0.125 2023-06-20 11:06:59,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-20 11:07:01,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=546114.0, ans=0.1 2023-06-20 11:07:46,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=546234.0, ans=0.0 2023-06-20 11:07:51,370 INFO [train.py:996] (2/4) Epoch 3, batch 30050, loss[loss=0.2351, simple_loss=0.2966, pruned_loss=0.08679, over 21067.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3313, pruned_loss=0.09178, over 4284352.63 frames. ], batch size: 143, lr: 9.99e-03, grad_scale: 32.0 2023-06-20 11:08:06,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-20 11:09:29,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0 2023-06-20 11:09:38,688 INFO [train.py:996] (2/4) Epoch 3, batch 30100, loss[loss=0.2212, simple_loss=0.2713, pruned_loss=0.08559, over 21181.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3307, pruned_loss=0.09213, over 4286496.96 frames. ], batch size: 549, lr: 9.99e-03, grad_scale: 32.0 2023-06-20 11:09:39,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=546534.0, ans=0.125 2023-06-20 11:09:41,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.525e+02 3.109e+02 3.772e+02 7.845e+02, threshold=6.218e+02, percent-clipped=1.0 2023-06-20 11:10:04,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=546594.0, ans=0.125 2023-06-20 11:10:08,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=546594.0, ans=0.125 2023-06-20 11:10:19,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=546654.0, ans=0.1 2023-06-20 11:10:49,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.83 vs. 
limit=15.0 2023-06-20 11:10:55,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=546714.0, ans=0.0 2023-06-20 11:11:14,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-20 11:11:23,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=546774.0, ans=0.125 2023-06-20 11:11:25,729 INFO [train.py:996] (2/4) Epoch 3, batch 30150, loss[loss=0.2742, simple_loss=0.3345, pruned_loss=0.1069, over 21973.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3278, pruned_loss=0.09472, over 4287186.79 frames. ], batch size: 317, lr: 9.98e-03, grad_scale: 32.0 2023-06-20 11:11:59,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=546894.0, ans=0.125 2023-06-20 11:12:16,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-20 11:12:45,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=546954.0, ans=0.1 2023-06-20 11:13:23,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=547014.0, ans=0.0 2023-06-20 11:13:25,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=547014.0, ans=0.1 2023-06-20 11:13:43,231 INFO [train.py:996] (2/4) Epoch 3, batch 30200, loss[loss=0.2405, simple_loss=0.3276, pruned_loss=0.07669, over 21626.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.331, pruned_loss=0.09398, over 4282626.98 frames. ], batch size: 263, lr: 9.98e-03, grad_scale: 32.0 2023-06-20 11:13:46,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.482e+02 2.831e+02 3.246e+02 4.619e+02, threshold=5.661e+02, percent-clipped=0.0 2023-06-20 11:14:20,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=547194.0, ans=0.2 2023-06-20 11:14:35,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-20 11:15:08,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=547314.0, ans=0.0 2023-06-20 11:15:24,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-20 11:15:26,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=547374.0, ans=0.0 2023-06-20 11:15:35,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=547374.0, ans=0.0 2023-06-20 11:16:00,891 INFO [train.py:996] (2/4) Epoch 3, batch 30250, loss[loss=0.2996, simple_loss=0.3973, pruned_loss=0.1009, over 21872.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3374, pruned_loss=0.09583, over 4281014.89 frames. 
], batch size: 372, lr: 9.98e-03, grad_scale: 16.0 2023-06-20 11:18:02,167 INFO [train.py:996] (2/4) Epoch 3, batch 30300, loss[loss=0.2224, simple_loss=0.28, pruned_loss=0.08242, over 21464.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3327, pruned_loss=0.09474, over 4285126.18 frames. ], batch size: 195, lr: 9.97e-03, grad_scale: 16.0 2023-06-20 11:18:06,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.736e+02 3.183e+02 3.962e+02 5.943e+02, threshold=6.366e+02, percent-clipped=1.0 2023-06-20 11:18:19,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=547734.0, ans=0.2 2023-06-20 11:18:20,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=547734.0, ans=0.0 2023-06-20 11:20:01,376 INFO [train.py:996] (2/4) Epoch 3, batch 30350, loss[loss=0.2064, simple_loss=0.2599, pruned_loss=0.07642, over 21296.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.335, pruned_loss=0.09644, over 4277961.80 frames. ], batch size: 176, lr: 9.97e-03, grad_scale: 16.0 2023-06-20 11:20:09,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=548034.0, ans=0.125 2023-06-20 11:20:31,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=548094.0, ans=0.125 2023-06-20 11:20:31,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=548094.0, ans=0.1 2023-06-20 11:21:43,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=548214.0, ans=0.125 2023-06-20 11:22:57,190 INFO [train.py:996] (2/4) Epoch 3, batch 30400, loss[loss=0.2466, simple_loss=0.2816, pruned_loss=0.1058, over 20367.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3256, pruned_loss=0.09315, over 4270508.75 frames. ], batch size: 703, lr: 9.97e-03, grad_scale: 32.0 2023-06-20 11:23:00,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.083e+02 3.599e+02 4.389e+02 8.139e+02, threshold=7.198e+02, percent-clipped=3.0 2023-06-20 11:25:35,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-20 11:26:19,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=548574.0, ans=0.125 2023-06-20 11:26:54,477 INFO [train.py:996] (2/4) Epoch 3, batch 30450, loss[loss=0.3226, simple_loss=0.4299, pruned_loss=0.1077, over 19850.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3273, pruned_loss=0.09435, over 4209448.21 frames. 
], batch size: 702, lr: 9.97e-03, grad_scale: 32.0 2023-06-20 11:27:18,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=548634.0, ans=0.1 2023-06-20 11:28:17,317 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:28:18,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=548694.0, ans=0.0 2023-06-20 11:29:35,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=548814.0, ans=0.1 2023-06-20 11:32:21,684 INFO [train.py:996] (2/4) Epoch 4, batch 0, loss[loss=0.2338, simple_loss=0.2949, pruned_loss=0.08636, over 21498.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.2949, pruned_loss=0.08636, over 21498.00 frames. ], batch size: 212, lr: 8.60e-03, grad_scale: 32.0 2023-06-20 11:32:21,684 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 11:33:10,766 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2494, simple_loss=0.3589, pruned_loss=0.06994, over 1796401.00 frames. 2023-06-20 11:33:10,767 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 11:33:23,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.619e+02 4.575e+02 6.276e+02 9.904e+02 2.096e+03, threshold=1.255e+03, percent-clipped=39.0 2023-06-20 11:34:02,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=549024.0, ans=0.125 2023-06-20 11:34:18,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=549084.0, ans=0.125 2023-06-20 11:34:52,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=549204.0, ans=0.125 2023-06-20 11:34:53,769 INFO [train.py:996] (2/4) Epoch 4, batch 50, loss[loss=0.2536, simple_loss=0.3308, pruned_loss=0.08816, over 21430.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3262, pruned_loss=0.09076, over 952269.44 frames. ], batch size: 211, lr: 8.60e-03, grad_scale: 32.0 2023-06-20 11:34:54,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=549204.0, ans=0.1 2023-06-20 11:35:20,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=549264.0, ans=0.0 2023-06-20 11:35:34,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.38 vs. limit=15.0 2023-06-20 11:36:17,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=549384.0, ans=0.0 2023-06-20 11:36:21,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=549384.0, ans=0.09899494936611666 2023-06-20 11:36:48,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=549444.0, ans=0.1 2023-06-20 11:36:59,300 INFO [train.py:996] (2/4) Epoch 4, batch 100, loss[loss=0.3335, simple_loss=0.3978, pruned_loss=0.1346, over 21734.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3445, pruned_loss=0.09479, over 1691356.90 frames. 
], batch size: 441, lr: 8.60e-03, grad_scale: 32.0 2023-06-20 11:37:14,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=549504.0, ans=0.0 2023-06-20 11:37:23,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=549504.0, ans=0.125 2023-06-20 11:37:24,735 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.415e+02 2.758e+02 3.125e+02 7.692e+02, threshold=5.515e+02, percent-clipped=0.0 2023-06-20 11:38:24,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=549684.0, ans=0.125 2023-06-20 11:38:27,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=549744.0, ans=0.125 2023-06-20 11:38:31,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=549744.0, ans=0.125 2023-06-20 11:38:35,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=549744.0, ans=0.125 2023-06-20 11:38:37,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-20 11:38:47,461 INFO [train.py:996] (2/4) Epoch 4, batch 150, loss[loss=0.223, simple_loss=0.3072, pruned_loss=0.06944, over 21804.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.345, pruned_loss=0.09366, over 2266167.12 frames. ], batch size: 282, lr: 8.59e-03, grad_scale: 32.0 2023-06-20 11:39:08,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=549864.0, ans=0.125 2023-06-20 11:39:24,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=549924.0, ans=0.5 2023-06-20 11:40:18,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-20 11:40:46,819 INFO [train.py:996] (2/4) Epoch 4, batch 200, loss[loss=0.261, simple_loss=0.3289, pruned_loss=0.09651, over 21258.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3427, pruned_loss=0.09378, over 2717981.08 frames. ], batch size: 143, lr: 8.59e-03, grad_scale: 32.0 2023-06-20 11:41:01,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=550104.0, ans=0.95 2023-06-20 11:41:04,824 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.490e+02 2.754e+02 3.308e+02 4.592e+02, threshold=5.508e+02, percent-clipped=0.0 2023-06-20 11:41:34,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=550224.0, ans=0.125 2023-06-20 11:42:02,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=550284.0, ans=0.125 2023-06-20 11:42:18,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-20 11:42:45,738 INFO [train.py:996] (2/4) Epoch 4, batch 250, loss[loss=0.2192, simple_loss=0.2887, pruned_loss=0.07482, over 21822.00 frames. 
], tot_loss[loss=0.2621, simple_loss=0.3384, pruned_loss=0.09289, over 3060694.00 frames. ], batch size: 107, lr: 8.59e-03, grad_scale: 32.0 2023-06-20 11:42:57,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=550404.0, ans=0.125 2023-06-20 11:44:08,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=550524.0, ans=0.0 2023-06-20 11:44:09,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=550524.0, ans=0.025 2023-06-20 11:44:43,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-20 11:44:52,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.62 vs. limit=10.0 2023-06-20 11:45:28,944 INFO [train.py:996] (2/4) Epoch 4, batch 300, loss[loss=0.2403, simple_loss=0.3045, pruned_loss=0.08805, over 20958.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3323, pruned_loss=0.09182, over 3329085.05 frames. ], batch size: 607, lr: 8.59e-03, grad_scale: 32.0 2023-06-20 11:45:43,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=550704.0, ans=0.1 2023-06-20 11:45:47,473 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.545e+02 3.030e+02 3.596e+02 5.664e+02, threshold=6.060e+02, percent-clipped=1.0 2023-06-20 11:46:16,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=550764.0, ans=0.125 2023-06-20 11:47:39,859 INFO [train.py:996] (2/4) Epoch 4, batch 350, loss[loss=0.2306, simple_loss=0.3276, pruned_loss=0.06685, over 21782.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3239, pruned_loss=0.09062, over 3534675.73 frames. ], batch size: 351, lr: 8.59e-03, grad_scale: 32.0 2023-06-20 11:48:05,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=551004.0, ans=0.125 2023-06-20 11:48:35,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=551064.0, ans=0.125 2023-06-20 11:49:49,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-20 11:50:13,570 INFO [train.py:996] (2/4) Epoch 4, batch 400, loss[loss=0.2685, simple_loss=0.3208, pruned_loss=0.1081, over 21322.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3153, pruned_loss=0.08779, over 3695767.40 frames. 
], batch size: 471, lr: 8.58e-03, grad_scale: 32.0 2023-06-20 11:50:26,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=551304.0, ans=0.125 2023-06-20 11:50:36,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.584e+02 2.891e+02 3.548e+02 6.771e+02, threshold=5.782e+02, percent-clipped=1.0 2023-06-20 11:50:46,891 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:52:30,845 INFO [train.py:996] (2/4) Epoch 4, batch 450, loss[loss=0.2678, simple_loss=0.3573, pruned_loss=0.08912, over 21228.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3184, pruned_loss=0.08862, over 3830216.18 frames. ], batch size: 548, lr: 8.58e-03, grad_scale: 32.0 2023-06-20 11:53:00,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-20 11:53:24,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=551664.0, ans=0.04949747468305833 2023-06-20 11:53:45,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=551784.0, ans=0.1 2023-06-20 11:54:07,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=551844.0, ans=0.125 2023-06-20 11:54:15,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=551844.0, ans=0.2 2023-06-20 11:54:32,205 INFO [train.py:996] (2/4) Epoch 4, batch 500, loss[loss=0.2714, simple_loss=0.3728, pruned_loss=0.08504, over 21771.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3182, pruned_loss=0.08584, over 3930720.94 frames. ], batch size: 351, lr: 8.58e-03, grad_scale: 32.0 2023-06-20 11:54:57,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.704e+02 3.098e+02 4.611e+02 6.929e+02, threshold=6.196e+02, percent-clipped=8.0 2023-06-20 11:55:23,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=552024.0, ans=10.0 2023-06-20 11:56:23,607 INFO [train.py:996] (2/4) Epoch 4, batch 550, loss[loss=0.2843, simple_loss=0.3401, pruned_loss=0.1143, over 21863.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3204, pruned_loss=0.08563, over 4013136.18 frames. ], batch size: 371, lr: 8.58e-03, grad_scale: 32.0 2023-06-20 11:56:44,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-20 11:57:03,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=552264.0, ans=0.125 2023-06-20 11:57:27,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-20 11:58:45,244 INFO [train.py:996] (2/4) Epoch 4, batch 600, loss[loss=0.2436, simple_loss=0.3133, pruned_loss=0.08689, over 21914.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3225, pruned_loss=0.08584, over 4071695.39 frames. 
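
A note on the Clipping_scale entries above: in each such line the reported threshold is close to clipping_scale times the middle value of the five grad-norm statistics (for instance 2.0 * 2.891e+02 ≈ 5.782e+02 in the entry above), and percent-clipped reports how often recent gradient norms exceeded that threshold. The sketch below is only a hypothetical illustration of that idea in PyTorch; the function name and structure are not taken from optim.py.

import torch

def clip_by_median_norm(params, recent_norms, clipping_scale=2.0, keep_last=1000):
    # Hypothetical helper: clip gradients against clipping_scale * median of
    # recently observed total gradient norms (an illustration, not optim.py).
    grads = [p.grad for p in params if p.grad is not None]
    total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    recent_norms.append(total_norm.item())
    del recent_norms[:-keep_last]                       # keep a bounded history
    stats = torch.quantile(torch.tensor(recent_norms),
                           torch.tensor([0.0, 0.25, 0.50, 0.75, 1.00]))
    threshold = clipping_scale * stats[2]               # 2.0 * median, as in the log
    clipped = bool(total_norm > threshold)
    if clipped:                                         # scale all gradients down
        for g in grads:
            g.mul_(threshold / total_norm)
    return total_norm.item(), threshold.item(), clipped
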
], batch size: 351, lr: 8.57e-03, grad_scale: 32.0 2023-06-20 11:58:58,257 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.831e+02 3.316e+02 4.076e+02 6.310e+02, threshold=6.631e+02, percent-clipped=1.0 2023-06-20 11:59:32,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=552624.0, ans=0.1 2023-06-20 11:59:35,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-20 11:59:41,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=552684.0, ans=0.0 2023-06-20 11:59:54,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-20 12:00:28,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=552804.0, ans=0.2 2023-06-20 12:00:29,236 INFO [train.py:996] (2/4) Epoch 4, batch 650, loss[loss=0.231, simple_loss=0.2984, pruned_loss=0.08179, over 21854.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3227, pruned_loss=0.086, over 4113127.63 frames. ], batch size: 98, lr: 8.57e-03, grad_scale: 16.0 2023-06-20 12:00:36,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=552804.0, ans=0.0 2023-06-20 12:00:39,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=552804.0, ans=0.2 2023-06-20 12:01:06,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552864.0, ans=0.1 2023-06-20 12:02:19,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=553044.0, ans=0.125 2023-06-20 12:02:45,254 INFO [train.py:996] (2/4) Epoch 4, batch 700, loss[loss=0.2349, simple_loss=0.2966, pruned_loss=0.08662, over 21206.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.322, pruned_loss=0.08748, over 4148736.27 frames. ], batch size: 159, lr: 8.57e-03, grad_scale: 16.0 2023-06-20 12:02:55,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=553104.0, ans=0.125 2023-06-20 12:02:59,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.784e+02 3.467e+02 4.888e+02 7.822e+02, threshold=6.935e+02, percent-clipped=3.0 2023-06-20 12:03:27,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-20 12:03:50,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-20 12:04:02,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=553344.0, ans=0.02 2023-06-20 12:04:16,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=553344.0, ans=0.125 2023-06-20 12:04:28,794 INFO [train.py:996] (2/4) Epoch 4, batch 750, loss[loss=0.2456, simple_loss=0.3093, pruned_loss=0.091, over 21531.00 frames. 
], tot_loss[loss=0.2476, simple_loss=0.3198, pruned_loss=0.08775, over 4184826.35 frames. ], batch size: 263, lr: 8.57e-03, grad_scale: 16.0 2023-06-20 12:04:40,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553404.0, ans=0.1 2023-06-20 12:04:51,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=553464.0, ans=0.125 2023-06-20 12:05:01,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553464.0, ans=0.1 2023-06-20 12:05:25,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-06-20 12:06:03,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=553644.0, ans=0.0 2023-06-20 12:06:08,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=553644.0, ans=0.125 2023-06-20 12:06:11,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=15.0 2023-06-20 12:06:12,195 INFO [train.py:996] (2/4) Epoch 4, batch 800, loss[loss=0.2698, simple_loss=0.3276, pruned_loss=0.106, over 21877.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3178, pruned_loss=0.08801, over 4193832.99 frames. ], batch size: 414, lr: 8.56e-03, grad_scale: 32.0 2023-06-20 12:06:43,595 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.671e+02 3.153e+02 3.774e+02 5.879e+02, threshold=6.307e+02, percent-clipped=0.0 2023-06-20 12:07:04,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=553764.0, ans=0.0 2023-06-20 12:07:09,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=553764.0, ans=0.125 2023-06-20 12:07:16,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=553824.0, ans=0.0 2023-06-20 12:07:29,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=553824.0, ans=0.125 2023-06-20 12:07:32,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553884.0, ans=0.1 2023-06-20 12:07:38,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=553884.0, ans=0.125 2023-06-20 12:08:29,752 INFO [train.py:996] (2/4) Epoch 4, batch 850, loss[loss=0.2615, simple_loss=0.323, pruned_loss=0.09998, over 21926.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3211, pruned_loss=0.08921, over 4213524.78 frames. 
], batch size: 124, lr: 8.56e-03, grad_scale: 32.0 2023-06-20 12:09:01,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=554064.0, ans=0.125 2023-06-20 12:09:06,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=554124.0, ans=0.0 2023-06-20 12:09:40,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=554244.0, ans=0.0 2023-06-20 12:10:07,103 INFO [train.py:996] (2/4) Epoch 4, batch 900, loss[loss=0.2346, simple_loss=0.3024, pruned_loss=0.0834, over 21816.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3183, pruned_loss=0.08846, over 4231719.52 frames. ], batch size: 247, lr: 8.56e-03, grad_scale: 32.0 2023-06-20 12:10:36,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=554304.0, ans=0.125 2023-06-20 12:10:42,156 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.695e+02 3.057e+02 3.508e+02 5.893e+02, threshold=6.115e+02, percent-clipped=0.0 2023-06-20 12:11:07,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=554424.0, ans=0.07 2023-06-20 12:12:17,117 INFO [train.py:996] (2/4) Epoch 4, batch 950, loss[loss=0.2257, simple_loss=0.2715, pruned_loss=0.08999, over 21175.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.315, pruned_loss=0.08779, over 4247132.77 frames. ], batch size: 548, lr: 8.56e-03, grad_scale: 32.0 2023-06-20 12:12:18,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=554604.0, ans=0.125 2023-06-20 12:12:52,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=554664.0, ans=0.125 2023-06-20 12:13:24,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-20 12:13:34,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=554844.0, ans=0.0 2023-06-20 12:13:53,071 INFO [train.py:996] (2/4) Epoch 4, batch 1000, loss[loss=0.2054, simple_loss=0.2955, pruned_loss=0.05764, over 21698.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.314, pruned_loss=0.08748, over 4261400.11 frames. ], batch size: 263, lr: 8.56e-03, grad_scale: 32.0 2023-06-20 12:14:11,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=554904.0, ans=0.035 2023-06-20 12:14:14,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.396e+02 2.779e+02 3.341e+02 4.374e+02, threshold=5.558e+02, percent-clipped=0.0 2023-06-20 12:15:34,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=555084.0, ans=0.0 2023-06-20 12:15:54,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=555144.0, ans=0.125 2023-06-20 12:15:57,400 INFO [train.py:996] (2/4) Epoch 4, batch 1050, loss[loss=0.3161, simple_loss=0.4194, pruned_loss=0.1064, over 20899.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3153, pruned_loss=0.08857, over 4269623.38 frames. 
], batch size: 608, lr: 8.55e-03, grad_scale: 32.0 2023-06-20 12:16:09,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=555204.0, ans=0.1 2023-06-20 12:16:09,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.29 vs. limit=15.0 2023-06-20 12:16:22,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=555264.0, ans=10.0 2023-06-20 12:16:51,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=555324.0, ans=0.05 2023-06-20 12:16:51,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-20 12:18:12,963 INFO [train.py:996] (2/4) Epoch 4, batch 1100, loss[loss=0.2139, simple_loss=0.2723, pruned_loss=0.07775, over 21245.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3167, pruned_loss=0.08868, over 4276041.05 frames. ], batch size: 548, lr: 8.55e-03, grad_scale: 16.0 2023-06-20 12:18:28,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=555564.0, ans=0.025 2023-06-20 12:18:29,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.724e+02 3.414e+02 3.990e+02 8.036e+02, threshold=6.829e+02, percent-clipped=6.0 2023-06-20 12:18:50,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=555624.0, ans=22.5 2023-06-20 12:19:13,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=555684.0, ans=0.2 2023-06-20 12:19:51,993 INFO [train.py:996] (2/4) Epoch 4, batch 1150, loss[loss=0.2453, simple_loss=0.3264, pruned_loss=0.08213, over 21758.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3161, pruned_loss=0.08773, over 4274248.64 frames. ], batch size: 298, lr: 8.55e-03, grad_scale: 16.0 2023-06-20 12:20:42,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-06-20 12:21:09,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=555984.0, ans=0.125 2023-06-20 12:21:12,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=555984.0, ans=0.125 2023-06-20 12:21:27,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=555984.0, ans=0.0 2023-06-20 12:21:56,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=556044.0, ans=0.125 2023-06-20 12:22:03,436 INFO [train.py:996] (2/4) Epoch 4, batch 1200, loss[loss=0.2927, simple_loss=0.3543, pruned_loss=0.1156, over 21884.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3168, pruned_loss=0.08786, over 4278373.00 frames. 
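
A note on the ScheduledFloat entries: each reports the current value (ans=...) of a scheduled hyperparameter such as a dropout probability, skip rate, or balancer bound, evaluated at the current batch_count. A minimal piecewise-linear schedule of this kind might look like the sketch below; it is a simplified, hypothetical stand-in, not the scaling.py implementation.

import bisect

class PiecewiseSchedule:
    # Hypothetical stand-in for a batch_count-scheduled float hyperparameter.
    def __init__(self, *points):
        # points: (batch_count, value) pairs in increasing batch_count order,
        # e.g. (0.0, 0.3), (20000.0, 0.1)
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def value(self, batch_count: float) -> float:
        i = bisect.bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]
        if i == len(self.xs):
            return self.ys[-1]
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

dropout_p = PiecewiseSchedule((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value(548634.0))   # 0.1, i.e. an "ans=0.1"-style value late in training
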
], batch size: 371, lr: 8.55e-03, grad_scale: 32.0 2023-06-20 12:22:33,898 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.663e+02 2.998e+02 3.388e+02 6.421e+02, threshold=5.997e+02, percent-clipped=0.0 2023-06-20 12:22:37,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=556164.0, ans=0.125 2023-06-20 12:23:16,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=556224.0, ans=0.125 2023-06-20 12:23:20,042 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.02 vs. limit=12.0 2023-06-20 12:23:47,742 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:23:50,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=556344.0, ans=0.125 2023-06-20 12:23:50,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=556344.0, ans=0.2 2023-06-20 12:24:08,861 INFO [train.py:996] (2/4) Epoch 4, batch 1250, loss[loss=0.2679, simple_loss=0.3392, pruned_loss=0.09833, over 21672.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3182, pruned_loss=0.08809, over 4278082.24 frames. ], batch size: 351, lr: 8.54e-03, grad_scale: 32.0 2023-06-20 12:26:04,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-20 12:26:14,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-06-20 12:26:17,531 INFO [train.py:996] (2/4) Epoch 4, batch 1300, loss[loss=0.2457, simple_loss=0.32, pruned_loss=0.08569, over 21915.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3191, pruned_loss=0.0891, over 4277369.55 frames. ], batch size: 351, lr: 8.54e-03, grad_scale: 32.0 2023-06-20 12:26:43,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.594e+02 2.933e+02 3.629e+02 6.395e+02, threshold=5.867e+02, percent-clipped=1.0 2023-06-20 12:27:23,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-20 12:27:25,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=556884.0, ans=0.0 2023-06-20 12:27:59,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=556944.0, ans=0.95 2023-06-20 12:28:26,529 INFO [train.py:996] (2/4) Epoch 4, batch 1350, loss[loss=0.2304, simple_loss=0.2933, pruned_loss=0.08373, over 21688.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3185, pruned_loss=0.08881, over 4279362.11 frames. ], batch size: 230, lr: 8.54e-03, grad_scale: 32.0 2023-06-20 12:28:29,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.05 vs. 
limit=12.0 2023-06-20 12:28:41,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=557004.0, ans=0.0 2023-06-20 12:29:01,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=557064.0, ans=0.5 2023-06-20 12:29:18,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=557124.0, ans=0.2 2023-06-20 12:29:38,111 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2023-06-20 12:29:43,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=557184.0, ans=0.0 2023-06-20 12:29:52,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=12.0 2023-06-20 12:30:31,698 INFO [train.py:996] (2/4) Epoch 4, batch 1400, loss[loss=0.213, simple_loss=0.2757, pruned_loss=0.0752, over 21652.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3173, pruned_loss=0.08909, over 4287828.15 frames. ], batch size: 282, lr: 8.54e-03, grad_scale: 32.0 2023-06-20 12:30:58,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.685e+02 2.934e+02 3.988e+02 7.934e+02, threshold=5.867e+02, percent-clipped=7.0 2023-06-20 12:31:00,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=557364.0, ans=0.0 2023-06-20 12:31:29,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=557424.0, ans=0.125 2023-06-20 12:31:43,579 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:32:12,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=557544.0, ans=0.0 2023-06-20 12:32:45,059 INFO [train.py:996] (2/4) Epoch 4, batch 1450, loss[loss=0.2745, simple_loss=0.3398, pruned_loss=0.1046, over 21817.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.316, pruned_loss=0.08948, over 4295860.98 frames. ], batch size: 282, lr: 8.54e-03, grad_scale: 32.0 2023-06-20 12:32:51,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-20 12:32:57,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=557604.0, ans=0.07 2023-06-20 12:33:29,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=557724.0, ans=0.05 2023-06-20 12:33:53,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=557724.0, ans=0.1 2023-06-20 12:34:53,735 INFO [train.py:996] (2/4) Epoch 4, batch 1500, loss[loss=0.2466, simple_loss=0.315, pruned_loss=0.08914, over 21900.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.319, pruned_loss=0.09102, over 4294607.17 frames. 
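
A note on the Whitening entries, which compare a per-module metric against a limit: one common way to quantify how far a set of activations is from being "white" is the ratio of the mean squared eigenvalue of their covariance to the squared mean eigenvalue, which equals 1.0 exactly when the covariance is a multiple of the identity and grows as the spectrum becomes less uniform. The sketch below illustrates that measure under this assumption; it is not a transcription of the Whiten module in scaling.py.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # Hypothetical whiteness measure for activations x of shape (frames, channels):
    # mean(eig(C)^2) / mean(eig(C))^2 for the per-group covariance C.
    # Equals 1.0 iff C is a multiple of the identity; larger means "less white".
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)   # (groups, n, c/g)
    cov = torch.matmul(x.transpose(1, 2), x) / n                    # (groups, c/g, c/g)
    mean_eig_sq = (cov * cov).sum(dim=(1, 2)) / cov.shape[-1]       # trace(C^2) / d
    mean_eig = torch.diagonal(cov, dim1=1, dim2=2).mean(dim=1)      # trace(C) / d
    return (mean_eig_sq / mean_eig ** 2).mean().item()

x = torch.randn(1000, 256)
metric = whitening_metric(x, num_groups=1)
limit = 12.0
print(f"metric={metric:.2f} vs. limit={limit}")   # act only when metric exceeds the limit
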
], batch size: 333, lr: 8.53e-03, grad_scale: 32.0 2023-06-20 12:35:06,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=557904.0, ans=0.0 2023-06-20 12:35:14,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=557904.0, ans=0.125 2023-06-20 12:35:18,457 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 2.667e+02 3.053e+02 3.442e+02 4.744e+02, threshold=6.106e+02, percent-clipped=0.0 2023-06-20 12:35:21,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=557964.0, ans=0.2 2023-06-20 12:35:37,135 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:35:41,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=558024.0, ans=0.1 2023-06-20 12:35:51,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=558024.0, ans=0.125 2023-06-20 12:37:17,107 INFO [train.py:996] (2/4) Epoch 4, batch 1550, loss[loss=0.2222, simple_loss=0.3126, pruned_loss=0.06588, over 21756.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3175, pruned_loss=0.0897, over 4297520.52 frames. ], batch size: 332, lr: 8.53e-03, grad_scale: 32.0 2023-06-20 12:37:50,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=558264.0, ans=0.2 2023-06-20 12:37:59,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=558324.0, ans=0.125 2023-06-20 12:38:00,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=558324.0, ans=0.125 2023-06-20 12:38:15,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-06-20 12:38:17,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=558384.0, ans=0.0 2023-06-20 12:39:11,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=22.5 2023-06-20 12:39:13,417 INFO [train.py:996] (2/4) Epoch 4, batch 1600, loss[loss=0.232, simple_loss=0.3008, pruned_loss=0.08161, over 21714.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.317, pruned_loss=0.08901, over 4291149.29 frames. ], batch size: 332, lr: 8.53e-03, grad_scale: 32.0 2023-06-20 12:39:29,431 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.678e+02 3.179e+02 3.660e+02 5.340e+02, threshold=6.358e+02, percent-clipped=0.0 2023-06-20 12:39:54,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=558624.0, ans=0.0 2023-06-20 12:41:19,038 INFO [train.py:996] (2/4) Epoch 4, batch 1650, loss[loss=0.2154, simple_loss=0.275, pruned_loss=0.07788, over 21488.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3167, pruned_loss=0.08848, over 4291012.21 frames. 
], batch size: 212, lr: 8.53e-03, grad_scale: 32.0 2023-06-20 12:41:41,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=558864.0, ans=0.0 2023-06-20 12:42:01,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=558864.0, ans=0.125 2023-06-20 12:42:23,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=558924.0, ans=0.0 2023-06-20 12:42:46,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=558984.0, ans=0.5 2023-06-20 12:42:50,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=558984.0, ans=0.125 2023-06-20 12:43:28,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=12.0 2023-06-20 12:43:30,334 INFO [train.py:996] (2/4) Epoch 4, batch 1700, loss[loss=0.2639, simple_loss=0.3052, pruned_loss=0.1112, over 21412.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3217, pruned_loss=0.09191, over 4284568.62 frames. ], batch size: 473, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:43:30,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=559104.0, ans=0.0 2023-06-20 12:43:58,558 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.589e+02 2.821e+02 3.325e+02 4.601e+02, threshold=5.642e+02, percent-clipped=0.0 2023-06-20 12:44:40,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-20 12:45:22,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=559344.0, ans=0.125 2023-06-20 12:45:24,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=559344.0, ans=0.125 2023-06-20 12:45:35,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-20 12:45:49,023 INFO [train.py:996] (2/4) Epoch 4, batch 1750, loss[loss=0.1935, simple_loss=0.2751, pruned_loss=0.056, over 21560.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3201, pruned_loss=0.08911, over 4279865.23 frames. ], batch size: 212, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:46:59,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=559524.0, ans=0.1 2023-06-20 12:47:08,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=22.5 2023-06-20 12:48:18,424 INFO [train.py:996] (2/4) Epoch 4, batch 1800, loss[loss=0.2275, simple_loss=0.3284, pruned_loss=0.06333, over 21726.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3163, pruned_loss=0.08539, over 4273541.21 frames. 
], batch size: 332, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:48:34,541 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.695e+02 3.353e+02 3.851e+02 6.055e+02, threshold=6.706e+02, percent-clipped=1.0 2023-06-20 12:50:15,943 INFO [train.py:996] (2/4) Epoch 4, batch 1850, loss[loss=0.2834, simple_loss=0.3813, pruned_loss=0.09269, over 20825.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3169, pruned_loss=0.08336, over 4265840.15 frames. ], batch size: 607, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:52:18,569 INFO [train.py:996] (2/4) Epoch 4, batch 1900, loss[loss=0.2177, simple_loss=0.281, pruned_loss=0.0772, over 21382.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3178, pruned_loss=0.08467, over 4272645.77 frames. ], batch size: 159, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:52:19,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=560304.0, ans=0.125 2023-06-20 12:52:20,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.80 vs. limit=8.0 2023-06-20 12:52:40,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.526e+02 2.932e+02 3.560e+02 6.916e+02, threshold=5.863e+02, percent-clipped=1.0 2023-06-20 12:53:05,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=560424.0, ans=0.2 2023-06-20 12:54:36,912 INFO [train.py:996] (2/4) Epoch 4, batch 1950, loss[loss=0.2473, simple_loss=0.3158, pruned_loss=0.08935, over 21259.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3138, pruned_loss=0.0841, over 4271644.69 frames. ], batch size: 549, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:55:22,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=560724.0, ans=0.0 2023-06-20 12:56:05,111 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-20 12:56:12,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=560844.0, ans=0.125 2023-06-20 12:56:40,393 INFO [train.py:996] (2/4) Epoch 4, batch 2000, loss[loss=0.2487, simple_loss=0.3308, pruned_loss=0.08331, over 21739.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3078, pruned_loss=0.08222, over 4268846.25 frames. ], batch size: 332, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:57:18,136 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.659e+02 3.074e+02 3.933e+02 7.372e+02, threshold=6.149e+02, percent-clipped=6.0 2023-06-20 12:57:35,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=561024.0, ans=0.125 2023-06-20 12:58:30,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=561144.0, ans=0.2 2023-06-20 12:58:31,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=22.5 2023-06-20 12:58:37,542 INFO [train.py:996] (2/4) Epoch 4, batch 2050, loss[loss=0.2189, simple_loss=0.294, pruned_loss=0.07184, over 21445.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3108, pruned_loss=0.08221, over 4276701.75 frames. 
], batch size: 131, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:58:37,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=561204.0, ans=0.125 2023-06-20 12:59:29,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-06-20 12:59:40,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.41 vs. limit=6.0 2023-06-20 12:59:43,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=561324.0, ans=0.0 2023-06-20 13:01:09,989 INFO [train.py:996] (2/4) Epoch 4, batch 2100, loss[loss=0.2637, simple_loss=0.309, pruned_loss=0.1092, over 19985.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.315, pruned_loss=0.08517, over 4276051.85 frames. ], batch size: 702, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 13:01:37,540 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.531e+02 2.915e+02 3.438e+02 5.326e+02, threshold=5.830e+02, percent-clipped=0.0 2023-06-20 13:01:38,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=561564.0, ans=0.2 2023-06-20 13:01:40,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=561564.0, ans=0.2 2023-06-20 13:01:40,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=561564.0, ans=0.125 2023-06-20 13:01:48,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-20 13:03:06,315 INFO [train.py:996] (2/4) Epoch 4, batch 2150, loss[loss=0.2479, simple_loss=0.3329, pruned_loss=0.08152, over 21636.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3151, pruned_loss=0.08747, over 4269812.47 frames. ], batch size: 298, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:03:47,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=561864.0, ans=0.95 2023-06-20 13:03:49,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=561864.0, ans=0.2 2023-06-20 13:04:06,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=561924.0, ans=0.125 2023-06-20 13:04:58,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=562044.0, ans=0.125 2023-06-20 13:05:32,864 INFO [train.py:996] (2/4) Epoch 4, batch 2200, loss[loss=0.2812, simple_loss=0.3424, pruned_loss=0.11, over 21800.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3154, pruned_loss=0.08586, over 4264554.17 frames. ], batch size: 441, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:06:00,308 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.421e+02 2.845e+02 3.269e+02 4.462e+02, threshold=5.690e+02, percent-clipped=0.0 2023-06-20 13:07:40,843 INFO [train.py:996] (2/4) Epoch 4, batch 2250, loss[loss=0.1862, simple_loss=0.2536, pruned_loss=0.05939, over 21796.00 frames. 
], tot_loss[loss=0.2402, simple_loss=0.313, pruned_loss=0.08373, over 4264627.33 frames. ], batch size: 124, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:08:36,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=562524.0, ans=0.0 2023-06-20 13:09:51,962 INFO [train.py:996] (2/4) Epoch 4, batch 2300, loss[loss=0.2671, simple_loss=0.2991, pruned_loss=0.1176, over 21332.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3097, pruned_loss=0.08408, over 4257821.25 frames. ], batch size: 507, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:10:19,785 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.850e+02 3.293e+02 4.156e+02 7.467e+02, threshold=6.587e+02, percent-clipped=11.0 2023-06-20 13:11:44,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=562944.0, ans=0.125 2023-06-20 13:11:49,248 INFO [train.py:996] (2/4) Epoch 4, batch 2350, loss[loss=0.1977, simple_loss=0.2691, pruned_loss=0.06309, over 21386.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3063, pruned_loss=0.08439, over 4265894.96 frames. ], batch size: 211, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:12:38,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=563064.0, ans=0.95 2023-06-20 13:12:44,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=563064.0, ans=0.0 2023-06-20 13:12:48,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=563124.0, ans=0.125 2023-06-20 13:12:54,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=563124.0, ans=0.1 2023-06-20 13:13:21,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=15.0 2023-06-20 13:13:42,850 INFO [train.py:996] (2/4) Epoch 4, batch 2400, loss[loss=0.2895, simple_loss=0.3521, pruned_loss=0.1135, over 21553.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3136, pruned_loss=0.08796, over 4268732.13 frames. ], batch size: 389, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:13:46,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=563304.0, ans=0.125 2023-06-20 13:14:05,168 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.727e+02 3.231e+02 4.252e+02 7.805e+02, threshold=6.463e+02, percent-clipped=2.0 2023-06-20 13:15:50,103 INFO [train.py:996] (2/4) Epoch 4, batch 2450, loss[loss=0.2479, simple_loss=0.3094, pruned_loss=0.0932, over 21124.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3204, pruned_loss=0.0907, over 4274536.94 frames. ], batch size: 143, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:15:56,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-20 13:16:05,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.97 vs. 
limit=22.5 2023-06-20 13:16:06,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=563604.0, ans=0.125 2023-06-20 13:16:07,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-06-20 13:16:32,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=563664.0, ans=0.2 2023-06-20 13:17:38,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=563844.0, ans=0.1 2023-06-20 13:17:51,339 INFO [train.py:996] (2/4) Epoch 4, batch 2500, loss[loss=0.2253, simple_loss=0.3157, pruned_loss=0.06743, over 21392.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3174, pruned_loss=0.08872, over 4264359.27 frames. ], batch size: 194, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:18:25,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.483e+02 2.988e+02 3.628e+02 7.218e+02, threshold=5.976e+02, percent-clipped=3.0 2023-06-20 13:19:17,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=564084.0, ans=0.125 2023-06-20 13:19:46,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=564144.0, ans=0.1 2023-06-20 13:20:01,570 INFO [train.py:996] (2/4) Epoch 4, batch 2550, loss[loss=0.2445, simple_loss=0.3014, pruned_loss=0.09387, over 21539.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.316, pruned_loss=0.08712, over 4265150.96 frames. ], batch size: 414, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:21:11,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=564384.0, ans=0.125 2023-06-20 13:21:36,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-20 13:21:53,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=564444.0, ans=0.1 2023-06-20 13:22:14,113 INFO [train.py:996] (2/4) Epoch 4, batch 2600, loss[loss=0.2956, simple_loss=0.3559, pruned_loss=0.1177, over 21641.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3189, pruned_loss=0.08961, over 4263545.51 frames. ], batch size: 389, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:22:35,613 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.603e+02 3.043e+02 3.665e+02 5.448e+02, threshold=6.087e+02, percent-clipped=0.0 2023-06-20 13:23:37,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=564684.0, ans=10.0 2023-06-20 13:24:04,590 INFO [train.py:996] (2/4) Epoch 4, batch 2650, loss[loss=0.2697, simple_loss=0.3264, pruned_loss=0.1065, over 21858.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3184, pruned_loss=0.09003, over 4270840.26 frames. 
], batch size: 371, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:25:44,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=564984.0, ans=0.125 2023-06-20 13:26:08,581 INFO [train.py:996] (2/4) Epoch 4, batch 2700, loss[loss=0.2172, simple_loss=0.2848, pruned_loss=0.07477, over 21792.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.312, pruned_loss=0.08793, over 4278347.81 frames. ], batch size: 316, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:26:37,013 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.06 vs. limit=10.0 2023-06-20 13:26:51,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.726e+02 3.108e+02 3.967e+02 7.517e+02, threshold=6.217e+02, percent-clipped=3.0 2023-06-20 13:28:36,563 INFO [train.py:996] (2/4) Epoch 4, batch 2750, loss[loss=0.2437, simple_loss=0.3272, pruned_loss=0.08013, over 17280.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3126, pruned_loss=0.08819, over 4278988.10 frames. ], batch size: 60, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:29:20,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-20 13:31:01,583 INFO [train.py:996] (2/4) Epoch 4, batch 2800, loss[loss=0.2095, simple_loss=0.236, pruned_loss=0.0915, over 16781.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3177, pruned_loss=0.09072, over 4275974.26 frames. ], batch size: 61, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:31:24,212 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.619e+02 3.182e+02 3.699e+02 5.789e+02, threshold=6.364e+02, percent-clipped=0.0 2023-06-20 13:33:03,262 INFO [train.py:996] (2/4) Epoch 4, batch 2850, loss[loss=0.3048, simple_loss=0.3683, pruned_loss=0.1206, over 21402.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3167, pruned_loss=0.09161, over 4266686.20 frames. ], batch size: 507, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:33:06,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=566004.0, ans=0.0 2023-06-20 13:33:13,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=566004.0, ans=0.125 2023-06-20 13:34:12,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=566124.0, ans=0.125 2023-06-20 13:34:13,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=566124.0, ans=0.1 2023-06-20 13:34:25,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=566184.0, ans=0.125 2023-06-20 13:35:20,761 INFO [train.py:996] (2/4) Epoch 4, batch 2900, loss[loss=0.2215, simple_loss=0.294, pruned_loss=0.07448, over 21673.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3155, pruned_loss=0.09087, over 4272216.55 frames. 
], batch size: 263, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:35:43,045 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.738e+02 3.167e+02 3.892e+02 8.808e+02, threshold=6.333e+02, percent-clipped=7.0 2023-06-20 13:36:03,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=566364.0, ans=0.0 2023-06-20 13:37:31,887 INFO [train.py:996] (2/4) Epoch 4, batch 2950, loss[loss=0.2475, simple_loss=0.3485, pruned_loss=0.07324, over 20857.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.319, pruned_loss=0.09093, over 4279842.80 frames. ], batch size: 607, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:37:54,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.04 vs. limit=15.0 2023-06-20 13:38:38,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=566724.0, ans=0.0 2023-06-20 13:39:17,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-20 13:39:55,847 INFO [train.py:996] (2/4) Epoch 4, batch 3000, loss[loss=0.2225, simple_loss=0.3128, pruned_loss=0.06612, over 21439.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3221, pruned_loss=0.09091, over 4280843.69 frames. ], batch size: 194, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:39:55,849 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 13:40:43,515 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2581, simple_loss=0.352, pruned_loss=0.08208, over 1796401.00 frames. 2023-06-20 13:40:43,519 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 13:40:51,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=566904.0, ans=0.125 2023-06-20 13:40:52,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=566904.0, ans=6.0 2023-06-20 13:41:00,248 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.790e+02 3.201e+02 3.672e+02 6.689e+02, threshold=6.402e+02, percent-clipped=1.0 2023-06-20 13:41:58,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=567084.0, ans=0.0 2023-06-20 13:42:39,633 INFO [train.py:996] (2/4) Epoch 4, batch 3050, loss[loss=0.19, simple_loss=0.2817, pruned_loss=0.04914, over 21840.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3241, pruned_loss=0.09068, over 4280401.92 frames. ], batch size: 282, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:44:13,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=567384.0, ans=0.125 2023-06-20 13:44:19,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=567444.0, ans=0.1 2023-06-20 13:44:31,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=567444.0, ans=0.125 2023-06-20 13:44:38,267 INFO [train.py:996] (2/4) Epoch 4, batch 3100, loss[loss=0.3116, simple_loss=0.4081, pruned_loss=0.1075, over 21268.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3237, pruned_loss=0.08968, over 4286722.61 frames. 
], batch size: 548, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:45:06,093 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.520e+02 2.947e+02 3.627e+02 6.337e+02, threshold=5.895e+02, percent-clipped=0.0 2023-06-20 13:45:19,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=567564.0, ans=0.0 2023-06-20 13:45:30,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-20 13:46:03,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=567684.0, ans=0.2 2023-06-20 13:46:03,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=567684.0, ans=0.0 2023-06-20 13:46:21,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=12.0 2023-06-20 13:46:27,458 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:46:34,665 INFO [train.py:996] (2/4) Epoch 4, batch 3150, loss[loss=0.2582, simple_loss=0.334, pruned_loss=0.09116, over 21826.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.325, pruned_loss=0.09026, over 4285448.63 frames. ], batch size: 282, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:46:36,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=567804.0, ans=0.0 2023-06-20 13:46:48,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=567804.0, ans=10.0 2023-06-20 13:48:38,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=568044.0, ans=0.0 2023-06-20 13:48:40,623 INFO [train.py:996] (2/4) Epoch 4, batch 3200, loss[loss=0.1871, simple_loss=0.2604, pruned_loss=0.05695, over 21294.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.327, pruned_loss=0.09078, over 4291250.91 frames. ], batch size: 159, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:49:00,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-20 13:49:07,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=568164.0, ans=0.125 2023-06-20 13:49:09,863 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.444e+02 2.825e+02 3.335e+02 7.265e+02, threshold=5.650e+02, percent-clipped=2.0 2023-06-20 13:49:53,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=568224.0, ans=0.125 2023-06-20 13:50:05,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=568224.0, ans=0.07 2023-06-20 13:50:17,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=568284.0, ans=0.0 2023-06-20 13:50:50,896 INFO [train.py:996] (2/4) Epoch 4, batch 3250, loss[loss=0.2734, simple_loss=0.3304, pruned_loss=0.1083, over 21199.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3285, pruned_loss=0.09288, over 4283256.48 frames. 
], batch size: 176, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:50:54,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=568404.0, ans=0.0 2023-06-20 13:50:56,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=568404.0, ans=0.125 2023-06-20 13:51:01,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=568404.0, ans=0.2 2023-06-20 13:51:14,246 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:51:21,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=568464.0, ans=0.0 2023-06-20 13:51:32,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=568524.0, ans=0.125 2023-06-20 13:51:35,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=568524.0, ans=0.1 2023-06-20 13:51:49,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=568584.0, ans=0.125 2023-06-20 13:52:14,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=568644.0, ans=0.0 2023-06-20 13:52:14,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=568644.0, ans=0.2 2023-06-20 13:52:24,886 INFO [train.py:996] (2/4) Epoch 4, batch 3300, loss[loss=0.2262, simple_loss=0.2907, pruned_loss=0.08081, over 21319.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3233, pruned_loss=0.09253, over 4276665.62 frames. ], batch size: 144, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:52:51,757 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.805e+02 3.280e+02 3.867e+02 6.254e+02, threshold=6.560e+02, percent-clipped=3.0 2023-06-20 13:53:25,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=568824.0, ans=0.0 2023-06-20 13:54:07,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=568944.0, ans=0.125 2023-06-20 13:54:16,872 INFO [train.py:996] (2/4) Epoch 4, batch 3350, loss[loss=0.3317, simple_loss=0.3829, pruned_loss=0.1403, over 21478.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3265, pruned_loss=0.09283, over 4280681.64 frames. ], batch size: 507, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:54:32,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=569064.0, ans=0.125 2023-06-20 13:54:48,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=569064.0, ans=0.125 2023-06-20 13:54:59,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=569124.0, ans=0.04949747468305833 2023-06-20 13:54:59,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.43 vs. 
limit=15.0 2023-06-20 13:55:06,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=569184.0, ans=0.07 2023-06-20 13:55:10,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=569184.0, ans=0.0 2023-06-20 13:55:19,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=569184.0, ans=0.125 2023-06-20 13:55:40,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=569244.0, ans=0.2 2023-06-20 13:56:09,618 INFO [train.py:996] (2/4) Epoch 4, batch 3400, loss[loss=0.2508, simple_loss=0.3365, pruned_loss=0.08255, over 21921.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3255, pruned_loss=0.09319, over 4280515.44 frames. ], batch size: 372, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:56:31,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=569364.0, ans=0.0 2023-06-20 13:56:35,101 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.736e+02 3.073e+02 3.574e+02 7.893e+02, threshold=6.146e+02, percent-clipped=1.0 2023-06-20 13:56:36,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=569364.0, ans=0.2 2023-06-20 13:57:21,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=569484.0, ans=0.5 2023-06-20 13:57:53,502 INFO [train.py:996] (2/4) Epoch 4, batch 3450, loss[loss=0.2519, simple_loss=0.3198, pruned_loss=0.09201, over 21721.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3194, pruned_loss=0.09172, over 4269944.18 frames. ], batch size: 316, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:57:58,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=569604.0, ans=0.125 2023-06-20 13:58:00,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=569604.0, ans=0.125 2023-06-20 13:59:24,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=569844.0, ans=0.2 2023-06-20 14:00:02,856 INFO [train.py:996] (2/4) Epoch 4, batch 3500, loss[loss=0.2527, simple_loss=0.3107, pruned_loss=0.09733, over 21175.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3288, pruned_loss=0.09575, over 4270408.72 frames. 
], batch size: 608, lr: 8.44e-03, grad_scale: 16.0 2023-06-20 14:00:06,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=569904.0, ans=10.0 2023-06-20 14:00:31,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=569964.0, ans=0.125 2023-06-20 14:00:34,137 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.895e+02 3.377e+02 4.214e+02 8.364e+02, threshold=6.755e+02, percent-clipped=8.0 2023-06-20 14:01:29,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=570084.0, ans=0.1 2023-06-20 14:01:41,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=570144.0, ans=0.125 2023-06-20 14:01:59,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-20 14:02:02,766 INFO [train.py:996] (2/4) Epoch 4, batch 3550, loss[loss=0.265, simple_loss=0.3411, pruned_loss=0.09445, over 21458.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3334, pruned_loss=0.09701, over 4268772.76 frames. ], batch size: 194, lr: 8.44e-03, grad_scale: 16.0 2023-06-20 14:02:10,414 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:03:04,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=570324.0, ans=0.0 2023-06-20 14:03:59,266 INFO [train.py:996] (2/4) Epoch 4, batch 3600, loss[loss=0.238, simple_loss=0.2959, pruned_loss=0.09001, over 21853.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3289, pruned_loss=0.09699, over 4271829.85 frames. ], batch size: 317, lr: 8.44e-03, grad_scale: 32.0 2023-06-20 14:04:25,642 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.783e+02 3.214e+02 3.773e+02 6.590e+02, threshold=6.428e+02, percent-clipped=0.0 2023-06-20 14:05:02,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=570624.0, ans=0.0 2023-06-20 14:05:53,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=570744.0, ans=0.2 2023-06-20 14:05:55,919 INFO [train.py:996] (2/4) Epoch 4, batch 3650, loss[loss=0.2062, simple_loss=0.2955, pruned_loss=0.05842, over 21763.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3283, pruned_loss=0.097, over 4269751.93 frames. ], batch size: 247, lr: 8.44e-03, grad_scale: 16.0 2023-06-20 14:05:59,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=570804.0, ans=0.1 2023-06-20 14:06:34,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=570864.0, ans=0.95 2023-06-20 14:07:13,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=570984.0, ans=0.125 2023-06-20 14:08:03,198 INFO [train.py:996] (2/4) Epoch 4, batch 3700, loss[loss=0.2899, simple_loss=0.3506, pruned_loss=0.1146, over 21608.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3281, pruned_loss=0.09633, over 4270602.99 frames. 
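The recurring [optim.py:471] records above report quartiles of recent per-batch gradient norms, a clipping threshold, and the percentage of batches clipped; in these records the threshold sits at roughly twice the median quartile (e.g. 2 x 3.377e+02 ~= 6.755e+02), consistent with Clipping_scale=2.0. The following minimal PyTorch sketch is not the icefall optim.py code; the function name, the window handling, and the 2x-median rule are assumptions used only to illustrate how such statistics could be produced.

```python
# Illustrative sketch only (not icefall's optim.py): given per-batch gradient norms
# collected over a recent window, report their quartiles, derive a clipping threshold
# as a multiple of the median, and count how often it is exceeded.
import torch

def grad_norm_stats(grad_norms: torch.Tensor, threshold_scale: float = 2.0):
    """grad_norms: 1-D tensor of recent per-batch gradient norms (collection not shown)."""
    quartiles = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = threshold_scale * quartiles[2]                 # 2x the median norm
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return quartiles, threshold, percent_clipped

# illustrative values only, shaped like the quartiles reported in the records above
norms = torch.tensor([216.1, 289.5, 337.7, 421.4, 836.4])
q, thr, pct = grad_norm_stats(norms)
print("grad-norm quartiles", [f"{v:.3e}" for v in q.tolist()],
      "threshold", f"{float(thr):.3e}", "percent-clipped", float(pct))
```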
], batch size: 471, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:08:09,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=571104.0, ans=0.0 2023-06-20 14:08:25,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=12.0 2023-06-20 14:08:41,029 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.519e+02 3.000e+02 3.598e+02 6.843e+02, threshold=6.000e+02, percent-clipped=1.0 2023-06-20 14:09:40,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=571344.0, ans=0.0 2023-06-20 14:09:51,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-20 14:10:13,611 INFO [train.py:996] (2/4) Epoch 4, batch 3750, loss[loss=0.1777, simple_loss=0.2529, pruned_loss=0.05123, over 21602.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3252, pruned_loss=0.09485, over 4278791.11 frames. ], batch size: 230, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:10:52,730 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:11:10,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=571524.0, ans=0.125 2023-06-20 14:11:19,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=571584.0, ans=0.125 2023-06-20 14:11:53,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-06-20 14:12:13,301 INFO [train.py:996] (2/4) Epoch 4, batch 3800, loss[loss=0.2534, simple_loss=0.317, pruned_loss=0.09487, over 21765.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.323, pruned_loss=0.09277, over 4284821.01 frames. ], batch size: 298, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:12:45,351 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.431e+02 2.892e+02 3.588e+02 7.450e+02, threshold=5.785e+02, percent-clipped=3.0 2023-06-20 14:13:02,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=571824.0, ans=0.0 2023-06-20 14:13:35,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=571944.0, ans=0.0 2023-06-20 14:13:36,119 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-20 14:13:50,977 INFO [train.py:996] (2/4) Epoch 4, batch 3850, loss[loss=0.2538, simple_loss=0.306, pruned_loss=0.1008, over 21661.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3206, pruned_loss=0.09276, over 4284508.28 frames. 
], batch size: 417, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:14:36,691 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:14:58,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=572124.0, ans=0.125 2023-06-20 14:15:32,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=572244.0, ans=0.125 2023-06-20 14:15:38,372 INFO [train.py:996] (2/4) Epoch 4, batch 3900, loss[loss=0.2767, simple_loss=0.327, pruned_loss=0.1132, over 21773.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3155, pruned_loss=0.09221, over 4281462.32 frames. ], batch size: 441, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:15:46,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=572304.0, ans=0.125 2023-06-20 14:15:58,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-20 14:16:07,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=572304.0, ans=0.0 2023-06-20 14:16:17,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.553e+02 2.993e+02 3.640e+02 7.151e+02, threshold=5.987e+02, percent-clipped=1.0 2023-06-20 14:16:37,596 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-20 14:16:49,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=572484.0, ans=0.1 2023-06-20 14:17:39,764 INFO [train.py:996] (2/4) Epoch 4, batch 3950, loss[loss=0.2071, simple_loss=0.3144, pruned_loss=0.04992, over 19688.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.315, pruned_loss=0.08978, over 4275198.24 frames. ], batch size: 703, lr: 8.42e-03, grad_scale: 16.0 2023-06-20 14:18:23,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.40 vs. limit=10.0 2023-06-20 14:18:40,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=572724.0, ans=0.2 2023-06-20 14:19:46,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=572904.0, ans=0.125 2023-06-20 14:19:47,313 INFO [train.py:996] (2/4) Epoch 4, batch 4000, loss[loss=0.2008, simple_loss=0.2649, pruned_loss=0.06833, over 21577.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3083, pruned_loss=0.08561, over 4271606.41 frames. 
], batch size: 263, lr: 8.42e-03, grad_scale: 32.0 2023-06-20 14:19:53,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=572904.0, ans=0.125 2023-06-20 14:19:58,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=572904.0, ans=0.2 2023-06-20 14:20:13,756 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.382e+02 2.703e+02 3.231e+02 4.558e+02, threshold=5.407e+02, percent-clipped=0.0 2023-06-20 14:20:14,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=572964.0, ans=0.1 2023-06-20 14:20:46,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=573084.0, ans=0.0 2023-06-20 14:20:56,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=573144.0, ans=0.125 2023-06-20 14:21:28,567 INFO [train.py:996] (2/4) Epoch 4, batch 4050, loss[loss=0.2222, simple_loss=0.2991, pruned_loss=0.07261, over 21825.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3072, pruned_loss=0.08442, over 4268867.26 frames. ], batch size: 118, lr: 8.42e-03, grad_scale: 32.0 2023-06-20 14:21:50,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.50 vs. limit=22.5 2023-06-20 14:22:15,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=573264.0, ans=0.0 2023-06-20 14:22:25,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=573324.0, ans=0.1 2023-06-20 14:22:25,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=573324.0, ans=0.125 2023-06-20 14:22:32,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=573324.0, ans=0.125 2023-06-20 14:22:36,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.79 vs. limit=22.5 2023-06-20 14:23:32,494 INFO [train.py:996] (2/4) Epoch 4, batch 4100, loss[loss=0.2425, simple_loss=0.317, pruned_loss=0.08402, over 21855.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3091, pruned_loss=0.08471, over 4267198.05 frames. ], batch size: 316, lr: 8.42e-03, grad_scale: 32.0 2023-06-20 14:23:49,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=573504.0, ans=0.125 2023-06-20 14:24:14,724 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.729e+02 3.089e+02 3.698e+02 6.441e+02, threshold=6.178e+02, percent-clipped=7.0 2023-06-20 14:24:15,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=573564.0, ans=0.2 2023-06-20 14:24:58,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=573684.0, ans=0.0 2023-06-20 14:25:27,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. 
limit=12.0 2023-06-20 14:25:28,424 INFO [train.py:996] (2/4) Epoch 4, batch 4150, loss[loss=0.2255, simple_loss=0.3023, pruned_loss=0.07431, over 21876.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3086, pruned_loss=0.08186, over 4270548.93 frames. ], batch size: 373, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:26:17,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=573924.0, ans=0.09899494936611666 2023-06-20 14:27:29,741 INFO [train.py:996] (2/4) Epoch 4, batch 4200, loss[loss=0.2457, simple_loss=0.3191, pruned_loss=0.08619, over 21594.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3108, pruned_loss=0.08257, over 4265523.13 frames. ], batch size: 414, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:27:39,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=574104.0, ans=0.125 2023-06-20 14:27:56,305 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 2.492e+02 2.868e+02 3.338e+02 5.959e+02, threshold=5.736e+02, percent-clipped=0.0 2023-06-20 14:27:56,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=574164.0, ans=0.125 2023-06-20 14:27:58,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=574164.0, ans=0.125 2023-06-20 14:27:58,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-20 14:29:23,352 INFO [train.py:996] (2/4) Epoch 4, batch 4250, loss[loss=0.2815, simple_loss=0.3503, pruned_loss=0.1063, over 21318.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3156, pruned_loss=0.08383, over 4271103.55 frames. ], batch size: 176, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:29:24,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-20 14:29:33,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.85 vs. limit=10.0 2023-06-20 14:29:58,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=574464.0, ans=0.125 2023-06-20 14:30:14,777 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:30:44,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=574584.0, ans=0.0 2023-06-20 14:31:09,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=574584.0, ans=0.125 2023-06-20 14:31:33,584 INFO [train.py:996] (2/4) Epoch 4, batch 4300, loss[loss=0.2272, simple_loss=0.2616, pruned_loss=0.09643, over 20053.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3233, pruned_loss=0.08722, over 4270837.76 frames. 
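The [scaling.py:962] Whitening records above compare a per-module statistic against a (scheduled) limit, flagging modules whose activations are far from white. The sketch below is hedged: the exact metric computed in scaling.py is not reproduced here, and the eigenvalue-ratio statistic is only one plausible choice of such a measure.

```python
# Hedged sketch; the actual metric in scaling.py may differ. One plausible
# "how far from white" statistic (an assumption): the ratio of the largest
# eigenvalue of the channel covariance to the mean eigenvalue. Perfectly
# whitened channels would give a ratio near 1; a "metric=X vs. limit=Y"
# record would then flag modules whose statistic exceeds the limit.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations collected from one module."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]                 # channel covariance
    eigvals = torch.linalg.eigvalsh(cov)         # real, ascending
    return float(eigvals[-1] / eigvals.mean().clamp(min=1e-20))

feats = torch.randn(2000, 256)                   # near-white input -> small metric
print(f"metric={whitening_metric(feats):.2f} vs. limit=15.0")
```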
], batch size: 704, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:32:18,443 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.663e+02 3.157e+02 3.844e+02 7.898e+02, threshold=6.314e+02, percent-clipped=3.0 2023-06-20 14:32:31,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=15.0 2023-06-20 14:32:34,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.42 vs. limit=15.0 2023-06-20 14:33:14,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=574944.0, ans=0.1 2023-06-20 14:33:35,746 INFO [train.py:996] (2/4) Epoch 4, batch 4350, loss[loss=0.2414, simple_loss=0.3022, pruned_loss=0.09026, over 21796.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3198, pruned_loss=0.08528, over 4262790.50 frames. ], batch size: 371, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:33:42,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=575004.0, ans=0.0 2023-06-20 14:34:23,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=575124.0, ans=0.125 2023-06-20 14:34:30,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=575124.0, ans=0.125 2023-06-20 14:35:27,075 INFO [train.py:996] (2/4) Epoch 4, batch 4400, loss[loss=0.2257, simple_loss=0.3152, pruned_loss=0.06806, over 21594.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.316, pruned_loss=0.08407, over 4266443.87 frames. ], batch size: 263, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:35:27,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=575304.0, ans=0.0 2023-06-20 14:35:46,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=575364.0, ans=0.1 2023-06-20 14:36:06,020 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.625e+02 3.087e+02 3.539e+02 6.162e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-20 14:37:17,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=575544.0, ans=22.5 2023-06-20 14:37:37,830 INFO [train.py:996] (2/4) Epoch 4, batch 4450, loss[loss=0.2688, simple_loss=0.3346, pruned_loss=0.1015, over 21773.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3262, pruned_loss=0.08767, over 4271653.92 frames. ], batch size: 124, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:38:38,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=575784.0, ans=0.125 2023-06-20 14:39:22,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=575844.0, ans=0.125 2023-06-20 14:39:25,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=575844.0, ans=0.125 2023-06-20 14:39:35,194 INFO [train.py:996] (2/4) Epoch 4, batch 4500, loss[loss=0.3729, simple_loss=0.4845, pruned_loss=0.1306, over 19706.00 frames. 
], tot_loss[loss=0.2541, simple_loss=0.3285, pruned_loss=0.08989, over 4276861.52 frames. ], batch size: 702, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:39:45,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=575904.0, ans=0.125 2023-06-20 14:40:07,218 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.623e+02 2.937e+02 3.558e+02 5.301e+02, threshold=5.874e+02, percent-clipped=0.0 2023-06-20 14:40:23,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=575964.0, ans=0.125 2023-06-20 14:40:35,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=576024.0, ans=0.125 2023-06-20 14:41:22,689 INFO [train.py:996] (2/4) Epoch 4, batch 4550, loss[loss=0.2905, simple_loss=0.3636, pruned_loss=0.1087, over 21623.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3326, pruned_loss=0.09078, over 4276214.33 frames. ], batch size: 389, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:41:54,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0 2023-06-20 14:42:25,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.20 vs. limit=6.0 2023-06-20 14:42:42,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=576384.0, ans=0.95 2023-06-20 14:43:23,257 INFO [train.py:996] (2/4) Epoch 4, batch 4600, loss[loss=0.2272, simple_loss=0.2988, pruned_loss=0.07786, over 21280.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3341, pruned_loss=0.09291, over 4280071.14 frames. ], batch size: 176, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:43:49,989 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.820e+02 3.441e+02 4.090e+02 6.361e+02, threshold=6.882e+02, percent-clipped=1.0 2023-06-20 14:43:50,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=576564.0, ans=0.125 2023-06-20 14:44:44,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=576744.0, ans=0.0 2023-06-20 14:44:47,326 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-20 14:45:01,221 INFO [train.py:996] (2/4) Epoch 4, batch 4650, loss[loss=0.1379, simple_loss=0.1985, pruned_loss=0.03869, over 16248.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3256, pruned_loss=0.09027, over 4282879.60 frames. ], batch size: 61, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:45:06,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. 
limit=15.0 2023-06-20 14:45:33,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=576864.0, ans=0.5 2023-06-20 14:45:52,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=576924.0, ans=0.0 2023-06-20 14:45:54,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=576924.0, ans=0.1 2023-06-20 14:45:57,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=576984.0, ans=0.125 2023-06-20 14:46:12,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=576984.0, ans=0.0 2023-06-20 14:46:36,961 INFO [train.py:996] (2/4) Epoch 4, batch 4700, loss[loss=0.2209, simple_loss=0.2812, pruned_loss=0.08028, over 22010.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3156, pruned_loss=0.08705, over 4282846.86 frames. ], batch size: 103, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:46:46,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=577104.0, ans=0.125 2023-06-20 14:47:09,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.385e+02 2.859e+02 3.485e+02 5.797e+02, threshold=5.718e+02, percent-clipped=0.0 2023-06-20 14:47:10,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-20 14:47:33,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=577284.0, ans=0.0 2023-06-20 14:47:36,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=577284.0, ans=10.0 2023-06-20 14:48:05,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=577344.0, ans=0.125 2023-06-20 14:48:14,348 INFO [train.py:996] (2/4) Epoch 4, batch 4750, loss[loss=0.2357, simple_loss=0.2943, pruned_loss=0.08854, over 21522.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3107, pruned_loss=0.08766, over 4288465.45 frames. ], batch size: 548, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:48:16,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=577404.0, ans=0.0 2023-06-20 14:48:34,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=577464.0, ans=0.0 2023-06-20 14:48:47,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=577464.0, ans=0.2 2023-06-20 14:49:06,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-20 14:50:08,457 INFO [train.py:996] (2/4) Epoch 4, batch 4800, loss[loss=0.2326, simple_loss=0.2818, pruned_loss=0.09172, over 20262.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3092, pruned_loss=0.08757, over 4283970.37 frames. 
], batch size: 703, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:50:45,674 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.623e+02 2.992e+02 3.457e+02 8.061e+02, threshold=5.984e+02, percent-clipped=2.0 2023-06-20 14:51:09,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=577824.0, ans=0.125 2023-06-20 14:51:21,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-20 14:51:33,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=577884.0, ans=0.125 2023-06-20 14:52:02,774 INFO [train.py:996] (2/4) Epoch 4, batch 4850, loss[loss=0.2438, simple_loss=0.3122, pruned_loss=0.08769, over 21818.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3089, pruned_loss=0.08682, over 4283658.59 frames. ], batch size: 351, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:52:16,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=578004.0, ans=0.2 2023-06-20 14:53:13,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=578184.0, ans=0.0 2023-06-20 14:53:41,183 INFO [train.py:996] (2/4) Epoch 4, batch 4900, loss[loss=0.3195, simple_loss=0.3714, pruned_loss=0.1338, over 21513.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.311, pruned_loss=0.08804, over 4279299.73 frames. ], batch size: 471, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:54:10,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=578364.0, ans=0.2 2023-06-20 14:54:12,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-20 14:54:13,435 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.490e+02 2.860e+02 3.460e+02 5.530e+02, threshold=5.721e+02, percent-clipped=0.0 2023-06-20 14:54:17,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=578364.0, ans=0.125 2023-06-20 14:54:42,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=578424.0, ans=0.0 2023-06-20 14:55:01,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=578484.0, ans=0.125 2023-06-20 14:55:27,091 INFO [train.py:996] (2/4) Epoch 4, batch 4950, loss[loss=0.2105, simple_loss=0.297, pruned_loss=0.062, over 21200.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3145, pruned_loss=0.08627, over 4277915.16 frames. ], batch size: 159, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:57:04,933 INFO [train.py:996] (2/4) Epoch 4, batch 5000, loss[loss=0.1868, simple_loss=0.2506, pruned_loss=0.06148, over 15594.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3138, pruned_loss=0.083, over 4266557.69 frames. 
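Each [scaling.py:182] ScheduledFloat record above is a readout of a scalar hyperparameter (a dropout probability, skip rate, scale_min, etc.) whose value depends on the current batch_count. The sketch below is not icefall's ScheduledFloat class and its breakpoints are invented; it only illustrates the kind of piecewise-linear schedule such readouts suggest.

```python
# Sketch under assumptions (not icefall's ScheduledFloat; breakpoints below are made up):
# a value defined by (batch_count, value) breakpoints, linearly interpolated in between
# and held constant outside the covered range.
import bisect

class PiecewiseLinearSchedule:
    def __init__(self, points):
        # points: list of (batch_count, value) pairs, assumed sorted by batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a skip rate that ramps from 0.5 down to 0.0 early in training (breakpoints made up)
ff2_skip_rate = PiecewiseLinearSchedule([(0.0, 0.5), (4000.0, 0.05), (16000.0, 0.0)])
print(ff2_skip_rate(578184.0))   # far past the last breakpoint -> 0.0, as in a record above
```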
], batch size: 60, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:57:08,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=578904.0, ans=0.125 2023-06-20 14:57:25,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=578964.0, ans=0.1 2023-06-20 14:57:31,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-20 14:57:31,401 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.437e+02 2.852e+02 3.653e+02 5.289e+02, threshold=5.704e+02, percent-clipped=0.0 2023-06-20 14:57:37,827 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:57:42,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=579024.0, ans=0.0 2023-06-20 14:57:50,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=579024.0, ans=0.035 2023-06-20 14:58:00,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=579084.0, ans=0.125 2023-06-20 14:58:15,975 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.16 vs. limit=22.5 2023-06-20 14:58:36,503 INFO [train.py:996] (2/4) Epoch 4, batch 5050, loss[loss=0.2437, simple_loss=0.3051, pruned_loss=0.09114, over 21260.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3157, pruned_loss=0.08442, over 4267761.68 frames. ], batch size: 176, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:59:10,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=579264.0, ans=0.125 2023-06-20 15:00:06,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=579444.0, ans=0.125 2023-06-20 15:00:17,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=579504.0, ans=0.1 2023-06-20 15:00:18,918 INFO [train.py:996] (2/4) Epoch 4, batch 5100, loss[loss=0.2596, simple_loss=0.3229, pruned_loss=0.09814, over 21697.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3151, pruned_loss=0.08442, over 4271914.59 frames. ], batch size: 112, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:00:27,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.84 vs. limit=6.0 2023-06-20 15:00:29,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.36 vs. limit=22.5 2023-06-20 15:00:45,742 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.522e+02 2.846e+02 3.197e+02 5.068e+02, threshold=5.691e+02, percent-clipped=0.0 2023-06-20 15:01:00,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. 
limit=15.0 2023-06-20 15:01:44,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=579744.0, ans=0.125 2023-06-20 15:01:51,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=579744.0, ans=0.125 2023-06-20 15:01:57,157 INFO [train.py:996] (2/4) Epoch 4, batch 5150, loss[loss=0.2509, simple_loss=0.3078, pruned_loss=0.09697, over 21322.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3148, pruned_loss=0.08601, over 4280513.62 frames. ], batch size: 143, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:02:34,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=579924.0, ans=10.0 2023-06-20 15:02:41,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=579924.0, ans=0.0 2023-06-20 15:02:53,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=579984.0, ans=0.0 2023-06-20 15:02:53,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.79 vs. limit=15.0 2023-06-20 15:03:25,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-06-20 15:03:34,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=580104.0, ans=0.2 2023-06-20 15:03:35,257 INFO [train.py:996] (2/4) Epoch 4, batch 5200, loss[loss=0.2487, simple_loss=0.3389, pruned_loss=0.07928, over 21714.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3172, pruned_loss=0.08675, over 4282016.62 frames. ], batch size: 247, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:04:01,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.678e+02 3.286e+02 3.852e+02 6.243e+02, threshold=6.571e+02, percent-clipped=1.0 2023-06-20 15:05:21,823 INFO [train.py:996] (2/4) Epoch 4, batch 5250, loss[loss=0.2689, simple_loss=0.35, pruned_loss=0.09391, over 21779.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.32, pruned_loss=0.08534, over 4280070.75 frames. ], batch size: 371, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:05:38,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=580404.0, ans=0.1 2023-06-20 15:05:44,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=580464.0, ans=0.0 2023-06-20 15:06:01,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=12.0 2023-06-20 15:06:11,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. 
limit=15.0 2023-06-20 15:06:31,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=580584.0, ans=0.125 2023-06-20 15:06:32,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=580584.0, ans=0.125 2023-06-20 15:06:47,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=580644.0, ans=0.2 2023-06-20 15:06:59,110 INFO [train.py:996] (2/4) Epoch 4, batch 5300, loss[loss=0.2529, simple_loss=0.3173, pruned_loss=0.09425, over 21882.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.32, pruned_loss=0.08636, over 4290918.04 frames. ], batch size: 332, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:07:01,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-20 15:07:07,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=580704.0, ans=0.015 2023-06-20 15:07:14,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=580704.0, ans=0.125 2023-06-20 15:07:19,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-06-20 15:07:25,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.445e+02 2.866e+02 3.418e+02 4.898e+02, threshold=5.732e+02, percent-clipped=0.0 2023-06-20 15:08:21,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=580944.0, ans=0.04949747468305833 2023-06-20 15:08:35,817 INFO [train.py:996] (2/4) Epoch 4, batch 5350, loss[loss=0.2965, simple_loss=0.3332, pruned_loss=0.1299, over 21809.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3202, pruned_loss=0.08692, over 4281235.12 frames. ], batch size: 508, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:09:13,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=581064.0, ans=0.125 2023-06-20 15:09:16,329 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:09:29,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.34 vs. limit=15.0 2023-06-20 15:10:06,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=581184.0, ans=0.125 2023-06-20 15:10:10,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=581244.0, ans=0.125 2023-06-20 15:10:16,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=581244.0, ans=0.2 2023-06-20 15:10:26,851 INFO [train.py:996] (2/4) Epoch 4, batch 5400, loss[loss=0.2169, simple_loss=0.2965, pruned_loss=0.06863, over 21834.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3183, pruned_loss=0.08749, over 4283655.35 frames. 
], batch size: 351, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:10:59,006 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 2.501e+02 3.019e+02 3.777e+02 8.074e+02, threshold=6.038e+02, percent-clipped=3.0 2023-06-20 15:12:15,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=581544.0, ans=0.035 2023-06-20 15:12:18,267 INFO [train.py:996] (2/4) Epoch 4, batch 5450, loss[loss=0.2513, simple_loss=0.3628, pruned_loss=0.06991, over 21759.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3174, pruned_loss=0.08533, over 4285346.68 frames. ], batch size: 298, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:12:40,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=581664.0, ans=0.125 2023-06-20 15:12:45,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=581664.0, ans=0.125 2023-06-20 15:12:58,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=581724.0, ans=0.125 2023-06-20 15:13:12,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=581724.0, ans=15.0 2023-06-20 15:13:56,317 INFO [train.py:996] (2/4) Epoch 4, batch 5500, loss[loss=0.2107, simple_loss=0.3191, pruned_loss=0.05115, over 21164.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3218, pruned_loss=0.08212, over 4282404.66 frames. ], batch size: 548, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:14:09,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=581904.0, ans=0.125 2023-06-20 15:14:28,844 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 2.356e+02 2.658e+02 3.357e+02 7.374e+02, threshold=5.315e+02, percent-clipped=2.0 2023-06-20 15:15:02,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=582084.0, ans=0.2 2023-06-20 15:15:20,221 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.90 vs. limit=6.0 2023-06-20 15:15:41,308 INFO [train.py:996] (2/4) Epoch 4, batch 5550, loss[loss=0.1831, simple_loss=0.2651, pruned_loss=0.05057, over 21094.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.32, pruned_loss=0.0794, over 4274785.07 frames. ], batch size: 159, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:16:23,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=582324.0, ans=0.125 2023-06-20 15:16:29,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=582324.0, ans=0.125 2023-06-20 15:17:07,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=582444.0, ans=0.125 2023-06-20 15:17:25,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=582504.0, ans=0.125 2023-06-20 15:17:26,386 INFO [train.py:996] (2/4) Epoch 4, batch 5600, loss[loss=0.3005, simple_loss=0.3882, pruned_loss=0.1064, over 21679.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3182, pruned_loss=0.07686, over 4267579.26 frames. 
], batch size: 389, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:17:34,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=582504.0, ans=0.125 2023-06-20 15:17:37,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=582504.0, ans=0.0 2023-06-20 15:17:37,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=582504.0, ans=0.0 2023-06-20 15:17:54,168 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 2.189e+02 2.611e+02 3.161e+02 5.286e+02, threshold=5.221e+02, percent-clipped=0.0 2023-06-20 15:17:56,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=582564.0, ans=0.125 2023-06-20 15:17:57,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=582564.0, ans=0.0 2023-06-20 15:18:05,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=582624.0, ans=0.125 2023-06-20 15:18:17,888 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:18:48,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=582744.0, ans=0.0 2023-06-20 15:19:02,455 INFO [train.py:996] (2/4) Epoch 4, batch 5650, loss[loss=0.2313, simple_loss=0.3011, pruned_loss=0.08074, over 21788.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3216, pruned_loss=0.07912, over 4268397.25 frames. ], batch size: 298, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:19:12,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=582804.0, ans=0.0 2023-06-20 15:19:25,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=582864.0, ans=0.125 2023-06-20 15:19:27,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=582864.0, ans=0.125 2023-06-20 15:19:36,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=582924.0, ans=0.125 2023-06-20 15:20:02,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=582984.0, ans=0.125 2023-06-20 15:20:40,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=583044.0, ans=0.125 2023-06-20 15:20:44,353 INFO [train.py:996] (2/4) Epoch 4, batch 5700, loss[loss=0.2917, simple_loss=0.3697, pruned_loss=0.1068, over 21592.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3207, pruned_loss=0.08202, over 4277508.11 frames. 
], batch size: 509, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:21:12,331 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.371e+02 2.953e+02 3.348e+02 5.178e+02, threshold=5.907e+02, percent-clipped=0.0 2023-06-20 15:21:13,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=583164.0, ans=0.0 2023-06-20 15:22:11,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=583284.0, ans=0.1 2023-06-20 15:22:33,100 INFO [train.py:996] (2/4) Epoch 4, batch 5750, loss[loss=0.1932, simple_loss=0.2837, pruned_loss=0.05139, over 21642.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3157, pruned_loss=0.07858, over 4267319.23 frames. ], batch size: 247, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:23:06,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=583464.0, ans=0.1 2023-06-20 15:24:12,031 INFO [train.py:996] (2/4) Epoch 4, batch 5800, loss[loss=0.2425, simple_loss=0.3329, pruned_loss=0.07601, over 21772.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3125, pruned_loss=0.07646, over 4256481.45 frames. ], batch size: 282, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:24:46,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.366e+02 2.813e+02 3.634e+02 6.586e+02, threshold=5.626e+02, percent-clipped=4.0 2023-06-20 15:25:17,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=583824.0, ans=0.125 2023-06-20 15:25:36,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=583944.0, ans=0.125 2023-06-20 15:26:02,448 INFO [train.py:996] (2/4) Epoch 4, batch 5850, loss[loss=0.2403, simple_loss=0.3205, pruned_loss=0.08006, over 21475.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3101, pruned_loss=0.07251, over 4269067.19 frames. ], batch size: 507, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:26:40,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=584124.0, ans=0.0 2023-06-20 15:26:52,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=584124.0, ans=0.1 2023-06-20 15:26:58,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584184.0, ans=0.1 2023-06-20 15:27:00,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-20 15:27:04,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=584184.0, ans=0.035 2023-06-20 15:27:10,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=584184.0, ans=0.1 2023-06-20 15:27:39,551 INFO [train.py:996] (2/4) Epoch 4, batch 5900, loss[loss=0.2583, simple_loss=0.3002, pruned_loss=0.1082, over 20234.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3029, pruned_loss=0.06799, over 4264787.33 frames. 
], batch size: 703, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:27:56,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=584304.0, ans=0.2 2023-06-20 15:28:10,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 2.239e+02 2.663e+02 3.390e+02 4.720e+02, threshold=5.325e+02, percent-clipped=0.0 2023-06-20 15:29:16,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=12.0 2023-06-20 15:29:18,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=584544.0, ans=0.2 2023-06-20 15:29:26,778 INFO [train.py:996] (2/4) Epoch 4, batch 5950, loss[loss=0.2028, simple_loss=0.2837, pruned_loss=0.06099, over 21763.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3037, pruned_loss=0.07057, over 4275894.92 frames. ], batch size: 247, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:29:42,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-20 15:30:07,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=584724.0, ans=0.125 2023-06-20 15:30:10,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=584724.0, ans=0.0 2023-06-20 15:31:13,852 INFO [train.py:996] (2/4) Epoch 4, batch 6000, loss[loss=0.2234, simple_loss=0.2863, pruned_loss=0.08024, over 21794.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3008, pruned_loss=0.07448, over 4273127.71 frames. ], batch size: 102, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:31:13,853 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 15:32:19,089 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2612, simple_loss=0.3595, pruned_loss=0.08138, over 1796401.00 frames. 2023-06-20 15:32:19,091 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 15:32:52,773 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.636e+02 3.112e+02 3.720e+02 8.461e+02, threshold=6.223e+02, percent-clipped=3.0 2023-06-20 15:33:12,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=585024.0, ans=0.0 2023-06-20 15:33:48,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=585144.0, ans=0.04949747468305833 2023-06-20 15:33:59,944 INFO [train.py:996] (2/4) Epoch 4, batch 6050, loss[loss=0.1774, simple_loss=0.2584, pruned_loss=0.0482, over 21294.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2965, pruned_loss=0.07551, over 4264680.64 frames. ], batch size: 176, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:35:00,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=585324.0, ans=0.125 2023-06-20 15:35:15,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=585324.0, ans=0.2 2023-06-20 15:36:02,798 INFO [train.py:996] (2/4) Epoch 4, batch 6100, loss[loss=0.2322, simple_loss=0.3019, pruned_loss=0.08126, over 21937.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2953, pruned_loss=0.0743, over 4276763.65 frames. 
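The [train.py:996] records above pair a per-batch figure, loss[... over N frames], with a running figure, tot_loss[... over M frames], accumulated across the batches seen so far; the periodic [train.py:1028] records report the same quantities over the validation set. The sketch below is not the actual train.py bookkeeping; it shows the simplest frame-weighted running average consistent with that readout (a decayed variant that down-weights old batches would behave similarly but is not shown).

```python
# Minimal sketch (not train.py itself): keep running sums of loss * frames and of frames,
# so the reported tot_loss is a frame-weighted average over the batches seen so far.
class RunningLoss:
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
tracker.update(0.2322, 21937.0)   # the batch 6100 record shown just above
print(f"tot_loss={tracker.tot_loss:.4f} over {tracker.frames:.2f} frames")
```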
], batch size: 316, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:36:22,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-20 15:36:24,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=585504.0, ans=0.1 2023-06-20 15:36:43,684 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.490e+02 2.900e+02 3.284e+02 4.880e+02, threshold=5.799e+02, percent-clipped=0.0 2023-06-20 15:37:00,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=585624.0, ans=0.1 2023-06-20 15:37:41,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=585684.0, ans=0.125 2023-06-20 15:38:11,427 INFO [train.py:996] (2/4) Epoch 4, batch 6150, loss[loss=0.1944, simple_loss=0.2681, pruned_loss=0.06033, over 21168.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2982, pruned_loss=0.07698, over 4283776.48 frames. ], batch size: 176, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:38:26,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=585804.0, ans=0.1 2023-06-20 15:38:41,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=585864.0, ans=0.2 2023-06-20 15:39:26,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=585984.0, ans=0.125 2023-06-20 15:39:37,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=586044.0, ans=0.125 2023-06-20 15:40:11,444 INFO [train.py:996] (2/4) Epoch 4, batch 6200, loss[loss=0.2236, simple_loss=0.2761, pruned_loss=0.08558, over 21218.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3008, pruned_loss=0.07805, over 4276717.20 frames. ], batch size: 608, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:40:29,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=586104.0, ans=0.04949747468305833 2023-06-20 15:40:39,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.393e+02 2.716e+02 3.174e+02 4.783e+02, threshold=5.432e+02, percent-clipped=0.0 2023-06-20 15:40:43,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=586164.0, ans=0.2 2023-06-20 15:41:08,199 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. limit=10.0 2023-06-20 15:41:11,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-20 15:41:15,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=586284.0, ans=0.125 2023-06-20 15:42:09,652 INFO [train.py:996] (2/4) Epoch 4, batch 6250, loss[loss=0.232, simple_loss=0.3314, pruned_loss=0.06626, over 21617.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3063, pruned_loss=0.07859, over 4277791.85 frames. 
], batch size: 230, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 15:42:19,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-20 15:42:47,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=586464.0, ans=0.125 2023-06-20 15:42:54,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=586464.0, ans=0.125 2023-06-20 15:43:02,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-20 15:44:11,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-06-20 15:44:16,073 INFO [train.py:996] (2/4) Epoch 4, batch 6300, loss[loss=0.2241, simple_loss=0.296, pruned_loss=0.07612, over 21674.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3099, pruned_loss=0.07725, over 4282718.75 frames. ], batch size: 263, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 15:44:22,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=586704.0, ans=0.0 2023-06-20 15:44:22,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=586704.0, ans=0.0 2023-06-20 15:44:48,355 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.254e+02 2.717e+02 3.393e+02 6.074e+02, threshold=5.434e+02, percent-clipped=2.0 2023-06-20 15:45:52,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=587004.0, ans=0.125 2023-06-20 15:45:53,402 INFO [train.py:996] (2/4) Epoch 4, batch 6350, loss[loss=0.2011, simple_loss=0.3169, pruned_loss=0.04268, over 20834.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3145, pruned_loss=0.08108, over 4282933.47 frames. ], batch size: 607, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 15:46:17,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2023-06-20 15:46:30,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587064.0, ans=0.1 2023-06-20 15:46:58,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=587124.0, ans=0.125 2023-06-20 15:47:21,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-20 15:48:07,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=587304.0, ans=0.125 2023-06-20 15:48:08,583 INFO [train.py:996] (2/4) Epoch 4, batch 6400, loss[loss=0.2938, simple_loss=0.3563, pruned_loss=0.1157, over 21348.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3229, pruned_loss=0.08605, over 4274076.50 frames. 
], batch size: 176, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 15:48:13,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587304.0, ans=0.1 2023-06-20 15:48:36,213 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.822e+02 3.385e+02 3.962e+02 5.879e+02, threshold=6.771e+02, percent-clipped=4.0 2023-06-20 15:48:46,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=587364.0, ans=10.0 2023-06-20 15:49:42,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=587484.0, ans=0.2 2023-06-20 15:49:45,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=587484.0, ans=0.0 2023-06-20 15:49:51,499 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-20 15:49:59,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=587544.0, ans=0.125 2023-06-20 15:50:01,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=587544.0, ans=0.05 2023-06-20 15:50:07,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=587544.0, ans=0.125 2023-06-20 15:50:16,759 INFO [train.py:996] (2/4) Epoch 4, batch 6450, loss[loss=0.2199, simple_loss=0.2954, pruned_loss=0.07219, over 21684.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3263, pruned_loss=0.08658, over 4269831.18 frames. ], batch size: 332, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 15:50:43,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587664.0, ans=0.1 2023-06-20 15:50:58,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=587724.0, ans=0.1 2023-06-20 15:51:39,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=587844.0, ans=0.2 2023-06-20 15:51:40,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=587844.0, ans=0.0 2023-06-20 15:51:52,946 INFO [train.py:996] (2/4) Epoch 4, batch 6500, loss[loss=0.2712, simple_loss=0.3473, pruned_loss=0.09758, over 21402.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.318, pruned_loss=0.08552, over 4261219.34 frames. ], batch size: 471, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 15:52:18,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=587964.0, ans=0.0 2023-06-20 15:52:19,373 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.385e+02 2.722e+02 3.333e+02 5.165e+02, threshold=5.444e+02, percent-clipped=0.0 2023-06-20 15:52:55,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.09 vs. 
limit=6.0 2023-06-20 15:52:56,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=588024.0, ans=0.0 2023-06-20 15:53:11,846 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-06-20 15:53:27,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=588144.0, ans=0.125 2023-06-20 15:53:44,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=588144.0, ans=0.1 2023-06-20 15:53:49,904 INFO [train.py:996] (2/4) Epoch 4, batch 6550, loss[loss=0.2167, simple_loss=0.2913, pruned_loss=0.07104, over 21601.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3147, pruned_loss=0.08359, over 4266824.10 frames. ], batch size: 230, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 15:54:09,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=588264.0, ans=0.125 2023-06-20 15:54:28,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=588324.0, ans=0.05 2023-06-20 15:54:48,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-20 15:54:58,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=588384.0, ans=0.125 2023-06-20 15:55:24,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=588444.0, ans=0.0 2023-06-20 15:55:33,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=588444.0, ans=0.125 2023-06-20 15:55:39,397 INFO [train.py:996] (2/4) Epoch 4, batch 6600, loss[loss=0.2187, simple_loss=0.2839, pruned_loss=0.07677, over 21633.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3095, pruned_loss=0.08317, over 4266088.95 frames. ], batch size: 298, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 15:55:51,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=588504.0, ans=0.0 2023-06-20 15:56:09,322 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.413e+02 2.653e+02 3.142e+02 5.278e+02, threshold=5.306e+02, percent-clipped=0.0 2023-06-20 15:56:29,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=588624.0, ans=0.125 2023-06-20 15:57:15,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-06-20 15:57:20,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=588744.0, ans=0.0 2023-06-20 15:57:31,180 INFO [train.py:996] (2/4) Epoch 4, batch 6650, loss[loss=0.1993, simple_loss=0.2684, pruned_loss=0.06506, over 21780.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3035, pruned_loss=0.07968, over 4265149.03 frames. 
], batch size: 317, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 15:57:51,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=588864.0, ans=0.2 2023-06-20 15:59:08,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=589044.0, ans=0.125 2023-06-20 15:59:13,815 INFO [train.py:996] (2/4) Epoch 4, batch 6700, loss[loss=0.1951, simple_loss=0.2544, pruned_loss=0.06785, over 16745.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2993, pruned_loss=0.07919, over 4247131.63 frames. ], batch size: 63, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 15:59:46,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.73 vs. limit=15.0 2023-06-20 15:59:54,876 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.379e+02 2.739e+02 3.210e+02 5.153e+02, threshold=5.478e+02, percent-clipped=0.0 2023-06-20 16:00:21,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=589284.0, ans=0.1 2023-06-20 16:00:27,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=589284.0, ans=10.0 2023-06-20 16:00:38,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=589284.0, ans=0.125 2023-06-20 16:00:42,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=589344.0, ans=12.0 2023-06-20 16:00:48,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-20 16:00:59,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=22.5 2023-06-20 16:01:03,413 INFO [train.py:996] (2/4) Epoch 4, batch 6750, loss[loss=0.2548, simple_loss=0.3006, pruned_loss=0.1045, over 21385.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2978, pruned_loss=0.0804, over 4259781.92 frames. ], batch size: 473, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 16:01:04,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=15.0 2023-06-20 16:01:13,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.19 vs. limit=5.0 2023-06-20 16:01:58,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=589584.0, ans=0.0 2023-06-20 16:02:07,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=589584.0, ans=0.0 2023-06-20 16:02:36,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.54 vs. limit=15.0 2023-06-20 16:02:39,945 INFO [train.py:996] (2/4) Epoch 4, batch 6800, loss[loss=0.2471, simple_loss=0.3089, pruned_loss=0.09265, over 21353.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2996, pruned_loss=0.08224, over 4262216.68 frames. 
], batch size: 143, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 16:02:51,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-20 16:03:02,165 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.637e+02 2.976e+02 3.448e+02 7.542e+02, threshold=5.952e+02, percent-clipped=4.0 2023-06-20 16:04:16,751 INFO [train.py:996] (2/4) Epoch 4, batch 6850, loss[loss=0.23, simple_loss=0.2801, pruned_loss=0.08994, over 21343.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.2977, pruned_loss=0.08356, over 4268090.60 frames. ], batch size: 177, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 16:04:26,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=590004.0, ans=0.125 2023-06-20 16:04:41,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=590064.0, ans=0.0 2023-06-20 16:05:00,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=590124.0, ans=0.125 2023-06-20 16:05:14,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.77 vs. limit=15.0 2023-06-20 16:05:15,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=590184.0, ans=0.0 2023-06-20 16:05:55,773 INFO [train.py:996] (2/4) Epoch 4, batch 6900, loss[loss=0.2011, simple_loss=0.2928, pruned_loss=0.05471, over 21617.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.2997, pruned_loss=0.08375, over 4273077.69 frames. ], batch size: 263, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 16:05:59,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=590304.0, ans=0.0 2023-06-20 16:06:01,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=590304.0, ans=0.125 2023-06-20 16:06:02,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=590304.0, ans=0.2 2023-06-20 16:06:31,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=590364.0, ans=0.2 2023-06-20 16:06:37,563 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:06:47,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 2.466e+02 2.910e+02 3.426e+02 7.067e+02, threshold=5.820e+02, percent-clipped=2.0 2023-06-20 16:06:49,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=590364.0, ans=0.0 2023-06-20 16:07:50,461 INFO [train.py:996] (2/4) Epoch 4, batch 6950, loss[loss=0.2647, simple_loss=0.3342, pruned_loss=0.09765, over 21664.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3023, pruned_loss=0.08222, over 4270154.06 frames. ], batch size: 351, lr: 8.29e-03, grad_scale: 32.0 2023-06-20 16:08:22,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.41 vs. 
limit=10.0 2023-06-20 16:09:15,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-20 16:09:24,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=590844.0, ans=0.125 2023-06-20 16:09:31,661 INFO [train.py:996] (2/4) Epoch 4, batch 7000, loss[loss=0.2536, simple_loss=0.3175, pruned_loss=0.09488, over 15778.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3058, pruned_loss=0.08499, over 4264437.22 frames. ], batch size: 61, lr: 8.29e-03, grad_scale: 32.0 2023-06-20 16:09:36,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=590904.0, ans=15.0 2023-06-20 16:10:04,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.556e+02 2.969e+02 3.627e+02 5.556e+02, threshold=5.939e+02, percent-clipped=0.0 2023-06-20 16:10:15,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=591024.0, ans=0.125 2023-06-20 16:11:04,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=591144.0, ans=0.125 2023-06-20 16:11:13,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=591144.0, ans=0.125 2023-06-20 16:11:31,290 INFO [train.py:996] (2/4) Epoch 4, batch 7050, loss[loss=0.2049, simple_loss=0.2657, pruned_loss=0.07204, over 21793.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3023, pruned_loss=0.08302, over 4273041.80 frames. ], batch size: 102, lr: 8.29e-03, grad_scale: 32.0 2023-06-20 16:11:50,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=591204.0, ans=0.125 2023-06-20 16:12:05,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=591264.0, ans=0.125 2023-06-20 16:12:11,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=591324.0, ans=0.125 2023-06-20 16:12:14,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=591324.0, ans=0.2 2023-06-20 16:13:25,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=22.5 2023-06-20 16:13:35,494 INFO [train.py:996] (2/4) Epoch 4, batch 7100, loss[loss=0.2457, simple_loss=0.3245, pruned_loss=0.08349, over 21804.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3068, pruned_loss=0.08422, over 4268636.66 frames. 
], batch size: 371, lr: 8.29e-03, grad_scale: 32.0 2023-06-20 16:13:57,793 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.233e+02 2.662e+02 3.259e+02 5.350e+02, threshold=5.324e+02, percent-clipped=0.0 2023-06-20 16:14:10,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=591624.0, ans=0.125 2023-06-20 16:14:40,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=591684.0, ans=0.125 2023-06-20 16:15:16,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=591744.0, ans=0.2 2023-06-20 16:15:18,612 INFO [train.py:996] (2/4) Epoch 4, batch 7150, loss[loss=0.2555, simple_loss=0.362, pruned_loss=0.07444, over 19781.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3028, pruned_loss=0.0814, over 4261890.89 frames. ], batch size: 703, lr: 8.29e-03, grad_scale: 32.0 2023-06-20 16:15:40,709 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-20 16:16:04,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=591924.0, ans=0.125 2023-06-20 16:16:13,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=12.0 2023-06-20 16:16:17,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=591984.0, ans=0.125 2023-06-20 16:16:37,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=592044.0, ans=0.125 2023-06-20 16:17:04,870 INFO [train.py:996] (2/4) Epoch 4, batch 7200, loss[loss=0.2455, simple_loss=0.3004, pruned_loss=0.09534, over 21568.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3086, pruned_loss=0.08551, over 4263789.14 frames. ], batch size: 415, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 16:17:05,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=592104.0, ans=0.125 2023-06-20 16:17:27,114 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 2.565e+02 2.895e+02 3.560e+02 7.126e+02, threshold=5.790e+02, percent-clipped=7.0 2023-06-20 16:18:04,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=592284.0, ans=0.1 2023-06-20 16:18:09,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-20 16:18:37,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=592344.0, ans=0.2 2023-06-20 16:18:41,923 INFO [train.py:996] (2/4) Epoch 4, batch 7250, loss[loss=0.1968, simple_loss=0.2503, pruned_loss=0.07166, over 21282.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.303, pruned_loss=0.08537, over 4270051.23 frames. 
], batch size: 551, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 16:19:08,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=592464.0, ans=0.125 2023-06-20 16:19:31,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=592524.0, ans=0.125 2023-06-20 16:19:32,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=592524.0, ans=0.0 2023-06-20 16:20:06,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=592644.0, ans=0.125 2023-06-20 16:20:19,821 INFO [train.py:996] (2/4) Epoch 4, batch 7300, loss[loss=0.2224, simple_loss=0.28, pruned_loss=0.08237, over 21888.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2967, pruned_loss=0.0837, over 4272495.90 frames. ], batch size: 373, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 16:20:25,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=592704.0, ans=0.0 2023-06-20 16:20:53,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.476e+02 2.833e+02 3.500e+02 5.020e+02, threshold=5.666e+02, percent-clipped=0.0 2023-06-20 16:20:55,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=592764.0, ans=0.125 2023-06-20 16:21:03,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=592824.0, ans=0.125 2023-06-20 16:21:04,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=592824.0, ans=0.02 2023-06-20 16:21:50,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-20 16:21:58,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=22.5 2023-06-20 16:22:03,835 INFO [train.py:996] (2/4) Epoch 4, batch 7350, loss[loss=0.2216, simple_loss=0.2932, pruned_loss=0.07505, over 21101.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2945, pruned_loss=0.084, over 4271819.77 frames. ], batch size: 607, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 16:22:54,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=593124.0, ans=0.0 2023-06-20 16:23:20,518 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.37 vs. limit=6.0 2023-06-20 16:23:23,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=593184.0, ans=0.125 2023-06-20 16:23:25,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=593184.0, ans=0.125 2023-06-20 16:23:53,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=593244.0, ans=0.0 2023-06-20 16:23:56,341 INFO [train.py:996] (2/4) Epoch 4, batch 7400, loss[loss=0.2364, simple_loss=0.3174, pruned_loss=0.07774, over 21919.00 frames. 
], tot_loss[loss=0.2357, simple_loss=0.2999, pruned_loss=0.08576, over 4264472.27 frames. ], batch size: 317, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 16:23:58,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=593304.0, ans=0.0 2023-06-20 16:24:24,070 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.858e+02 3.325e+02 3.971e+02 5.288e+02, threshold=6.650e+02, percent-clipped=0.0 2023-06-20 16:24:28,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.28 vs. limit=6.0 2023-06-20 16:24:33,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=593424.0, ans=0.1 2023-06-20 16:24:59,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=593484.0, ans=0.125 2023-06-20 16:25:16,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=593544.0, ans=0.0 2023-06-20 16:25:33,487 INFO [train.py:996] (2/4) Epoch 4, batch 7450, loss[loss=0.2234, simple_loss=0.2875, pruned_loss=0.07961, over 21420.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.2983, pruned_loss=0.0848, over 4258272.50 frames. ], batch size: 131, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 16:25:58,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=593664.0, ans=0.125 2023-06-20 16:26:03,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-20 16:27:10,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=593784.0, ans=0.125 2023-06-20 16:27:34,242 INFO [train.py:996] (2/4) Epoch 4, batch 7500, loss[loss=0.2589, simple_loss=0.343, pruned_loss=0.08745, over 21242.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.304, pruned_loss=0.08681, over 4263198.96 frames. ], batch size: 176, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 16:28:09,126 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.540e+02 2.828e+02 3.392e+02 7.342e+02, threshold=5.655e+02, percent-clipped=3.0 2023-06-20 16:28:31,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-20 16:29:13,484 INFO [train.py:996] (2/4) Epoch 4, batch 7550, loss[loss=0.2334, simple_loss=0.3283, pruned_loss=0.06921, over 21648.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3114, pruned_loss=0.08516, over 4272111.84 frames. ], batch size: 414, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 16:29:14,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=15.0 2023-06-20 16:29:18,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=594204.0, ans=0.125 2023-06-20 16:29:50,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=594264.0, ans=0.125 2023-06-20 16:30:05,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-20 16:30:07,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=22.5 2023-06-20 16:30:46,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=594444.0, ans=0.0 2023-06-20 16:30:49,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-20 16:30:50,004 INFO [train.py:996] (2/4) Epoch 4, batch 7600, loss[loss=0.2772, simple_loss=0.3361, pruned_loss=0.1091, over 21833.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3115, pruned_loss=0.08394, over 4279531.04 frames. ], batch size: 441, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 16:31:17,500 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.434e+02 2.853e+02 3.264e+02 4.790e+02, threshold=5.707e+02, percent-clipped=0.0 2023-06-20 16:31:32,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=594624.0, ans=0.125 2023-06-20 16:31:38,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=594624.0, ans=0.125 2023-06-20 16:31:41,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=594624.0, ans=0.0 2023-06-20 16:31:44,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=594684.0, ans=0.05 2023-06-20 16:31:48,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=594684.0, ans=0.5 2023-06-20 16:32:25,974 INFO [train.py:996] (2/4) Epoch 4, batch 7650, loss[loss=0.2816, simple_loss=0.3268, pruned_loss=0.1182, over 21621.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3112, pruned_loss=0.0865, over 4283362.75 frames. ], batch size: 471, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 16:32:26,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=594804.0, ans=0.125 2023-06-20 16:32:47,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=594864.0, ans=0.125 2023-06-20 16:33:09,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=594924.0, ans=0.0 2023-06-20 16:33:20,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=594924.0, ans=0.125 2023-06-20 16:33:23,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=15.0 2023-06-20 16:33:31,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=594984.0, ans=0.1 2023-06-20 16:33:40,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=594984.0, ans=0.125 2023-06-20 16:34:04,134 INFO [train.py:996] (2/4) Epoch 4, batch 7700, loss[loss=0.2839, simple_loss=0.3522, pruned_loss=0.1078, over 21355.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3141, pruned_loss=0.08897, over 4288217.18 frames. ], batch size: 176, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 16:34:14,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.49 vs. limit=6.0 2023-06-20 16:34:40,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.502e+02 2.863e+02 3.403e+02 5.125e+02, threshold=5.726e+02, percent-clipped=0.0 2023-06-20 16:34:43,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=595164.0, ans=0.0 2023-06-20 16:34:48,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=595224.0, ans=0.125 2023-06-20 16:35:28,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=595284.0, ans=0.0 2023-06-20 16:35:46,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=595344.0, ans=0.2 2023-06-20 16:36:06,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=595344.0, ans=0.125 2023-06-20 16:36:09,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=595404.0, ans=0.2 2023-06-20 16:36:10,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-20 16:36:10,693 INFO [train.py:996] (2/4) Epoch 4, batch 7750, loss[loss=0.2159, simple_loss=0.3023, pruned_loss=0.06474, over 21222.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3219, pruned_loss=0.08966, over 4281323.93 frames. ], batch size: 176, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 16:36:26,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=595404.0, ans=0.0 2023-06-20 16:37:35,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=595584.0, ans=0.125 2023-06-20 16:37:37,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=22.41 vs. limit=15.0 2023-06-20 16:37:39,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.86 vs. limit=15.0 2023-06-20 16:38:19,749 INFO [train.py:996] (2/4) Epoch 4, batch 7800, loss[loss=0.2066, simple_loss=0.2542, pruned_loss=0.07947, over 20888.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3224, pruned_loss=0.09006, over 4267231.53 frames. 
], batch size: 612, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 16:38:49,534 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.887e+02 3.418e+02 4.293e+02 7.867e+02, threshold=6.836e+02, percent-clipped=4.0 2023-06-20 16:38:58,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2023-06-20 16:39:58,219 INFO [train.py:996] (2/4) Epoch 4, batch 7850, loss[loss=0.2161, simple_loss=0.2816, pruned_loss=0.07534, over 21682.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3174, pruned_loss=0.08902, over 4269567.27 frames. ], batch size: 333, lr: 8.26e-03, grad_scale: 16.0 2023-06-20 16:40:04,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=596004.0, ans=0.1 2023-06-20 16:40:09,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-20 16:40:19,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=596064.0, ans=0.05 2023-06-20 16:40:32,897 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:40:38,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=596124.0, ans=0.2 2023-06-20 16:41:18,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=596244.0, ans=0.125 2023-06-20 16:41:48,977 INFO [train.py:996] (2/4) Epoch 4, batch 7900, loss[loss=0.2919, simple_loss=0.3475, pruned_loss=0.1181, over 21425.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3133, pruned_loss=0.08818, over 4269879.34 frames. ], batch size: 471, lr: 8.26e-03, grad_scale: 16.0 2023-06-20 16:42:29,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-06-20 16:42:35,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.748e+02 3.368e+02 4.194e+02 8.125e+02, threshold=6.737e+02, percent-clipped=4.0 2023-06-20 16:43:15,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=596484.0, ans=0.0 2023-06-20 16:43:19,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=12.0 2023-06-20 16:43:56,411 INFO [train.py:996] (2/4) Epoch 4, batch 7950, loss[loss=0.2391, simple_loss=0.3219, pruned_loss=0.07812, over 21777.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3154, pruned_loss=0.08694, over 4269199.75 frames. ], batch size: 298, lr: 8.25e-03, grad_scale: 16.0 2023-06-20 16:44:23,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=596664.0, ans=0.0 2023-06-20 16:44:25,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.70 vs. limit=15.0 2023-06-20 16:44:27,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. 
limit=22.5 2023-06-20 16:44:38,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=596724.0, ans=0.0 2023-06-20 16:44:53,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=596724.0, ans=0.1 2023-06-20 16:45:37,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=596844.0, ans=0.125 2023-06-20 16:45:56,308 INFO [train.py:996] (2/4) Epoch 4, batch 8000, loss[loss=0.3162, simple_loss=0.3636, pruned_loss=0.1344, over 21427.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.319, pruned_loss=0.08994, over 4273501.53 frames. ], batch size: 471, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 16:46:04,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=596904.0, ans=0.0 2023-06-20 16:46:20,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-06-20 16:46:27,802 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.568e+02 2.923e+02 3.494e+02 7.833e+02, threshold=5.846e+02, percent-clipped=1.0 2023-06-20 16:46:59,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=597024.0, ans=0.125 2023-06-20 16:48:21,257 INFO [train.py:996] (2/4) Epoch 4, batch 8050, loss[loss=0.2006, simple_loss=0.2495, pruned_loss=0.0758, over 21809.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3218, pruned_loss=0.08989, over 4274676.69 frames. ], batch size: 118, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 16:48:25,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-20 16:48:39,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=597264.0, ans=0.0 2023-06-20 16:50:00,268 INFO [train.py:996] (2/4) Epoch 4, batch 8100, loss[loss=0.2302, simple_loss=0.303, pruned_loss=0.07872, over 21293.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3204, pruned_loss=0.09027, over 4275746.54 frames. ], batch size: 143, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 16:50:00,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=597504.0, ans=0.0 2023-06-20 16:50:04,131 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:50:08,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=597504.0, ans=0.2 2023-06-20 16:50:34,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.23 vs. 
limit=15.0 2023-06-20 16:50:38,477 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 2.976e+02 3.712e+02 5.223e+02 1.010e+03, threshold=7.424e+02, percent-clipped=11.0 2023-06-20 16:51:00,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=597624.0, ans=0.125 2023-06-20 16:51:06,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=597684.0, ans=0.125 2023-06-20 16:51:41,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=597744.0, ans=0.0 2023-06-20 16:52:09,559 INFO [train.py:996] (2/4) Epoch 4, batch 8150, loss[loss=0.2294, simple_loss=0.3138, pruned_loss=0.07254, over 21646.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3295, pruned_loss=0.09114, over 4278625.11 frames. ], batch size: 247, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 16:52:22,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-20 16:52:31,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=597864.0, ans=0.125 2023-06-20 16:52:32,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=597864.0, ans=0.125 2023-06-20 16:52:56,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-20 16:53:36,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=598044.0, ans=0.125 2023-06-20 16:53:43,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=598044.0, ans=0.0 2023-06-20 16:53:50,585 INFO [train.py:996] (2/4) Epoch 4, batch 8200, loss[loss=0.2505, simple_loss=0.3143, pruned_loss=0.0933, over 21780.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3207, pruned_loss=0.08816, over 4280224.69 frames. ], batch size: 98, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 16:54:23,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=598164.0, ans=0.125 2023-06-20 16:54:23,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.93 vs. limit=15.0 2023-06-20 16:54:36,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.510e+02 2.906e+02 3.709e+02 6.115e+02, threshold=5.811e+02, percent-clipped=0.0 2023-06-20 16:54:38,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=598164.0, ans=0.125 2023-06-20 16:55:30,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=598344.0, ans=0.1 2023-06-20 16:55:54,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.02 vs. limit=12.0 2023-06-20 16:55:54,976 INFO [train.py:996] (2/4) Epoch 4, batch 8250, loss[loss=0.2558, simple_loss=0.3687, pruned_loss=0.07147, over 20812.00 frames. 
], tot_loss[loss=0.2469, simple_loss=0.3187, pruned_loss=0.0876, over 4279236.87 frames. ], batch size: 607, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 16:56:37,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.87 vs. limit=22.5 2023-06-20 16:56:45,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-20 16:57:18,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=598644.0, ans=0.125 2023-06-20 16:57:23,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=598644.0, ans=0.125 2023-06-20 16:57:38,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-06-20 16:57:38,826 INFO [train.py:996] (2/4) Epoch 4, batch 8300, loss[loss=0.2204, simple_loss=0.2896, pruned_loss=0.07557, over 21545.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3152, pruned_loss=0.08496, over 4276697.76 frames. ], batch size: 195, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 16:57:46,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=598704.0, ans=0.125 2023-06-20 16:57:48,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=598704.0, ans=0.0 2023-06-20 16:58:06,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=598764.0, ans=0.035 2023-06-20 16:58:09,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.374e+02 2.802e+02 3.318e+02 5.012e+02, threshold=5.604e+02, percent-clipped=0.0 2023-06-20 16:58:39,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=598884.0, ans=0.125 2023-06-20 16:59:10,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=598944.0, ans=0.125 2023-06-20 16:59:15,928 INFO [train.py:996] (2/4) Epoch 4, batch 8350, loss[loss=0.2238, simple_loss=0.3021, pruned_loss=0.07276, over 21536.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3128, pruned_loss=0.08311, over 4278377.01 frames. ], batch size: 230, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 17:00:42,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=22.5 2023-06-20 17:00:54,171 INFO [train.py:996] (2/4) Epoch 4, batch 8400, loss[loss=0.2573, simple_loss=0.3347, pruned_loss=0.08993, over 20745.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3098, pruned_loss=0.08051, over 4266852.59 frames. 
], batch size: 607, lr: 8.23e-03, grad_scale: 32.0 2023-06-20 17:01:19,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=599364.0, ans=0.0 2023-06-20 17:01:24,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.360e+02 2.679e+02 3.034e+02 4.807e+02, threshold=5.358e+02, percent-clipped=0.0 2023-06-20 17:01:29,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=599424.0, ans=0.2 2023-06-20 17:02:38,952 INFO [train.py:996] (2/4) Epoch 4, batch 8450, loss[loss=0.2406, simple_loss=0.3051, pruned_loss=0.08802, over 21839.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3097, pruned_loss=0.08042, over 4274772.26 frames. ], batch size: 298, lr: 8.23e-03, grad_scale: 32.0 2023-06-20 17:03:34,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-20 17:04:03,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=599784.0, ans=0.2 2023-06-20 17:04:34,202 INFO [train.py:996] (2/4) Epoch 4, batch 8500, loss[loss=0.2558, simple_loss=0.3181, pruned_loss=0.09677, over 21702.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3076, pruned_loss=0.08209, over 4274751.88 frames. ], batch size: 351, lr: 8.23e-03, grad_scale: 32.0 2023-06-20 17:04:56,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-20 17:04:57,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=599964.0, ans=0.125 2023-06-20 17:05:01,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-20 17:05:09,683 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 2.562e+02 2.810e+02 3.301e+02 4.951e+02, threshold=5.621e+02, percent-clipped=0.0 2023-06-20 17:05:37,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=600084.0, ans=0.0 2023-06-20 17:06:17,599 INFO [train.py:996] (2/4) Epoch 4, batch 8550, loss[loss=0.2436, simple_loss=0.3169, pruned_loss=0.08515, over 21261.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3113, pruned_loss=0.08434, over 4267149.77 frames. ], batch size: 159, lr: 8.23e-03, grad_scale: 32.0 2023-06-20 17:06:51,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=12.0 2023-06-20 17:07:43,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=600444.0, ans=0.035 2023-06-20 17:07:59,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=600444.0, ans=0.125 2023-06-20 17:08:03,051 INFO [train.py:996] (2/4) Epoch 4, batch 8600, loss[loss=0.2619, simple_loss=0.3893, pruned_loss=0.06725, over 20779.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3194, pruned_loss=0.08706, over 4276208.45 frames. 
], batch size: 607, lr: 8.23e-03, grad_scale: 32.0 2023-06-20 17:08:03,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=600504.0, ans=0.125 2023-06-20 17:08:49,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.757e+02 3.185e+02 4.135e+02 6.803e+02, threshold=6.371e+02, percent-clipped=9.0 2023-06-20 17:09:07,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=600624.0, ans=0.2 2023-06-20 17:09:12,432 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:09:13,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=600624.0, ans=0.125 2023-06-20 17:10:01,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.91 vs. limit=15.0 2023-06-20 17:10:02,025 INFO [train.py:996] (2/4) Epoch 4, batch 8650, loss[loss=0.2045, simple_loss=0.277, pruned_loss=0.06599, over 21267.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3263, pruned_loss=0.08784, over 4275570.38 frames. ], batch size: 131, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 17:10:30,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=600864.0, ans=0.04949747468305833 2023-06-20 17:10:33,938 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.21 vs. limit=10.0 2023-06-20 17:11:03,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=600984.0, ans=0.09899494936611666 2023-06-20 17:11:38,714 INFO [train.py:996] (2/4) Epoch 4, batch 8700, loss[loss=0.2143, simple_loss=0.2762, pruned_loss=0.07618, over 15801.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3195, pruned_loss=0.08496, over 4262120.17 frames. ], batch size: 64, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 17:11:45,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=601104.0, ans=0.1 2023-06-20 17:12:09,311 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 2.376e+02 2.840e+02 3.647e+02 6.545e+02, threshold=5.680e+02, percent-clipped=1.0 2023-06-20 17:12:31,718 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.12 vs. limit=15.0 2023-06-20 17:12:45,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=601224.0, ans=0.1 2023-06-20 17:12:46,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=601284.0, ans=0.125 2023-06-20 17:13:24,750 INFO [train.py:996] (2/4) Epoch 4, batch 8750, loss[loss=0.2599, simple_loss=0.3242, pruned_loss=0.09774, over 21898.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3127, pruned_loss=0.08491, over 4270415.36 frames. 
], batch size: 351, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 17:14:34,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=601584.0, ans=0.125 2023-06-20 17:14:59,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=601644.0, ans=0.125 2023-06-20 17:15:02,029 INFO [train.py:996] (2/4) Epoch 4, batch 8800, loss[loss=0.2896, simple_loss=0.3562, pruned_loss=0.1115, over 21246.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3228, pruned_loss=0.08869, over 4272890.97 frames. ], batch size: 176, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 17:15:18,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-20 17:15:42,506 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.769e+02 3.106e+02 3.590e+02 5.947e+02, threshold=6.211e+02, percent-clipped=1.0 2023-06-20 17:16:47,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-20 17:16:59,659 INFO [train.py:996] (2/4) Epoch 4, batch 8850, loss[loss=0.2673, simple_loss=0.3669, pruned_loss=0.08392, over 19841.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3293, pruned_loss=0.0907, over 4268363.20 frames. ], batch size: 702, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 17:18:07,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=602124.0, ans=0.125 2023-06-20 17:18:30,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=602244.0, ans=0.0 2023-06-20 17:18:40,459 INFO [train.py:996] (2/4) Epoch 4, batch 8900, loss[loss=0.3215, simple_loss=0.4388, pruned_loss=0.1021, over 19749.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3266, pruned_loss=0.08982, over 4263781.81 frames. ], batch size: 702, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 17:18:42,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=602304.0, ans=0.5 2023-06-20 17:18:59,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=602304.0, ans=0.125 2023-06-20 17:19:30,531 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.687e+02 3.012e+02 3.727e+02 9.034e+02, threshold=6.025e+02, percent-clipped=1.0 2023-06-20 17:19:54,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=602424.0, ans=0.0 2023-06-20 17:20:40,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=602544.0, ans=0.1 2023-06-20 17:20:55,166 INFO [train.py:996] (2/4) Epoch 4, batch 8950, loss[loss=0.2144, simple_loss=0.2809, pruned_loss=0.07396, over 21418.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3271, pruned_loss=0.08881, over 4266449.17 frames. 
], batch size: 194, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 17:21:30,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=602664.0, ans=0.125 2023-06-20 17:22:07,022 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:22:08,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=602784.0, ans=0.125 2023-06-20 17:22:35,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=602844.0, ans=0.1 2023-06-20 17:22:38,301 INFO [train.py:996] (2/4) Epoch 4, batch 9000, loss[loss=0.2375, simple_loss=0.3062, pruned_loss=0.08438, over 21625.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3206, pruned_loss=0.0881, over 4272638.21 frames. ], batch size: 332, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 17:22:38,301 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 17:23:29,203 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2733, simple_loss=0.3656, pruned_loss=0.09047, over 1796401.00 frames. 2023-06-20 17:23:29,205 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 17:23:55,093 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.920e+02 3.392e+02 4.044e+02 7.869e+02, threshold=6.783e+02, percent-clipped=2.0 2023-06-20 17:24:02,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=603024.0, ans=0.125 2023-06-20 17:24:08,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=603024.0, ans=0.125 2023-06-20 17:24:31,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=603084.0, ans=0.95 2023-06-20 17:24:46,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=603144.0, ans=0.1 2023-06-20 17:25:06,942 INFO [train.py:996] (2/4) Epoch 4, batch 9050, loss[loss=0.2563, simple_loss=0.3231, pruned_loss=0.09471, over 21476.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3164, pruned_loss=0.08439, over 4275529.22 frames. ], batch size: 194, lr: 8.21e-03, grad_scale: 16.0 2023-06-20 17:25:07,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=603204.0, ans=0.0 2023-06-20 17:25:19,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=603204.0, ans=0.125 2023-06-20 17:25:27,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=603264.0, ans=0.125 2023-06-20 17:27:03,498 INFO [train.py:996] (2/4) Epoch 4, batch 9100, loss[loss=0.2507, simple_loss=0.3415, pruned_loss=0.07996, over 21566.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.321, pruned_loss=0.08617, over 4277161.42 frames. 
], batch size: 389, lr: 8.21e-03, grad_scale: 16.0 2023-06-20 17:27:09,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=603504.0, ans=0.2 2023-06-20 17:27:22,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=603504.0, ans=0.125 2023-06-20 17:27:53,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=603564.0, ans=0.125 2023-06-20 17:27:57,742 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.581e+02 3.087e+02 3.555e+02 5.100e+02, threshold=6.174e+02, percent-clipped=0.0 2023-06-20 17:28:14,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=603624.0, ans=0.125 2023-06-20 17:28:42,679 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:29:08,292 INFO [train.py:996] (2/4) Epoch 4, batch 9150, loss[loss=0.2403, simple_loss=0.3247, pruned_loss=0.07797, over 21576.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3228, pruned_loss=0.08401, over 4281954.65 frames. ], batch size: 230, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 17:30:11,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=603924.0, ans=0.125 2023-06-20 17:30:20,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=603924.0, ans=0.2 2023-06-20 17:30:27,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.85 vs. limit=15.0 2023-06-20 17:30:39,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-20 17:30:55,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=604044.0, ans=0.125 2023-06-20 17:30:57,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=604044.0, ans=0.1 2023-06-20 17:31:03,050 INFO [train.py:996] (2/4) Epoch 4, batch 9200, loss[loss=0.2711, simple_loss=0.3455, pruned_loss=0.0984, over 21747.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3261, pruned_loss=0.08389, over 4282009.54 frames. ], batch size: 332, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:31:15,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=604104.0, ans=0.0 2023-06-20 17:31:47,387 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.497e+02 2.858e+02 3.581e+02 5.930e+02, threshold=5.716e+02, percent-clipped=0.0 2023-06-20 17:31:57,007 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. 
limit=10.0 2023-06-20 17:32:15,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=604284.0, ans=0.0 2023-06-20 17:32:18,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=604284.0, ans=0.0 2023-06-20 17:32:25,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=604284.0, ans=0.0 2023-06-20 17:32:37,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=604344.0, ans=0.125 2023-06-20 17:32:45,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=22.5 2023-06-20 17:32:54,608 INFO [train.py:996] (2/4) Epoch 4, batch 9250, loss[loss=0.228, simple_loss=0.2911, pruned_loss=0.08247, over 21662.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3295, pruned_loss=0.087, over 4269268.89 frames. ], batch size: 282, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:33:04,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=604404.0, ans=0.125 2023-06-20 17:33:44,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-20 17:33:47,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=604524.0, ans=0.125 2023-06-20 17:34:38,708 INFO [train.py:996] (2/4) Epoch 4, batch 9300, loss[loss=0.242, simple_loss=0.3146, pruned_loss=0.08468, over 21844.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3248, pruned_loss=0.08708, over 4270989.42 frames. ], batch size: 317, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:34:45,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=604704.0, ans=0.125 2023-06-20 17:34:47,795 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-20 17:35:22,926 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.268e+02 3.884e+02 4.664e+02 7.347e+02, threshold=7.768e+02, percent-clipped=7.0 2023-06-20 17:35:56,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=604884.0, ans=0.125 2023-06-20 17:36:18,033 INFO [train.py:996] (2/4) Epoch 4, batch 9350, loss[loss=0.258, simple_loss=0.3322, pruned_loss=0.09192, over 21606.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3284, pruned_loss=0.08842, over 4269949.65 frames. ], batch size: 263, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:36:49,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=605064.0, ans=0.125 2023-06-20 17:37:23,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=605184.0, ans=0.0 2023-06-20 17:37:40,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. 
limit=15.0 2023-06-20 17:38:01,948 INFO [train.py:996] (2/4) Epoch 4, batch 9400, loss[loss=0.2259, simple_loss=0.2836, pruned_loss=0.08414, over 21596.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3303, pruned_loss=0.08861, over 4270793.82 frames. ], batch size: 247, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:38:16,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.29 vs. limit=15.0 2023-06-20 17:38:35,632 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.508e+02 2.883e+02 3.302e+02 5.601e+02, threshold=5.767e+02, percent-clipped=0.0 2023-06-20 17:38:38,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=12.0 2023-06-20 17:38:56,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=605484.0, ans=0.125 2023-06-20 17:39:01,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=605484.0, ans=0.1 2023-06-20 17:39:45,258 INFO [train.py:996] (2/4) Epoch 4, batch 9450, loss[loss=0.2131, simple_loss=0.2761, pruned_loss=0.07502, over 21803.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3215, pruned_loss=0.0871, over 4265966.69 frames. ], batch size: 317, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:40:48,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=605784.0, ans=0.0 2023-06-20 17:41:18,636 INFO [train.py:996] (2/4) Epoch 4, batch 9500, loss[loss=0.2212, simple_loss=0.3025, pruned_loss=0.06998, over 21663.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3134, pruned_loss=0.08485, over 4271437.36 frames. ], batch size: 263, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:41:29,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=605904.0, ans=0.1 2023-06-20 17:41:53,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.513e+02 2.842e+02 3.687e+02 6.250e+02, threshold=5.685e+02, percent-clipped=3.0 2023-06-20 17:43:05,085 INFO [train.py:996] (2/4) Epoch 4, batch 9550, loss[loss=0.2578, simple_loss=0.3443, pruned_loss=0.08564, over 21758.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3173, pruned_loss=0.08672, over 4271330.44 frames. ], batch size: 247, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:43:16,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=606204.0, ans=0.2 2023-06-20 17:43:52,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=606324.0, ans=0.0 2023-06-20 17:44:14,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=606384.0, ans=0.0 2023-06-20 17:44:32,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=606444.0, ans=0.04949747468305833 2023-06-20 17:44:55,268 INFO [train.py:996] (2/4) Epoch 4, batch 9600, loss[loss=0.2574, simple_loss=0.3187, pruned_loss=0.09799, over 21381.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3194, pruned_loss=0.08921, over 4276752.44 frames. 
], batch size: 143, lr: 8.19e-03, grad_scale: 32.0 2023-06-20 17:44:57,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=606504.0, ans=0.04949747468305833 2023-06-20 17:45:18,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=606564.0, ans=0.125 2023-06-20 17:45:29,028 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.596e+02 2.958e+02 3.550e+02 7.860e+02, threshold=5.916e+02, percent-clipped=4.0 2023-06-20 17:45:35,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=606624.0, ans=0.125 2023-06-20 17:45:38,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=606624.0, ans=0.125 2023-06-20 17:45:41,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=606624.0, ans=0.0 2023-06-20 17:46:28,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=606744.0, ans=0.125 2023-06-20 17:46:32,678 INFO [train.py:996] (2/4) Epoch 4, batch 9650, loss[loss=0.2545, simple_loss=0.3183, pruned_loss=0.09534, over 21601.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3195, pruned_loss=0.08893, over 4281800.53 frames. ], batch size: 508, lr: 8.18e-03, grad_scale: 32.0 2023-06-20 17:46:44,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=606804.0, ans=0.035 2023-06-20 17:47:00,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=606864.0, ans=0.125 2023-06-20 17:47:12,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=606924.0, ans=0.125 2023-06-20 17:47:19,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=606924.0, ans=0.0 2023-06-20 17:48:18,428 INFO [train.py:996] (2/4) Epoch 4, batch 9700, loss[loss=0.2311, simple_loss=0.3064, pruned_loss=0.07791, over 21712.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3227, pruned_loss=0.08954, over 4279454.26 frames. ], batch size: 247, lr: 8.18e-03, grad_scale: 32.0 2023-06-20 17:48:31,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=607104.0, ans=0.125 2023-06-20 17:48:39,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-20 17:48:52,040 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.445e+02 2.765e+02 3.359e+02 4.931e+02, threshold=5.531e+02, percent-clipped=0.0 2023-06-20 17:49:04,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=607224.0, ans=0.1 2023-06-20 17:49:56,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=607344.0, ans=0.125 2023-06-20 17:50:13,094 INFO [train.py:996] (2/4) Epoch 4, batch 9750, loss[loss=0.1888, simple_loss=0.2522, pruned_loss=0.06266, over 21478.00 frames. 
], tot_loss[loss=0.2464, simple_loss=0.3163, pruned_loss=0.08823, over 4278483.32 frames. ], batch size: 212, lr: 8.18e-03, grad_scale: 32.0 2023-06-20 17:50:34,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=607464.0, ans=0.0 2023-06-20 17:51:01,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=607584.0, ans=0.125 2023-06-20 17:51:29,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=22.5 2023-06-20 17:51:29,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=607644.0, ans=0.0 2023-06-20 17:51:33,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0 2023-06-20 17:51:43,074 INFO [train.py:996] (2/4) Epoch 4, batch 9800, loss[loss=0.2433, simple_loss=0.3284, pruned_loss=0.07914, over 21492.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3148, pruned_loss=0.088, over 4271876.76 frames. ], batch size: 131, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 17:52:11,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=607764.0, ans=0.2 2023-06-20 17:52:18,891 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.616e+02 2.976e+02 3.497e+02 5.489e+02, threshold=5.952e+02, percent-clipped=0.0 2023-06-20 17:52:41,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=607884.0, ans=0.125 2023-06-20 17:53:19,582 INFO [train.py:996] (2/4) Epoch 4, batch 9850, loss[loss=0.231, simple_loss=0.2916, pruned_loss=0.08516, over 21883.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.312, pruned_loss=0.08821, over 4276256.79 frames. ], batch size: 371, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 17:53:37,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=608004.0, ans=0.0 2023-06-20 17:54:10,592 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=22.5 2023-06-20 17:54:11,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=608184.0, ans=0.0 2023-06-20 17:54:56,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=608244.0, ans=0.125 2023-06-20 17:55:00,048 INFO [train.py:996] (2/4) Epoch 4, batch 9900, loss[loss=0.2061, simple_loss=0.2747, pruned_loss=0.06878, over 21900.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3086, pruned_loss=0.08765, over 4273115.56 frames. 
], batch size: 107, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 17:55:28,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=608364.0, ans=0.125 2023-06-20 17:55:34,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=608364.0, ans=0.0 2023-06-20 17:55:38,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.623e+02 2.955e+02 3.593e+02 6.748e+02, threshold=5.910e+02, percent-clipped=3.0 2023-06-20 17:56:10,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=608484.0, ans=0.1 2023-06-20 17:56:45,772 INFO [train.py:996] (2/4) Epoch 4, batch 9950, loss[loss=0.2633, simple_loss=0.3176, pruned_loss=0.1045, over 21092.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3134, pruned_loss=0.09047, over 4277414.54 frames. ], batch size: 143, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 17:56:46,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=608604.0, ans=0.125 2023-06-20 17:56:55,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-20 17:56:57,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=608604.0, ans=0.04949747468305833 2023-06-20 17:57:01,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=608604.0, ans=0.125 2023-06-20 17:57:20,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=608724.0, ans=0.125 2023-06-20 17:57:36,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=608784.0, ans=0.0 2023-06-20 17:58:26,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=608844.0, ans=0.125 2023-06-20 17:58:36,076 INFO [train.py:996] (2/4) Epoch 4, batch 10000, loss[loss=0.2057, simple_loss=0.2765, pruned_loss=0.06747, over 21297.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.309, pruned_loss=0.08884, over 4276052.13 frames. ], batch size: 176, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 17:58:53,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=608964.0, ans=0.0 2023-06-20 17:59:06,191 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.693e+02 3.176e+02 3.942e+02 5.803e+02, threshold=6.352e+02, percent-clipped=0.0 2023-06-20 18:00:12,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=609144.0, ans=0.0 2023-06-20 18:00:30,871 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-20 18:00:33,092 INFO [train.py:996] (2/4) Epoch 4, batch 10050, loss[loss=0.2058, simple_loss=0.2868, pruned_loss=0.06242, over 21873.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3113, pruned_loss=0.08906, over 4277699.52 frames. 
], batch size: 317, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 18:00:36,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=609204.0, ans=0.2 2023-06-20 18:00:41,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=609204.0, ans=0.125 2023-06-20 18:00:47,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=609264.0, ans=0.125 2023-06-20 18:02:11,255 INFO [train.py:996] (2/4) Epoch 4, batch 10100, loss[loss=0.2654, simple_loss=0.3341, pruned_loss=0.09836, over 21766.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3086, pruned_loss=0.08663, over 4276587.27 frames. ], batch size: 332, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 18:02:29,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=609564.0, ans=0.125 2023-06-20 18:02:48,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.616e+02 3.061e+02 3.552e+02 5.046e+02, threshold=6.121e+02, percent-clipped=0.0 2023-06-20 18:02:55,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-06-20 18:03:40,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=609744.0, ans=0.125 2023-06-20 18:03:54,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=609744.0, ans=0.125 2023-06-20 18:03:56,771 INFO [train.py:996] (2/4) Epoch 4, batch 10150, loss[loss=0.197, simple_loss=0.2743, pruned_loss=0.05989, over 21023.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3145, pruned_loss=0.08928, over 4278335.31 frames. ], batch size: 608, lr: 8.16e-03, grad_scale: 32.0 2023-06-20 18:04:10,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=609864.0, ans=0.125 2023-06-20 18:05:15,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-20 18:05:28,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=610044.0, ans=0.125 2023-06-20 18:05:34,318 INFO [train.py:996] (2/4) Epoch 4, batch 10200, loss[loss=0.2119, simple_loss=0.2997, pruned_loss=0.0621, over 21741.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3136, pruned_loss=0.08684, over 4276541.73 frames. ], batch size: 298, lr: 8.16e-03, grad_scale: 32.0 2023-06-20 18:05:39,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=610104.0, ans=0.0 2023-06-20 18:06:09,916 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.303e+02 2.650e+02 3.100e+02 6.273e+02, threshold=5.301e+02, percent-clipped=1.0 2023-06-20 18:06:12,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.54 vs. 
limit=15.0 2023-06-20 18:06:14,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=610224.0, ans=0.125 2023-06-20 18:06:26,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=610224.0, ans=0.2 2023-06-20 18:06:35,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=610284.0, ans=0.0 2023-06-20 18:06:36,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=610284.0, ans=0.1 2023-06-20 18:06:47,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-20 18:07:06,500 INFO [train.py:996] (2/4) Epoch 4, batch 10250, loss[loss=0.24, simple_loss=0.34, pruned_loss=0.07004, over 19963.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3074, pruned_loss=0.08078, over 4263133.43 frames. ], batch size: 702, lr: 8.16e-03, grad_scale: 32.0 2023-06-20 18:07:55,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=610464.0, ans=0.125 2023-06-20 18:08:17,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.20 vs. limit=10.0 2023-06-20 18:08:46,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-20 18:08:52,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=610644.0, ans=0.125 2023-06-20 18:08:59,189 INFO [train.py:996] (2/4) Epoch 4, batch 10300, loss[loss=0.2625, simple_loss=0.3445, pruned_loss=0.09022, over 21735.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3117, pruned_loss=0.08131, over 4272871.84 frames. ], batch size: 332, lr: 8.16e-03, grad_scale: 32.0 2023-06-20 18:09:44,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=610764.0, ans=0.125 2023-06-20 18:10:07,760 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 2.456e+02 2.901e+02 3.440e+02 5.624e+02, threshold=5.802e+02, percent-clipped=3.0 2023-06-20 18:10:24,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=15.0 2023-06-20 18:11:11,082 INFO [train.py:996] (2/4) Epoch 4, batch 10350, loss[loss=0.2377, simple_loss=0.3201, pruned_loss=0.07768, over 21192.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3147, pruned_loss=0.08219, over 4272242.14 frames. 
], batch size: 159, lr: 8.16e-03, grad_scale: 32.0 2023-06-20 18:11:39,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=611064.0, ans=0.07 2023-06-20 18:11:52,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=611064.0, ans=0.0 2023-06-20 18:12:21,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=611184.0, ans=0.0 2023-06-20 18:12:31,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=611244.0, ans=0.5 2023-06-20 18:12:54,575 INFO [train.py:996] (2/4) Epoch 4, batch 10400, loss[loss=0.2282, simple_loss=0.2692, pruned_loss=0.09358, over 20842.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3067, pruned_loss=0.08124, over 4271277.57 frames. ], batch size: 608, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 18:13:15,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=611364.0, ans=0.0 2023-06-20 18:13:29,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=611364.0, ans=0.125 2023-06-20 18:13:34,055 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:13:36,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.04 vs. limit=6.0 2023-06-20 18:13:36,681 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.614e+02 3.049e+02 3.745e+02 5.860e+02, threshold=6.098e+02, percent-clipped=1.0 2023-06-20 18:13:40,509 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:13:42,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.60 vs. limit=12.0 2023-06-20 18:14:04,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-20 18:14:13,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=611544.0, ans=0.125 2023-06-20 18:14:33,850 INFO [train.py:996] (2/4) Epoch 4, batch 10450, loss[loss=0.2546, simple_loss=0.3352, pruned_loss=0.08699, over 19978.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3122, pruned_loss=0.085, over 4276368.91 frames. 
], batch size: 704, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 18:14:35,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=611604.0, ans=0.0 2023-06-20 18:14:35,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=611604.0, ans=0.1 2023-06-20 18:16:06,921 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:16:46,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=611844.0, ans=0.125 2023-06-20 18:16:49,366 INFO [train.py:996] (2/4) Epoch 4, batch 10500, loss[loss=0.2229, simple_loss=0.2902, pruned_loss=0.07777, over 21765.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.311, pruned_loss=0.08402, over 4280221.81 frames. ], batch size: 351, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 18:17:18,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=611964.0, ans=0.0 2023-06-20 18:17:25,178 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.554e+02 2.960e+02 3.444e+02 4.861e+02, threshold=5.921e+02, percent-clipped=0.0 2023-06-20 18:17:58,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=612084.0, ans=0.0 2023-06-20 18:18:27,192 INFO [train.py:996] (2/4) Epoch 4, batch 10550, loss[loss=0.2185, simple_loss=0.2881, pruned_loss=0.07446, over 21865.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3066, pruned_loss=0.08411, over 4275689.82 frames. ], batch size: 98, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 18:18:57,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-20 18:19:00,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=612264.0, ans=0.2 2023-06-20 18:19:09,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=612324.0, ans=0.125 2023-06-20 18:19:16,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=612324.0, ans=0.05 2023-06-20 18:20:05,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=612504.0, ans=0.0 2023-06-20 18:20:06,501 INFO [train.py:996] (2/4) Epoch 4, batch 10600, loss[loss=0.2022, simple_loss=0.2677, pruned_loss=0.06838, over 21859.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3001, pruned_loss=0.0818, over 4276078.51 frames. 
], batch size: 107, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 18:20:32,877 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:20:42,880 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.468e+02 2.792e+02 3.375e+02 4.680e+02, threshold=5.585e+02, percent-clipped=0.0 2023-06-20 18:20:43,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=612624.0, ans=0.125 2023-06-20 18:20:58,187 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:21:11,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=612684.0, ans=22.5 2023-06-20 18:21:15,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=612684.0, ans=0.125 2023-06-20 18:21:19,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=612684.0, ans=0.125 2023-06-20 18:21:51,934 INFO [train.py:996] (2/4) Epoch 4, batch 10650, loss[loss=0.1809, simple_loss=0.2567, pruned_loss=0.05249, over 21759.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.301, pruned_loss=0.08093, over 4266307.78 frames. ], batch size: 282, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 18:22:20,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=612864.0, ans=0.125 2023-06-20 18:22:56,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=612924.0, ans=0.0 2023-06-20 18:23:04,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612984.0, ans=0.1 2023-06-20 18:23:43,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=613104.0, ans=0.125 2023-06-20 18:23:43,990 INFO [train.py:996] (2/4) Epoch 4, batch 10700, loss[loss=0.2361, simple_loss=0.3057, pruned_loss=0.08322, over 21627.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3019, pruned_loss=0.0807, over 4265997.77 frames. ], batch size: 263, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 18:23:52,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=613104.0, ans=0.0 2023-06-20 18:24:21,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=613164.0, ans=0.125 2023-06-20 18:24:25,578 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.621e+02 3.333e+02 3.923e+02 6.693e+02, threshold=6.666e+02, percent-clipped=4.0 2023-06-20 18:25:13,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=613344.0, ans=0.125 2023-06-20 18:25:17,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-06-20 18:25:22,499 INFO [train.py:996] (2/4) Epoch 4, batch 10750, loss[loss=0.3179, simple_loss=0.4021, pruned_loss=0.1169, over 21553.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3123, pruned_loss=0.08442, over 4264835.00 frames. 
], batch size: 508, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 18:26:25,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=613584.0, ans=0.0 2023-06-20 18:26:38,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=613584.0, ans=0.0 2023-06-20 18:27:06,303 INFO [train.py:996] (2/4) Epoch 4, batch 10800, loss[loss=0.3301, simple_loss=0.3844, pruned_loss=0.1379, over 21428.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3178, pruned_loss=0.08529, over 4267268.33 frames. ], batch size: 471, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 18:27:30,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=613704.0, ans=0.2 2023-06-20 18:27:43,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=15.0 2023-06-20 18:28:07,242 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.553e+02 2.986e+02 3.356e+02 5.834e+02, threshold=5.972e+02, percent-clipped=0.0 2023-06-20 18:28:40,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=613884.0, ans=0.125 2023-06-20 18:28:42,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-20 18:28:54,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=613944.0, ans=0.0 2023-06-20 18:29:04,045 INFO [train.py:996] (2/4) Epoch 4, batch 10850, loss[loss=0.214, simple_loss=0.2803, pruned_loss=0.07381, over 21664.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3188, pruned_loss=0.0861, over 4262474.79 frames. ], batch size: 333, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 18:29:27,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=614064.0, ans=0.125 2023-06-20 18:30:11,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=614184.0, ans=0.125 2023-06-20 18:30:14,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=614184.0, ans=0.125 2023-06-20 18:30:47,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=614304.0, ans=0.05 2023-06-20 18:30:48,121 INFO [train.py:996] (2/4) Epoch 4, batch 10900, loss[loss=0.1919, simple_loss=0.2485, pruned_loss=0.06762, over 21234.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3117, pruned_loss=0.08348, over 4260658.79 frames. 
], batch size: 549, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 18:30:51,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=614304.0, ans=0.2 2023-06-20 18:31:33,699 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.427e+02 2.861e+02 3.280e+02 5.229e+02, threshold=5.723e+02, percent-clipped=0.0 2023-06-20 18:31:49,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=614484.0, ans=0.1 2023-06-20 18:31:52,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=614484.0, ans=0.125 2023-06-20 18:32:30,222 INFO [train.py:996] (2/4) Epoch 4, batch 10950, loss[loss=0.224, simple_loss=0.2798, pruned_loss=0.0841, over 21671.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3078, pruned_loss=0.08242, over 4255401.94 frames. ], batch size: 248, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 18:32:36,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=614604.0, ans=0.0 2023-06-20 18:33:04,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=614724.0, ans=0.125 2023-06-20 18:33:15,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=614724.0, ans=0.125 2023-06-20 18:33:35,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=614784.0, ans=0.0 2023-06-20 18:33:40,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=614844.0, ans=0.125 2023-06-20 18:34:07,256 INFO [train.py:996] (2/4) Epoch 4, batch 11000, loss[loss=0.2494, simple_loss=0.3132, pruned_loss=0.09278, over 21853.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3054, pruned_loss=0.0832, over 4256226.75 frames. ], batch size: 371, lr: 8.13e-03, grad_scale: 16.0 2023-06-20 18:34:42,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.63 vs. limit=22.5 2023-06-20 18:35:02,765 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.436e+02 2.740e+02 3.123e+02 4.405e+02, threshold=5.481e+02, percent-clipped=0.0 2023-06-20 18:35:13,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=615024.0, ans=0.125 2023-06-20 18:35:25,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=615084.0, ans=0.0 2023-06-20 18:35:56,964 INFO [train.py:996] (2/4) Epoch 4, batch 11050, loss[loss=0.2069, simple_loss=0.269, pruned_loss=0.07245, over 21674.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3022, pruned_loss=0.08397, over 4265256.25 frames. 
], batch size: 282, lr: 8.13e-03, grad_scale: 16.0 2023-06-20 18:36:06,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=615204.0, ans=0.0 2023-06-20 18:36:21,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=615264.0, ans=0.125 2023-06-20 18:37:00,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=615384.0, ans=0.0 2023-06-20 18:37:33,926 INFO [train.py:996] (2/4) Epoch 4, batch 11100, loss[loss=0.2237, simple_loss=0.2879, pruned_loss=0.07972, over 21682.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3019, pruned_loss=0.0844, over 4272358.30 frames. ], batch size: 316, lr: 8.13e-03, grad_scale: 16.0 2023-06-20 18:38:11,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.701e+02 3.050e+02 3.889e+02 7.267e+02, threshold=6.099e+02, percent-clipped=1.0 2023-06-20 18:38:13,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=15.0 2023-06-20 18:38:32,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=615684.0, ans=0.1 2023-06-20 18:38:36,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=615684.0, ans=0.0 2023-06-20 18:38:41,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=615684.0, ans=0.0 2023-06-20 18:38:41,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-20 18:39:11,203 INFO [train.py:996] (2/4) Epoch 4, batch 11150, loss[loss=0.2775, simple_loss=0.3572, pruned_loss=0.09889, over 21543.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3007, pruned_loss=0.08354, over 4275549.88 frames. ], batch size: 441, lr: 8.12e-03, grad_scale: 16.0 2023-06-20 18:39:51,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=615924.0, ans=0.125 2023-06-20 18:40:43,603 INFO [train.py:996] (2/4) Epoch 4, batch 11200, loss[loss=0.2242, simple_loss=0.2785, pruned_loss=0.08491, over 21544.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3001, pruned_loss=0.08314, over 4277374.20 frames. ], batch size: 231, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 18:41:20,887 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.501e+02 3.007e+02 3.463e+02 5.262e+02, threshold=6.015e+02, percent-clipped=0.0 2023-06-20 18:41:24,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=616224.0, ans=0.2 2023-06-20 18:42:05,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=616344.0, ans=0.1 2023-06-20 18:42:19,903 INFO [train.py:996] (2/4) Epoch 4, batch 11250, loss[loss=0.2413, simple_loss=0.3003, pruned_loss=0.09122, over 21640.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.2994, pruned_loss=0.08346, over 4272211.77 frames. 
], batch size: 415, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 18:42:42,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=616464.0, ans=0.0 2023-06-20 18:42:59,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=616524.0, ans=0.125 2023-06-20 18:43:18,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=616584.0, ans=0.125 2023-06-20 18:43:55,851 INFO [train.py:996] (2/4) Epoch 4, batch 11300, loss[loss=0.1959, simple_loss=0.2761, pruned_loss=0.05782, over 21517.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3006, pruned_loss=0.08342, over 4285837.60 frames. ], batch size: 195, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 18:44:32,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 2.391e+02 2.715e+02 3.266e+02 4.844e+02, threshold=5.429e+02, percent-clipped=0.0 2023-06-20 18:44:44,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-20 18:44:54,726 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-20 18:45:22,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0 2023-06-20 18:45:31,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=616944.0, ans=15.0 2023-06-20 18:45:33,081 INFO [train.py:996] (2/4) Epoch 4, batch 11350, loss[loss=0.2538, simple_loss=0.326, pruned_loss=0.09081, over 21288.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3028, pruned_loss=0.08334, over 4285293.73 frames. ], batch size: 159, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 18:45:35,157 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:45:45,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=617004.0, ans=0.125 2023-06-20 18:46:22,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=617124.0, ans=0.125 2023-06-20 18:47:05,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=617184.0, ans=0.0 2023-06-20 18:47:19,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=617244.0, ans=0.0 2023-06-20 18:47:27,856 INFO [train.py:996] (2/4) Epoch 4, batch 11400, loss[loss=0.2407, simple_loss=0.3277, pruned_loss=0.07687, over 21863.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3104, pruned_loss=0.08661, over 4278644.87 frames. 
], batch size: 317, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 18:47:35,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=617304.0, ans=0.0 2023-06-20 18:48:09,861 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.547e+02 3.001e+02 3.571e+02 5.767e+02, threshold=6.003e+02, percent-clipped=1.0 2023-06-20 18:48:35,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=617424.0, ans=0.0 2023-06-20 18:49:10,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=617544.0, ans=0.0 2023-06-20 18:49:16,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=617544.0, ans=0.0 2023-06-20 18:49:20,890 INFO [train.py:996] (2/4) Epoch 4, batch 11450, loss[loss=0.2397, simple_loss=0.3103, pruned_loss=0.08455, over 20057.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3109, pruned_loss=0.08464, over 4272327.54 frames. ], batch size: 704, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 18:49:42,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=617604.0, ans=0.2 2023-06-20 18:49:44,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-20 18:49:53,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=617664.0, ans=0.125 2023-06-20 18:50:06,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=617724.0, ans=15.0 2023-06-20 18:50:16,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-20 18:50:28,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=617784.0, ans=0.2 2023-06-20 18:50:30,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=617784.0, ans=0.125 2023-06-20 18:50:44,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=617844.0, ans=0.125 2023-06-20 18:51:09,829 INFO [train.py:996] (2/4) Epoch 4, batch 11500, loss[loss=0.2009, simple_loss=0.2894, pruned_loss=0.05625, over 21478.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3141, pruned_loss=0.08591, over 4275734.96 frames. 
], batch size: 194, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 18:51:34,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=617904.0, ans=0.1 2023-06-20 18:51:50,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=617964.0, ans=0.125 2023-06-20 18:51:54,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.475e+02 2.866e+02 3.336e+02 5.251e+02, threshold=5.732e+02, percent-clipped=0.0 2023-06-20 18:52:21,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=618084.0, ans=0.125 2023-06-20 18:52:38,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-20 18:52:58,009 INFO [train.py:996] (2/4) Epoch 4, batch 11550, loss[loss=0.3313, simple_loss=0.4467, pruned_loss=0.1079, over 21205.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3207, pruned_loss=0.0865, over 4276511.84 frames. ], batch size: 548, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 18:53:28,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=618264.0, ans=0.125 2023-06-20 18:54:31,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=618384.0, ans=0.0 2023-06-20 18:54:54,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=618444.0, ans=0.2 2023-06-20 18:54:56,638 INFO [train.py:996] (2/4) Epoch 4, batch 11600, loss[loss=0.3323, simple_loss=0.4461, pruned_loss=0.1092, over 21613.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3334, pruned_loss=0.08845, over 4277417.54 frames. ], batch size: 441, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 18:54:58,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=618504.0, ans=0.125 2023-06-20 18:55:08,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-20 18:55:28,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=618564.0, ans=0.0 2023-06-20 18:55:40,400 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.611e+02 3.024e+02 3.614e+02 6.438e+02, threshold=6.048e+02, percent-clipped=2.0 2023-06-20 18:55:45,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-20 18:55:46,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=618624.0, ans=0.125 2023-06-20 18:56:00,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=618684.0, ans=0.0 2023-06-20 18:56:30,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=618744.0, ans=0.2 2023-06-20 18:56:34,569 INFO [train.py:996] (2/4) Epoch 4, batch 11650, loss[loss=0.2381, simple_loss=0.298, pruned_loss=0.08908, over 20120.00 frames. 
], tot_loss[loss=0.2573, simple_loss=0.3377, pruned_loss=0.08844, over 4267908.87 frames. ], batch size: 704, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 18:56:38,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=618804.0, ans=0.0 2023-06-20 18:57:56,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=619044.0, ans=0.125 2023-06-20 18:57:56,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=619044.0, ans=0.125 2023-06-20 18:58:03,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=619044.0, ans=0.125 2023-06-20 18:58:10,737 INFO [train.py:996] (2/4) Epoch 4, batch 11700, loss[loss=0.2192, simple_loss=0.2776, pruned_loss=0.08034, over 21426.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3278, pruned_loss=0.08734, over 4269420.03 frames. ], batch size: 195, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 18:58:21,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-20 18:58:22,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.55 vs. limit=22.5 2023-06-20 18:58:22,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=619104.0, ans=0.0 2023-06-20 18:58:47,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=619164.0, ans=0.1 2023-06-20 18:58:52,834 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.548e+02 2.868e+02 3.351e+02 5.665e+02, threshold=5.736e+02, percent-clipped=0.0 2023-06-20 18:59:51,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-20 18:59:56,763 INFO [train.py:996] (2/4) Epoch 4, batch 11750, loss[loss=0.2247, simple_loss=0.293, pruned_loss=0.07816, over 21719.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3192, pruned_loss=0.08658, over 4276052.46 frames. ], batch size: 282, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 19:00:18,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-20 19:00:24,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=619464.0, ans=0.0 2023-06-20 19:01:04,257 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:01:23,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0 2023-06-20 19:01:31,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-20 19:01:31,447 INFO [train.py:996] (2/4) Epoch 4, batch 11800, loss[loss=0.2364, simple_loss=0.3399, pruned_loss=0.06648, over 21271.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3221, pruned_loss=0.08874, over 4270597.34 frames. 
], batch size: 549, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 19:02:13,519 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.675e+02 3.120e+02 4.087e+02 6.326e+02, threshold=6.239e+02, percent-clipped=4.0 2023-06-20 19:02:22,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=15.0 2023-06-20 19:02:47,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=619884.0, ans=0.0 2023-06-20 19:02:56,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=619944.0, ans=0.125 2023-06-20 19:03:08,260 INFO [train.py:996] (2/4) Epoch 4, batch 11850, loss[loss=0.2238, simple_loss=0.3159, pruned_loss=0.06589, over 21648.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3231, pruned_loss=0.08807, over 4276673.60 frames. ], batch size: 263, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 19:03:17,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=620004.0, ans=0.1 2023-06-20 19:03:41,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=620064.0, ans=0.0 2023-06-20 19:03:49,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=620064.0, ans=0.125 2023-06-20 19:04:23,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=620184.0, ans=0.0 2023-06-20 19:04:23,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=620184.0, ans=0.125 2023-06-20 19:04:58,307 INFO [train.py:996] (2/4) Epoch 4, batch 11900, loss[loss=0.2355, simple_loss=0.329, pruned_loss=0.07103, over 21250.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3223, pruned_loss=0.08485, over 4272572.03 frames. ], batch size: 548, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:05:09,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=620304.0, ans=0.0 2023-06-20 19:05:22,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=620364.0, ans=0.04949747468305833 2023-06-20 19:05:23,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=620364.0, ans=0.125 2023-06-20 19:05:36,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 2.326e+02 2.650e+02 3.050e+02 4.543e+02, threshold=5.300e+02, percent-clipped=0.0 2023-06-20 19:05:38,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=620424.0, ans=0.0 2023-06-20 19:05:48,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=620424.0, ans=0.0 2023-06-20 19:05:52,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. 
limit=15.0 2023-06-20 19:05:56,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=620484.0, ans=0.125 2023-06-20 19:06:08,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=620484.0, ans=0.125 2023-06-20 19:06:10,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=620544.0, ans=0.125 2023-06-20 19:06:37,623 INFO [train.py:996] (2/4) Epoch 4, batch 11950, loss[loss=0.1944, simple_loss=0.2927, pruned_loss=0.0481, over 21741.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3215, pruned_loss=0.08137, over 4270704.26 frames. ], batch size: 316, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:06:53,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=620664.0, ans=0.125 2023-06-20 19:06:53,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=620664.0, ans=0.0 2023-06-20 19:07:03,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=620664.0, ans=0.125 2023-06-20 19:07:41,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=620784.0, ans=0.95 2023-06-20 19:07:53,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=22.5 2023-06-20 19:07:54,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=620844.0, ans=0.0 2023-06-20 19:08:14,757 INFO [train.py:996] (2/4) Epoch 4, batch 12000, loss[loss=0.209, simple_loss=0.2636, pruned_loss=0.07721, over 21365.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3183, pruned_loss=0.07999, over 4265533.75 frames. ], batch size: 160, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:08:14,758 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 19:09:03,794 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2647, simple_loss=0.362, pruned_loss=0.08364, over 1796401.00 frames. 2023-06-20 19:09:03,794 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 19:09:32,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=620964.0, ans=0.0 2023-06-20 19:09:47,055 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.456e+02 3.065e+02 4.141e+02 8.942e+02, threshold=6.129e+02, percent-clipped=11.0 2023-06-20 19:10:42,511 INFO [train.py:996] (2/4) Epoch 4, batch 12050, loss[loss=0.3106, simple_loss=0.3385, pruned_loss=0.1414, over 21717.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3145, pruned_loss=0.08261, over 4271156.13 frames. ], batch size: 508, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:11:36,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=621324.0, ans=0.0 2023-06-20 19:11:38,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=621324.0, ans=0.1 2023-06-20 19:12:21,156 INFO [train.py:996] (2/4) Epoch 4, batch 12100, loss[loss=0.2696, simple_loss=0.3366, pruned_loss=0.1013, over 21505.00 frames. 
], tot_loss[loss=0.246, simple_loss=0.3175, pruned_loss=0.08724, over 4275419.64 frames. ], batch size: 194, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:12:33,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=621504.0, ans=0.125 2023-06-20 19:13:06,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-20 19:13:09,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.714e+02 2.937e+02 3.590e+02 6.628e+02, threshold=5.874e+02, percent-clipped=1.0 2023-06-20 19:13:23,016 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.58 vs. limit=15.0 2023-06-20 19:13:42,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=621684.0, ans=0.125 2023-06-20 19:14:10,350 INFO [train.py:996] (2/4) Epoch 4, batch 12150, loss[loss=0.2174, simple_loss=0.3128, pruned_loss=0.06098, over 21719.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3223, pruned_loss=0.08712, over 4273660.41 frames. ], batch size: 298, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:14:13,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=621804.0, ans=0.0 2023-06-20 19:15:16,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=621984.0, ans=0.0 2023-06-20 19:15:47,744 INFO [train.py:996] (2/4) Epoch 4, batch 12200, loss[loss=0.2452, simple_loss=0.301, pruned_loss=0.09476, over 15076.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3187, pruned_loss=0.08624, over 4273112.43 frames. ], batch size: 61, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:15:52,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=622104.0, ans=0.2 2023-06-20 19:16:30,792 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.592e+02 3.097e+02 3.968e+02 7.788e+02, threshold=6.193e+02, percent-clipped=3.0 2023-06-20 19:17:18,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=622344.0, ans=0.2 2023-06-20 19:17:25,143 INFO [train.py:996] (2/4) Epoch 4, batch 12250, loss[loss=0.2074, simple_loss=0.289, pruned_loss=0.0629, over 21605.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3108, pruned_loss=0.08366, over 4269848.98 frames. 
], batch size: 414, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:17:31,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=622404.0, ans=0.0 2023-06-20 19:17:59,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=622464.0, ans=0.125 2023-06-20 19:18:05,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=622524.0, ans=0.0 2023-06-20 19:18:15,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=622524.0, ans=0.125 2023-06-20 19:18:32,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=622584.0, ans=0.2 2023-06-20 19:18:48,317 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:19:00,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=15.0 2023-06-20 19:19:00,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.53 vs. limit=10.0 2023-06-20 19:19:02,396 INFO [train.py:996] (2/4) Epoch 4, batch 12300, loss[loss=0.2151, simple_loss=0.3002, pruned_loss=0.06498, over 21462.00 frames. ], tot_loss[loss=0.23, simple_loss=0.305, pruned_loss=0.07748, over 4261188.38 frames. ], batch size: 471, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:19:09,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=622704.0, ans=0.125 2023-06-20 19:19:32,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=622764.0, ans=0.0 2023-06-20 19:19:45,606 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 2.123e+02 2.583e+02 3.049e+02 4.453e+02, threshold=5.165e+02, percent-clipped=0.0 2023-06-20 19:19:46,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=622824.0, ans=0.125 2023-06-20 19:20:08,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=622884.0, ans=0.2 2023-06-20 19:20:08,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=622884.0, ans=0.125 2023-06-20 19:20:11,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=622884.0, ans=0.0 2023-06-20 19:20:42,919 INFO [train.py:996] (2/4) Epoch 4, batch 12350, loss[loss=0.225, simple_loss=0.3028, pruned_loss=0.07362, over 21656.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.307, pruned_loss=0.07594, over 4258120.58 frames. ], batch size: 263, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:20:43,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=623004.0, ans=0.125 2023-06-20 19:20:47,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.42 vs. 
limit=22.5 2023-06-20 19:21:03,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=623064.0, ans=0.05 2023-06-20 19:21:04,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.46 vs. limit=15.0 2023-06-20 19:21:36,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=623184.0, ans=0.0 2023-06-20 19:21:42,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.58 vs. limit=15.0 2023-06-20 19:21:53,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=623244.0, ans=0.125 2023-06-20 19:21:57,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=623244.0, ans=0.2 2023-06-20 19:22:18,913 INFO [train.py:996] (2/4) Epoch 4, batch 12400, loss[loss=0.2387, simple_loss=0.2983, pruned_loss=0.08954, over 21789.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3088, pruned_loss=0.07999, over 4269645.63 frames. ], batch size: 247, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:23:01,739 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.547e+02 2.868e+02 3.412e+02 7.340e+02, threshold=5.736e+02, percent-clipped=3.0 2023-06-20 19:23:48,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=623544.0, ans=0.1 2023-06-20 19:23:55,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.42 vs. limit=15.0 2023-06-20 19:23:57,371 INFO [train.py:996] (2/4) Epoch 4, batch 12450, loss[loss=0.2768, simple_loss=0.3519, pruned_loss=0.1009, over 21418.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3133, pruned_loss=0.08326, over 4279398.06 frames. ], batch size: 131, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:24:06,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=623604.0, ans=0.125 2023-06-20 19:25:15,255 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-06-20 19:25:23,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=623844.0, ans=0.0 2023-06-20 19:25:43,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=623844.0, ans=0.0 2023-06-20 19:25:47,473 INFO [train.py:996] (2/4) Epoch 4, batch 12500, loss[loss=0.2835, simple_loss=0.3708, pruned_loss=0.09807, over 21271.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3242, pruned_loss=0.08753, over 4281197.87 frames. 
], batch size: 176, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:26:28,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.938e+02 3.235e+02 3.820e+02 6.603e+02, threshold=6.470e+02, percent-clipped=1.0 2023-06-20 19:26:32,229 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:26:42,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=624024.0, ans=0.0 2023-06-20 19:26:42,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=624024.0, ans=0.125 2023-06-20 19:27:26,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=624144.0, ans=0.1 2023-06-20 19:27:34,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=624144.0, ans=0.0 2023-06-20 19:27:43,431 INFO [train.py:996] (2/4) Epoch 4, batch 12550, loss[loss=0.246, simple_loss=0.3225, pruned_loss=0.08476, over 21172.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.33, pruned_loss=0.09063, over 4278614.55 frames. ], batch size: 143, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:28:22,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=624264.0, ans=0.1 2023-06-20 19:28:36,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=624324.0, ans=0.125 2023-06-20 19:28:44,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=624324.0, ans=0.0 2023-06-20 19:29:30,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=624444.0, ans=0.0 2023-06-20 19:29:38,563 INFO [train.py:996] (2/4) Epoch 4, batch 12600, loss[loss=0.2578, simple_loss=0.3442, pruned_loss=0.08572, over 21637.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.328, pruned_loss=0.08785, over 4273249.53 frames. 
], batch size: 414, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:29:41,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=624504.0, ans=0.125 2023-06-20 19:29:59,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=624564.0, ans=0.5 2023-06-20 19:30:22,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=624564.0, ans=0.2 2023-06-20 19:30:23,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=624564.0, ans=0.05 2023-06-20 19:30:29,137 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.456e+02 2.778e+02 3.111e+02 4.488e+02, threshold=5.555e+02, percent-clipped=0.0 2023-06-20 19:30:34,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=624624.0, ans=0.1 2023-06-20 19:31:05,892 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:31:21,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.49 vs. limit=15.0 2023-06-20 19:31:23,385 INFO [train.py:996] (2/4) Epoch 4, batch 12650, loss[loss=0.2232, simple_loss=0.2904, pruned_loss=0.07799, over 21817.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3195, pruned_loss=0.08365, over 4276628.94 frames. ], batch size: 282, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:32:02,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-20 19:32:19,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=624924.0, ans=0.125 2023-06-20 19:32:43,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=624984.0, ans=0.125 2023-06-20 19:33:06,883 INFO [train.py:996] (2/4) Epoch 4, batch 12700, loss[loss=0.2659, simple_loss=0.3325, pruned_loss=0.09962, over 21950.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3191, pruned_loss=0.08595, over 4279511.94 frames. 
], batch size: 316, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 19:33:09,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=625104.0, ans=0.125 2023-06-20 19:33:25,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=625104.0, ans=0.125 2023-06-20 19:33:51,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.673e+02 3.135e+02 3.667e+02 6.631e+02, threshold=6.269e+02, percent-clipped=1.0 2023-06-20 19:34:14,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=625284.0, ans=0.125 2023-06-20 19:34:18,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=625284.0, ans=0.125 2023-06-20 19:34:21,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=625284.0, ans=0.125 2023-06-20 19:34:32,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=625344.0, ans=0.125 2023-06-20 19:34:34,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.56 vs. limit=15.0 2023-06-20 19:34:43,217 INFO [train.py:996] (2/4) Epoch 4, batch 12750, loss[loss=0.2644, simple_loss=0.3393, pruned_loss=0.09472, over 21558.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3224, pruned_loss=0.08768, over 4270758.78 frames. ], batch size: 471, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 19:34:58,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=625404.0, ans=0.2 2023-06-20 19:35:06,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-20 19:35:39,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=625524.0, ans=0.125 2023-06-20 19:35:51,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=625524.0, ans=0.035 2023-06-20 19:35:51,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=625524.0, ans=0.2 2023-06-20 19:36:24,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=625644.0, ans=0.125 2023-06-20 19:36:34,918 INFO [train.py:996] (2/4) Epoch 4, batch 12800, loss[loss=0.2637, simple_loss=0.3287, pruned_loss=0.0994, over 21899.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3223, pruned_loss=0.08889, over 4277609.55 frames. 
], batch size: 316, lr: 8.06e-03, grad_scale: 32.0 2023-06-20 19:36:56,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=625764.0, ans=0.125 2023-06-20 19:37:23,910 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.419e+02 2.676e+02 3.175e+02 5.760e+02, threshold=5.353e+02, percent-clipped=0.0 2023-06-20 19:37:33,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=625824.0, ans=0.1 2023-06-20 19:38:37,383 INFO [train.py:996] (2/4) Epoch 4, batch 12850, loss[loss=0.2092, simple_loss=0.2986, pruned_loss=0.05992, over 21628.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.325, pruned_loss=0.09, over 4277310.67 frames. ], batch size: 263, lr: 8.06e-03, grad_scale: 32.0 2023-06-20 19:38:48,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=626004.0, ans=0.1 2023-06-20 19:39:39,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=12.0 2023-06-20 19:40:15,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=626244.0, ans=0.125 2023-06-20 19:40:22,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=626304.0, ans=0.125 2023-06-20 19:40:23,451 INFO [train.py:996] (2/4) Epoch 4, batch 12900, loss[loss=0.1985, simple_loss=0.2771, pruned_loss=0.06002, over 21345.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3213, pruned_loss=0.0862, over 4272377.02 frames. ], batch size: 176, lr: 8.06e-03, grad_scale: 32.0 2023-06-20 19:40:59,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=626424.0, ans=0.0 2023-06-20 19:41:02,026 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.262e+02 2.706e+02 3.188e+02 4.993e+02, threshold=5.411e+02, percent-clipped=0.0 2023-06-20 19:41:49,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=626484.0, ans=0.125 2023-06-20 19:42:13,984 INFO [train.py:996] (2/4) Epoch 4, batch 12950, loss[loss=0.27, simple_loss=0.3369, pruned_loss=0.1016, over 21814.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3192, pruned_loss=0.08389, over 4272678.94 frames. ], batch size: 118, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:42:27,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=626604.0, ans=0.0 2023-06-20 19:42:37,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=626664.0, ans=10.0 2023-06-20 19:43:36,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=626844.0, ans=0.0 2023-06-20 19:43:52,076 INFO [train.py:996] (2/4) Epoch 4, batch 13000, loss[loss=0.2251, simple_loss=0.3, pruned_loss=0.07516, over 21759.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.319, pruned_loss=0.08366, over 4273105.33 frames. 
], batch size: 118, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:43:52,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=626904.0, ans=0.125 2023-06-20 19:44:14,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=626964.0, ans=0.125 2023-06-20 19:44:18,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.77 vs. limit=15.0 2023-06-20 19:44:30,821 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 2.289e+02 2.689e+02 3.180e+02 4.201e+02, threshold=5.379e+02, percent-clipped=0.0 2023-06-20 19:44:49,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-20 19:44:58,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=627084.0, ans=0.07 2023-06-20 19:45:06,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=627084.0, ans=0.125 2023-06-20 19:45:29,788 INFO [train.py:996] (2/4) Epoch 4, batch 13050, loss[loss=0.2568, simple_loss=0.3229, pruned_loss=0.09533, over 21902.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3138, pruned_loss=0.08137, over 4276614.80 frames. ], batch size: 414, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:46:32,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=627324.0, ans=0.125 2023-06-20 19:46:38,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-20 19:47:03,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.17 vs. limit=15.0 2023-06-20 19:47:31,047 INFO [train.py:996] (2/4) Epoch 4, batch 13100, loss[loss=0.2796, simple_loss=0.351, pruned_loss=0.1041, over 21536.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3175, pruned_loss=0.08255, over 4277537.16 frames. ], batch size: 471, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:47:47,980 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-20 19:48:16,295 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 2.575e+02 2.978e+02 3.531e+02 5.580e+02, threshold=5.955e+02, percent-clipped=1.0 2023-06-20 19:48:42,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=627684.0, ans=0.1 2023-06-20 19:49:09,610 INFO [train.py:996] (2/4) Epoch 4, batch 13150, loss[loss=0.2204, simple_loss=0.2424, pruned_loss=0.0992, over 19925.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3183, pruned_loss=0.08586, over 4279980.92 frames. 
], batch size: 702, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:50:01,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=627924.0, ans=0.2 2023-06-20 19:50:10,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=627984.0, ans=0.0 2023-06-20 19:50:14,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=627984.0, ans=0.0 2023-06-20 19:51:04,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=628044.0, ans=0.125 2023-06-20 19:51:09,812 INFO [train.py:996] (2/4) Epoch 4, batch 13200, loss[loss=0.3115, simple_loss=0.3681, pruned_loss=0.1275, over 21430.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3165, pruned_loss=0.08609, over 4278587.01 frames. ], batch size: 471, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:51:54,645 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.426e+02 2.747e+02 3.167e+02 4.367e+02, threshold=5.495e+02, percent-clipped=0.0 2023-06-20 19:52:27,462 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:52:35,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=628344.0, ans=0.125 2023-06-20 19:52:36,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=628344.0, ans=0.125 2023-06-20 19:52:48,943 INFO [train.py:996] (2/4) Epoch 4, batch 13250, loss[loss=0.2403, simple_loss=0.3227, pruned_loss=0.07896, over 21824.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3168, pruned_loss=0.08682, over 4278186.44 frames. ], batch size: 351, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:52:57,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=628404.0, ans=0.125 2023-06-20 19:53:33,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=628464.0, ans=0.05 2023-06-20 19:53:36,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=628524.0, ans=0.1 2023-06-20 19:53:52,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=628524.0, ans=0.1 2023-06-20 19:53:57,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=628584.0, ans=0.0 2023-06-20 19:54:12,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=628644.0, ans=0.125 2023-06-20 19:54:27,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=628644.0, ans=0.125 2023-06-20 19:54:38,764 INFO [train.py:996] (2/4) Epoch 4, batch 13300, loss[loss=0.2629, simple_loss=0.3406, pruned_loss=0.09262, over 21694.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3202, pruned_loss=0.0866, over 4278510.28 frames. 
], batch size: 351, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:55:23,833 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.499e+02 2.765e+02 3.137e+02 6.068e+02, threshold=5.530e+02, percent-clipped=1.0 2023-06-20 19:55:29,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=628824.0, ans=0.1 2023-06-20 19:55:39,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=628884.0, ans=0.2 2023-06-20 19:56:00,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-06-20 19:56:17,435 INFO [train.py:996] (2/4) Epoch 4, batch 13350, loss[loss=0.2999, simple_loss=0.3768, pruned_loss=0.1115, over 21610.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3248, pruned_loss=0.08825, over 4267076.12 frames. ], batch size: 414, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:56:26,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=629004.0, ans=0.125 2023-06-20 19:56:47,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=629064.0, ans=10.0 2023-06-20 19:56:49,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=629064.0, ans=0.125 2023-06-20 19:57:52,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-20 19:57:54,438 INFO [train.py:996] (2/4) Epoch 4, batch 13400, loss[loss=0.2492, simple_loss=0.3196, pruned_loss=0.08939, over 21701.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3253, pruned_loss=0.0899, over 4276200.86 frames. ], batch size: 389, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:58:06,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=629304.0, ans=0.1 2023-06-20 19:58:38,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.621e+02 2.989e+02 3.422e+02 4.870e+02, threshold=5.978e+02, percent-clipped=0.0 2023-06-20 19:58:48,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=629424.0, ans=0.04949747468305833 2023-06-20 19:58:59,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=629484.0, ans=0.125 2023-06-20 19:59:37,647 INFO [train.py:996] (2/4) Epoch 4, batch 13450, loss[loss=0.2642, simple_loss=0.3296, pruned_loss=0.09944, over 21684.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3275, pruned_loss=0.09228, over 4276761.70 frames. ], batch size: 441, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:59:54,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-20 20:00:38,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=629784.0, ans=0.125 2023-06-20 20:01:21,159 INFO [train.py:996] (2/4) Epoch 4, batch 13500, loss[loss=0.2374, simple_loss=0.3501, pruned_loss=0.06236, over 20832.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3176, pruned_loss=0.08893, over 4269808.24 frames. 
], batch size: 607, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 20:01:26,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=629904.0, ans=0.1 2023-06-20 20:01:55,870 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.636e+02 2.928e+02 3.414e+02 6.823e+02, threshold=5.856e+02, percent-clipped=1.0 2023-06-20 20:02:15,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=630024.0, ans=0.0 2023-06-20 20:03:00,265 INFO [train.py:996] (2/4) Epoch 4, batch 13550, loss[loss=0.2526, simple_loss=0.3267, pruned_loss=0.08928, over 20072.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3217, pruned_loss=0.08915, over 4260636.14 frames. ], batch size: 702, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 20:03:02,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=630204.0, ans=0.125 2023-06-20 20:03:02,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=630204.0, ans=0.2 2023-06-20 20:03:14,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=630264.0, ans=0.0 2023-06-20 20:03:25,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-20 20:03:27,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=630264.0, ans=0.125 2023-06-20 20:03:52,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=630324.0, ans=0.125 2023-06-20 20:03:55,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-20 20:04:14,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=630384.0, ans=0.05 2023-06-20 20:04:35,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=630444.0, ans=0.125 2023-06-20 20:04:35,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=22.5 2023-06-20 20:04:37,766 INFO [train.py:996] (2/4) Epoch 4, batch 13600, loss[loss=0.2591, simple_loss=0.3262, pruned_loss=0.096, over 21794.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3229, pruned_loss=0.08999, over 4271853.68 frames. ], batch size: 441, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 20:04:59,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-20 20:05:17,495 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.746e+02 3.307e+02 4.061e+02 7.657e+02, threshold=6.614e+02, percent-clipped=3.0 2023-06-20 20:06:14,066 INFO [train.py:996] (2/4) Epoch 4, batch 13650, loss[loss=0.2261, simple_loss=0.2849, pruned_loss=0.08363, over 21201.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3167, pruned_loss=0.08615, over 4277081.56 frames. 
], batch size: 159, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 20:06:52,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=630924.0, ans=0.125 2023-06-20 20:07:07,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=630924.0, ans=0.125 2023-06-20 20:07:14,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=630984.0, ans=0.1 2023-06-20 20:07:34,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=630984.0, ans=0.035 2023-06-20 20:07:39,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=631044.0, ans=0.125 2023-06-20 20:07:52,652 INFO [train.py:996] (2/4) Epoch 4, batch 13700, loss[loss=0.1979, simple_loss=0.249, pruned_loss=0.07345, over 21208.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3124, pruned_loss=0.0858, over 4269502.38 frames. ], batch size: 159, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 20:08:11,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=631164.0, ans=0.0 2023-06-20 20:08:35,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.30 vs. limit=15.0 2023-06-20 20:08:39,110 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.840e+02 3.302e+02 4.432e+02 7.267e+02, threshold=6.603e+02, percent-clipped=2.0 2023-06-20 20:09:15,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=631344.0, ans=0.125 2023-06-20 20:09:18,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=631344.0, ans=0.125 2023-06-20 20:09:30,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=631404.0, ans=0.015 2023-06-20 20:09:31,696 INFO [train.py:996] (2/4) Epoch 4, batch 13750, loss[loss=0.1906, simple_loss=0.2594, pruned_loss=0.06086, over 21191.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.308, pruned_loss=0.08441, over 4260823.45 frames. ], batch size: 176, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:10:33,748 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:10:50,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=631584.0, ans=0.2 2023-06-20 20:10:51,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-20 20:10:57,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-20 20:11:01,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=631644.0, ans=0.125 2023-06-20 20:11:02,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.69 vs. 
limit=15.0 2023-06-20 20:11:18,439 INFO [train.py:996] (2/4) Epoch 4, batch 13800, loss[loss=0.291, simple_loss=0.3811, pruned_loss=0.1005, over 21891.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3159, pruned_loss=0.08329, over 4254126.44 frames. ], batch size: 372, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:11:18,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=631704.0, ans=0.125 2023-06-20 20:11:46,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=631704.0, ans=0.125 2023-06-20 20:11:52,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=631764.0, ans=0.2 2023-06-20 20:12:15,068 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.546e+02 3.323e+02 4.159e+02 7.075e+02, threshold=6.647e+02, percent-clipped=2.0 2023-06-20 20:12:16,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=631824.0, ans=15.0 2023-06-20 20:13:06,098 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:13:14,501 INFO [train.py:996] (2/4) Epoch 4, batch 13850, loss[loss=0.3437, simple_loss=0.4046, pruned_loss=0.1414, over 21468.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3214, pruned_loss=0.08372, over 4259738.32 frames. ], batch size: 508, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:13:29,990 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-20 20:13:38,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-20 20:13:42,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=8.0 2023-06-20 20:13:58,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=632124.0, ans=0.2 2023-06-20 20:14:07,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=632124.0, ans=0.125 2023-06-20 20:14:53,320 INFO [train.py:996] (2/4) Epoch 4, batch 13900, loss[loss=0.2619, simple_loss=0.3239, pruned_loss=0.09998, over 21452.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3251, pruned_loss=0.08766, over 4262222.21 frames. ], batch size: 548, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:15:39,756 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.652e+02 3.240e+02 3.689e+02 5.944e+02, threshold=6.479e+02, percent-clipped=0.0 2023-06-20 20:16:37,005 INFO [train.py:996] (2/4) Epoch 4, batch 13950, loss[loss=0.2476, simple_loss=0.3169, pruned_loss=0.08911, over 21932.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3264, pruned_loss=0.08967, over 4270941.85 frames. ], batch size: 316, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:16:47,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=632604.0, ans=0.125 2023-06-20 20:16:58,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.92 vs. 
limit=12.0 2023-06-20 20:17:16,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-20 20:18:13,421 INFO [train.py:996] (2/4) Epoch 4, batch 14000, loss[loss=0.1504, simple_loss=0.2173, pruned_loss=0.04174, over 21766.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3252, pruned_loss=0.08828, over 4276699.54 frames. ], batch size: 102, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:18:21,540 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-20 20:18:36,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=632964.0, ans=0.1 2023-06-20 20:18:53,731 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.412e+02 2.959e+02 3.758e+02 5.707e+02, threshold=5.918e+02, percent-clipped=0.0 2023-06-20 20:18:55,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=633024.0, ans=0.1 2023-06-20 20:19:29,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.45 vs. limit=10.0 2023-06-20 20:19:32,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=633144.0, ans=0.1 2023-06-20 20:19:41,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=633144.0, ans=0.125 2023-06-20 20:19:50,093 INFO [train.py:996] (2/4) Epoch 4, batch 14050, loss[loss=0.2012, simple_loss=0.2855, pruned_loss=0.05844, over 21559.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3194, pruned_loss=0.0843, over 4278840.92 frames. ], batch size: 230, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:20:25,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=633324.0, ans=0.125 2023-06-20 20:21:04,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=633444.0, ans=0.125 2023-06-20 20:21:11,043 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=8.0 2023-06-20 20:21:19,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=633504.0, ans=0.0 2023-06-20 20:21:25,550 INFO [train.py:996] (2/4) Epoch 4, batch 14100, loss[loss=0.2861, simple_loss=0.3451, pruned_loss=0.1135, over 21510.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3126, pruned_loss=0.08365, over 4275059.88 frames. 
], batch size: 389, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:21:31,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=633504.0, ans=0.125 2023-06-20 20:22:05,807 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.333e+02 2.713e+02 3.192e+02 5.959e+02, threshold=5.427e+02, percent-clipped=1.0 2023-06-20 20:22:06,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=633624.0, ans=0.125 2023-06-20 20:22:06,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=633624.0, ans=0.1 2023-06-20 20:22:45,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=633744.0, ans=0.125 2023-06-20 20:22:55,900 INFO [train.py:996] (2/4) Epoch 4, batch 14150, loss[loss=0.2306, simple_loss=0.3192, pruned_loss=0.07094, over 21876.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3164, pruned_loss=0.08499, over 4276646.90 frames. ], batch size: 317, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:23:15,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=633804.0, ans=0.125 2023-06-20 20:24:30,080 INFO [train.py:996] (2/4) Epoch 4, batch 14200, loss[loss=0.2403, simple_loss=0.3011, pruned_loss=0.08978, over 21515.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3137, pruned_loss=0.0825, over 4278908.83 frames. ], batch size: 195, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:24:36,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=634104.0, ans=0.2 2023-06-20 20:24:43,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=634104.0, ans=0.125 2023-06-20 20:25:15,334 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.204e+02 2.467e+02 2.859e+02 4.192e+02, threshold=4.934e+02, percent-clipped=0.0 2023-06-20 20:25:35,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=634284.0, ans=0.07 2023-06-20 20:26:06,333 INFO [train.py:996] (2/4) Epoch 4, batch 14250, loss[loss=0.1939, simple_loss=0.2563, pruned_loss=0.06572, over 21545.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3067, pruned_loss=0.08126, over 4264828.96 frames. ], batch size: 263, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:26:49,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=634524.0, ans=0.125 2023-06-20 20:27:50,752 INFO [train.py:996] (2/4) Epoch 4, batch 14300, loss[loss=0.3777, simple_loss=0.455, pruned_loss=0.1502, over 21579.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3096, pruned_loss=0.08156, over 4263229.12 frames. ], batch size: 441, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:28:01,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.54 vs. 
limit=12.0 2023-06-20 20:28:28,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=634764.0, ans=0.2 2023-06-20 20:28:31,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=634824.0, ans=0.125 2023-06-20 20:28:31,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=634824.0, ans=0.125 2023-06-20 20:28:33,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=634824.0, ans=0.125 2023-06-20 20:28:36,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=634824.0, ans=0.125 2023-06-20 20:28:38,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-06-20 20:28:38,737 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.547e+02 2.874e+02 3.518e+02 5.819e+02, threshold=5.747e+02, percent-clipped=3.0 2023-06-20 20:29:28,328 INFO [train.py:996] (2/4) Epoch 4, batch 14350, loss[loss=0.2714, simple_loss=0.341, pruned_loss=0.1009, over 21609.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3152, pruned_loss=0.08246, over 4252068.78 frames. ], batch size: 471, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:29:34,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=635004.0, ans=0.125 2023-06-20 20:30:02,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=635064.0, ans=0.95 2023-06-20 20:30:50,403 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-20 20:30:59,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=635244.0, ans=0.0 2023-06-20 20:31:02,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=635244.0, ans=0.125 2023-06-20 20:31:05,228 INFO [train.py:996] (2/4) Epoch 4, batch 14400, loss[loss=0.2282, simple_loss=0.2938, pruned_loss=0.08128, over 21716.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3116, pruned_loss=0.08277, over 4254101.77 frames. ], batch size: 332, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 20:31:35,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=635364.0, ans=0.125 2023-06-20 20:31:53,787 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.383e+02 2.685e+02 3.261e+02 5.760e+02, threshold=5.369e+02, percent-clipped=1.0 2023-06-20 20:32:33,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=635544.0, ans=0.1 2023-06-20 20:32:41,697 INFO [train.py:996] (2/4) Epoch 4, batch 14450, loss[loss=0.2094, simple_loss=0.272, pruned_loss=0.07334, over 21657.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3075, pruned_loss=0.08364, over 4244574.87 frames. 
], batch size: 231, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:32:48,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=635604.0, ans=0.125 2023-06-20 20:32:53,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-20 20:34:07,432 INFO [train.py:996] (2/4) Epoch 4, batch 14500, loss[loss=0.2382, simple_loss=0.3237, pruned_loss=0.07635, over 21751.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3039, pruned_loss=0.08288, over 4248064.60 frames. ], batch size: 351, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:34:21,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=635964.0, ans=0.125 2023-06-20 20:34:50,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=636024.0, ans=0.2 2023-06-20 20:34:56,071 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.605e+02 3.187e+02 4.339e+02 7.236e+02, threshold=6.375e+02, percent-clipped=9.0 2023-06-20 20:35:19,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=636144.0, ans=0.04949747468305833 2023-06-20 20:35:45,553 INFO [train.py:996] (2/4) Epoch 4, batch 14550, loss[loss=0.2744, simple_loss=0.3401, pruned_loss=0.1043, over 21848.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3112, pruned_loss=0.08569, over 4255789.91 frames. ], batch size: 247, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:35:47,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=636204.0, ans=0.04949747468305833 2023-06-20 20:37:03,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0 2023-06-20 20:37:30,026 INFO [train.py:996] (2/4) Epoch 4, batch 14600, loss[loss=0.2849, simple_loss=0.3695, pruned_loss=0.1001, over 21625.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3193, pruned_loss=0.08906, over 4255689.73 frames. ], batch size: 389, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:37:41,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=636504.0, ans=0.0 2023-06-20 20:37:43,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=636504.0, ans=0.2 2023-06-20 20:38:04,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=636624.0, ans=0.125 2023-06-20 20:38:12,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.825e+02 3.267e+02 4.052e+02 6.777e+02, threshold=6.533e+02, percent-clipped=1.0 2023-06-20 20:38:16,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=636624.0, ans=0.125 2023-06-20 20:38:32,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=636684.0, ans=0.125 2023-06-20 20:39:00,571 INFO [train.py:996] (2/4) Epoch 4, batch 14650, loss[loss=0.2166, simple_loss=0.3068, pruned_loss=0.06323, over 21736.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3207, pruned_loss=0.08794, over 4262340.89 frames. 
], batch size: 332, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:39:02,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=636804.0, ans=0.0 2023-06-20 20:39:19,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=636804.0, ans=0.125 2023-06-20 20:39:40,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=636924.0, ans=0.125 2023-06-20 20:39:43,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=636924.0, ans=0.125 2023-06-20 20:39:52,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=636924.0, ans=0.2 2023-06-20 20:40:13,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=636984.0, ans=0.0 2023-06-20 20:40:43,439 INFO [train.py:996] (2/4) Epoch 4, batch 14700, loss[loss=0.2504, simple_loss=0.3308, pruned_loss=0.08497, over 21479.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3139, pruned_loss=0.08202, over 4257146.51 frames. ], batch size: 508, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:40:51,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=637104.0, ans=0.125 2023-06-20 20:41:26,986 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.902e+02 2.312e+02 2.758e+02 4.493e+02, threshold=4.623e+02, percent-clipped=0.0 2023-06-20 20:41:52,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=637284.0, ans=0.125 2023-06-20 20:42:22,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=637344.0, ans=0.125 2023-06-20 20:42:27,589 INFO [train.py:996] (2/4) Epoch 4, batch 14750, loss[loss=0.2684, simple_loss=0.3311, pruned_loss=0.1029, over 21489.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3197, pruned_loss=0.08564, over 4251431.73 frames. ], batch size: 211, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:42:31,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=637404.0, ans=0.0 2023-06-20 20:42:35,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=637404.0, ans=0.0 2023-06-20 20:43:24,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=637584.0, ans=0.125 2023-06-20 20:44:09,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=637644.0, ans=0.2 2023-06-20 20:44:18,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=637704.0, ans=0.125 2023-06-20 20:44:19,802 INFO [train.py:996] (2/4) Epoch 4, batch 14800, loss[loss=0.2545, simple_loss=0.3252, pruned_loss=0.09188, over 15319.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3323, pruned_loss=0.09143, over 4251279.02 frames. ], batch size: 60, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:44:33,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.80 vs. 
limit=8.0 2023-06-20 20:44:36,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=637764.0, ans=0.2 2023-06-20 20:44:37,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=637764.0, ans=0.1 2023-06-20 20:45:05,427 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 3.268e+02 3.887e+02 5.109e+02 8.215e+02, threshold=7.774e+02, percent-clipped=33.0 2023-06-20 20:45:39,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=637944.0, ans=0.125 2023-06-20 20:45:56,121 INFO [train.py:996] (2/4) Epoch 4, batch 14850, loss[loss=0.2279, simple_loss=0.2878, pruned_loss=0.08403, over 21571.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3249, pruned_loss=0.09046, over 4251685.98 frames. ], batch size: 247, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:47:11,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=638124.0, ans=0.125 2023-06-20 20:47:12,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=638124.0, ans=0.0 2023-06-20 20:47:13,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=638124.0, ans=0.0 2023-06-20 20:47:24,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=638184.0, ans=10.0 2023-06-20 20:47:52,986 INFO [train.py:996] (2/4) Epoch 4, batch 14900, loss[loss=0.2593, simple_loss=0.3377, pruned_loss=0.09049, over 21408.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3251, pruned_loss=0.09138, over 4249299.18 frames. ], batch size: 131, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:48:09,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=638304.0, ans=0.125 2023-06-20 20:48:53,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-20 20:49:02,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.913e+02 3.593e+02 4.421e+02 7.605e+02, threshold=7.185e+02, percent-clipped=0.0 2023-06-20 20:49:06,985 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:49:07,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=638424.0, ans=22.5 2023-06-20 20:49:12,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.83 vs. limit=15.0 2023-06-20 20:49:51,619 INFO [train.py:996] (2/4) Epoch 4, batch 14950, loss[loss=0.302, simple_loss=0.3628, pruned_loss=0.1206, over 21411.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3274, pruned_loss=0.09158, over 4253287.58 frames. 
], batch size: 471, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:49:58,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=638604.0, ans=0.125 2023-06-20 20:49:58,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-06-20 20:50:34,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=638724.0, ans=0.04949747468305833 2023-06-20 20:50:37,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=638724.0, ans=0.1 2023-06-20 20:50:41,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=638724.0, ans=0.0 2023-06-20 20:51:29,185 INFO [train.py:996] (2/4) Epoch 4, batch 15000, loss[loss=0.2756, simple_loss=0.3432, pruned_loss=0.1039, over 21755.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3295, pruned_loss=0.09319, over 4256232.80 frames. ], batch size: 441, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:51:29,186 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 20:52:19,724 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7527, 4.2455, 3.9575, 4.2438], device='cuda:2') 2023-06-20 20:52:21,550 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2644, simple_loss=0.3595, pruned_loss=0.08463, over 1796401.00 frames. 2023-06-20 20:52:21,551 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 20:52:53,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=638964.0, ans=10.0 2023-06-20 20:53:05,094 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.416e+02 2.838e+02 3.359e+02 5.526e+02, threshold=5.676e+02, percent-clipped=0.0 2023-06-20 20:53:15,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=639024.0, ans=0.0 2023-06-20 20:54:04,797 INFO [train.py:996] (2/4) Epoch 4, batch 15050, loss[loss=0.2953, simple_loss=0.37, pruned_loss=0.1103, over 21663.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3289, pruned_loss=0.09415, over 4252096.87 frames. ], batch size: 441, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 20:54:43,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=639324.0, ans=0.2 2023-06-20 20:55:40,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=639504.0, ans=0.125 2023-06-20 20:55:41,724 INFO [train.py:996] (2/4) Epoch 4, batch 15100, loss[loss=0.2995, simple_loss=0.3722, pruned_loss=0.1134, over 21548.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3292, pruned_loss=0.09327, over 4243776.43 frames. ], batch size: 414, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 20:56:31,195 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.752e+02 3.160e+02 3.731e+02 5.799e+02, threshold=6.320e+02, percent-clipped=1.0 2023-06-20 20:57:17,864 INFO [train.py:996] (2/4) Epoch 4, batch 15150, loss[loss=0.251, simple_loss=0.3092, pruned_loss=0.09643, over 21426.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3255, pruned_loss=0.09328, over 4250488.66 frames. 
], batch size: 389, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 20:57:18,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=639804.0, ans=0.025 2023-06-20 20:57:25,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=639804.0, ans=0.0 2023-06-20 20:58:37,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=640044.0, ans=0.1 2023-06-20 20:58:43,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=640044.0, ans=0.05 2023-06-20 20:58:49,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=640044.0, ans=0.125 2023-06-20 20:58:53,780 INFO [train.py:996] (2/4) Epoch 4, batch 15200, loss[loss=0.2138, simple_loss=0.2796, pruned_loss=0.07406, over 21771.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3173, pruned_loss=0.08891, over 4251487.12 frames. ], batch size: 112, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 20:58:58,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=640104.0, ans=0.125 2023-06-20 20:59:04,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=640104.0, ans=0.0 2023-06-20 20:59:43,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.484e+02 2.960e+02 3.412e+02 6.984e+02, threshold=5.920e+02, percent-clipped=2.0 2023-06-20 20:59:44,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=640224.0, ans=0.1 2023-06-20 20:59:55,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=640284.0, ans=0.1 2023-06-20 21:00:14,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=22.5 2023-06-20 21:00:15,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=640344.0, ans=0.2 2023-06-20 21:00:18,722 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:00:29,813 INFO [train.py:996] (2/4) Epoch 4, batch 15250, loss[loss=0.253, simple_loss=0.3152, pruned_loss=0.09544, over 21792.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3114, pruned_loss=0.08693, over 4254141.03 frames. ], batch size: 112, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 21:01:03,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.77 vs. limit=22.5 2023-06-20 21:01:37,309 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:02:06,988 INFO [train.py:996] (2/4) Epoch 4, batch 15300, loss[loss=0.2855, simple_loss=0.3398, pruned_loss=0.1156, over 21762.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.314, pruned_loss=0.09029, over 4261776.28 frames. 
], batch size: 441, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 21:02:22,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.32 vs. limit=15.0 2023-06-20 21:03:12,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-20 21:03:13,354 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.730e+02 3.158e+02 3.743e+02 8.189e+02, threshold=6.315e+02, percent-clipped=2.0 2023-06-20 21:03:48,995 INFO [train.py:996] (2/4) Epoch 4, batch 15350, loss[loss=0.234, simple_loss=0.3408, pruned_loss=0.0636, over 21795.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3203, pruned_loss=0.09226, over 4266369.28 frames. ], batch size: 298, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 21:04:27,958 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:05:04,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-20 21:05:19,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.29 vs. limit=10.0 2023-06-20 21:05:39,759 INFO [train.py:996] (2/4) Epoch 4, batch 15400, loss[loss=0.2608, simple_loss=0.3246, pruned_loss=0.09852, over 21889.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3201, pruned_loss=0.0905, over 4262540.26 frames. ], batch size: 371, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 21:06:23,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.400e+02 2.718e+02 3.167e+02 5.553e+02, threshold=5.437e+02, percent-clipped=0.0 2023-06-20 21:06:33,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-20 21:06:49,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=641484.0, ans=0.125 2023-06-20 21:07:08,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=641544.0, ans=0.05 2023-06-20 21:07:15,588 INFO [train.py:996] (2/4) Epoch 4, batch 15450, loss[loss=0.2404, simple_loss=0.3143, pruned_loss=0.08329, over 21861.00 frames. ], tot_loss[loss=0.248, simple_loss=0.317, pruned_loss=0.08946, over 4273317.52 frames. ], batch size: 118, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 21:07:53,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=22.5 2023-06-20 21:08:14,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=641724.0, ans=0.0 2023-06-20 21:08:21,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-20 21:08:40,564 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:08:57,921 INFO [train.py:996] (2/4) Epoch 4, batch 15500, loss[loss=0.2737, simple_loss=0.349, pruned_loss=0.09919, over 21300.00 frames. 
], tot_loss[loss=0.249, simple_loss=0.3198, pruned_loss=0.08909, over 4262706.58 frames. ], batch size: 176, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 21:09:08,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=641904.0, ans=0.125 2023-06-20 21:09:48,558 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.680e+02 2.414e+02 2.758e+02 3.182e+02 6.018e+02, threshold=5.516e+02, percent-clipped=3.0 2023-06-20 21:09:52,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=642024.0, ans=0.125 2023-06-20 21:10:41,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=642144.0, ans=0.125 2023-06-20 21:10:43,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.88 vs. limit=15.0 2023-06-20 21:10:49,962 INFO [train.py:996] (2/4) Epoch 4, batch 15550, loss[loss=0.2215, simple_loss=0.3077, pruned_loss=0.06764, over 21689.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3193, pruned_loss=0.08738, over 4265854.71 frames. ], batch size: 298, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 21:12:25,919 INFO [train.py:996] (2/4) Epoch 4, batch 15600, loss[loss=0.2206, simple_loss=0.3031, pruned_loss=0.06906, over 21620.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3124, pruned_loss=0.0852, over 4261669.68 frames. ], batch size: 247, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 21:12:32,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=642504.0, ans=0.1 2023-06-20 21:13:32,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.378e+02 2.705e+02 3.069e+02 6.852e+02, threshold=5.411e+02, percent-clipped=1.0 2023-06-20 21:13:39,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=642684.0, ans=0.125 2023-06-20 21:13:39,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=642684.0, ans=0.125 2023-06-20 21:13:42,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=642684.0, ans=0.125 2023-06-20 21:14:17,075 INFO [train.py:996] (2/4) Epoch 4, batch 15650, loss[loss=0.237, simple_loss=0.3019, pruned_loss=0.08602, over 21866.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3136, pruned_loss=0.08455, over 4256495.34 frames. ], batch size: 107, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 21:14:22,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=642804.0, ans=0.0 2023-06-20 21:16:19,617 INFO [train.py:996] (2/4) Epoch 4, batch 15700, loss[loss=0.2236, simple_loss=0.3017, pruned_loss=0.07279, over 21538.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.31, pruned_loss=0.08421, over 4251318.72 frames. ], batch size: 389, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 21:17:16,186 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.295e+02 2.525e+02 2.882e+02 4.287e+02, threshold=5.050e+02, percent-clipped=0.0 2023-06-20 21:17:21,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. 
limit=15.0 2023-06-20 21:17:48,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-20 21:17:52,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=643344.0, ans=0.125 2023-06-20 21:17:58,121 INFO [train.py:996] (2/4) Epoch 4, batch 15750, loss[loss=0.2337, simple_loss=0.2972, pruned_loss=0.08508, over 21213.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3051, pruned_loss=0.08327, over 4252505.26 frames. ], batch size: 143, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 21:17:59,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=643404.0, ans=0.125 2023-06-20 21:18:52,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=643524.0, ans=0.05 2023-06-20 21:19:44,100 INFO [train.py:996] (2/4) Epoch 4, batch 15800, loss[loss=0.2084, simple_loss=0.2733, pruned_loss=0.07171, over 21526.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3009, pruned_loss=0.0833, over 4253959.29 frames. ], batch size: 230, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 21:19:50,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=643704.0, ans=0.125 2023-06-20 21:19:50,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=643704.0, ans=0.125 2023-06-20 21:20:51,223 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.525e+02 2.910e+02 3.653e+02 6.469e+02, threshold=5.821e+02, percent-clipped=2.0 2023-06-20 21:21:04,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=643884.0, ans=0.1 2023-06-20 21:21:34,822 INFO [train.py:996] (2/4) Epoch 4, batch 15850, loss[loss=0.2993, simple_loss=0.3479, pruned_loss=0.1253, over 21347.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.303, pruned_loss=0.08532, over 4252948.40 frames. ], batch size: 471, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 21:22:57,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=644244.0, ans=0.125 2023-06-20 21:22:57,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=644244.0, ans=0.125 2023-06-20 21:23:12,749 INFO [train.py:996] (2/4) Epoch 4, batch 15900, loss[loss=0.2493, simple_loss=0.3319, pruned_loss=0.08335, over 21523.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3022, pruned_loss=0.08596, over 4251539.61 frames. ], batch size: 389, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 21:23:18,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=644304.0, ans=0.2 2023-06-20 21:23:37,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=644364.0, ans=0.125 2023-06-20 21:24:03,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. 
limit=10.0 2023-06-20 21:24:05,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=644424.0, ans=0.125 2023-06-20 21:24:14,690 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.520e+02 3.020e+02 3.568e+02 5.069e+02, threshold=6.040e+02, percent-clipped=0.0 2023-06-20 21:24:46,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=644544.0, ans=0.125 2023-06-20 21:24:51,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=644544.0, ans=0.2 2023-06-20 21:24:53,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=644604.0, ans=10.0 2023-06-20 21:24:54,178 INFO [train.py:996] (2/4) Epoch 4, batch 15950, loss[loss=0.2108, simple_loss=0.2793, pruned_loss=0.07115, over 15870.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3018, pruned_loss=0.0839, over 4255408.01 frames. ], batch size: 62, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 21:25:16,612 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:25:21,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-20 21:25:22,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=644664.0, ans=0.125 2023-06-20 21:25:47,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-20 21:25:50,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.60 vs. limit=15.0 2023-06-20 21:26:22,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=644844.0, ans=0.1 2023-06-20 21:26:33,940 INFO [train.py:996] (2/4) Epoch 4, batch 16000, loss[loss=0.208, simple_loss=0.284, pruned_loss=0.066, over 21366.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3041, pruned_loss=0.08153, over 4246551.73 frames. 
], batch size: 176, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 21:26:41,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=644904.0, ans=0.1 2023-06-20 21:26:41,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=644904.0, ans=0.125 2023-06-20 21:26:46,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=644904.0, ans=0.125 2023-06-20 21:27:21,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=645024.0, ans=0.125 2023-06-20 21:27:32,069 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.333e+02 2.770e+02 3.330e+02 6.195e+02, threshold=5.540e+02, percent-clipped=2.0 2023-06-20 21:28:06,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=645144.0, ans=0.2 2023-06-20 21:28:09,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=645144.0, ans=0.125 2023-06-20 21:28:20,585 INFO [train.py:996] (2/4) Epoch 4, batch 16050, loss[loss=0.3126, simple_loss=0.4036, pruned_loss=0.1108, over 21661.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3069, pruned_loss=0.07938, over 4249453.72 frames. ], batch size: 441, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 21:29:40,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=645384.0, ans=0.125 2023-06-20 21:29:46,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=645444.0, ans=0.0 2023-06-20 21:30:05,003 INFO [train.py:996] (2/4) Epoch 4, batch 16100, loss[loss=0.2638, simple_loss=0.3215, pruned_loss=0.103, over 21828.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3101, pruned_loss=0.08086, over 4259366.71 frames. ], batch size: 441, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 21:30:15,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=645504.0, ans=0.125 2023-06-20 21:31:00,659 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.499e+02 3.072e+02 4.062e+02 6.589e+02, threshold=6.145e+02, percent-clipped=6.0 2023-06-20 21:31:12,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=645684.0, ans=0.0 2023-06-20 21:31:35,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=645744.0, ans=0.125 2023-06-20 21:31:41,506 INFO [train.py:996] (2/4) Epoch 4, batch 16150, loss[loss=0.2445, simple_loss=0.3168, pruned_loss=0.08608, over 21886.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3114, pruned_loss=0.08339, over 4266241.19 frames. 
], batch size: 332, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 21:32:09,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=645864.0, ans=0.95 2023-06-20 21:32:38,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=645924.0, ans=0.125 2023-06-20 21:32:45,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=645984.0, ans=0.04949747468305833 2023-06-20 21:33:10,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=646044.0, ans=0.0 2023-06-20 21:33:16,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=646104.0, ans=0.125 2023-06-20 21:33:17,671 INFO [train.py:996] (2/4) Epoch 4, batch 16200, loss[loss=0.2856, simple_loss=0.3586, pruned_loss=0.1062, over 21220.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3177, pruned_loss=0.08554, over 4274342.77 frames. ], batch size: 143, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 21:33:25,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=646104.0, ans=0.2 2023-06-20 21:33:41,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=646164.0, ans=0.2 2023-06-20 21:33:42,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-20 21:33:58,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.50 vs. limit=15.0 2023-06-20 21:34:14,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.796e+02 3.136e+02 3.765e+02 7.945e+02, threshold=6.271e+02, percent-clipped=2.0 2023-06-20 21:34:43,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=12.0 2023-06-20 21:34:53,533 INFO [train.py:996] (2/4) Epoch 4, batch 16250, loss[loss=0.1928, simple_loss=0.2579, pruned_loss=0.06381, over 21174.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3173, pruned_loss=0.0848, over 4273172.18 frames. ], batch size: 143, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 21:34:55,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=646404.0, ans=0.125 2023-06-20 21:35:43,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-20 21:36:30,014 INFO [train.py:996] (2/4) Epoch 4, batch 16300, loss[loss=0.1547, simple_loss=0.2057, pruned_loss=0.05186, over 17034.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3097, pruned_loss=0.08096, over 4253508.78 frames. ], batch size: 62, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 21:36:39,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=646704.0, ans=0.125 2023-06-20 21:37:02,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.16 vs. 
limit=22.5 2023-06-20 21:37:14,941 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:37:19,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=646824.0, ans=0.04949747468305833 2023-06-20 21:37:28,339 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 2.180e+02 2.582e+02 2.940e+02 4.968e+02, threshold=5.164e+02, percent-clipped=0.0 2023-06-20 21:37:51,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=646884.0, ans=0.0 2023-06-20 21:38:10,092 INFO [train.py:996] (2/4) Epoch 4, batch 16350, loss[loss=0.249, simple_loss=0.3182, pruned_loss=0.08993, over 21703.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3097, pruned_loss=0.08217, over 4256199.27 frames. ], batch size: 298, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 21:39:25,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=647184.0, ans=0.0 2023-06-20 21:39:33,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=647244.0, ans=0.125 2023-06-20 21:39:38,769 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-20 21:39:43,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=647244.0, ans=0.125 2023-06-20 21:39:53,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0 2023-06-20 21:39:55,228 INFO [train.py:996] (2/4) Epoch 4, batch 16400, loss[loss=0.226, simple_loss=0.2913, pruned_loss=0.08038, over 21806.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3129, pruned_loss=0.08387, over 4256463.00 frames. ], batch size: 247, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:40:21,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=647364.0, ans=0.1 2023-06-20 21:40:38,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=647424.0, ans=0.125 2023-06-20 21:40:52,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.622e+02 2.917e+02 3.537e+02 6.383e+02, threshold=5.834e+02, percent-clipped=4.0 2023-06-20 21:41:17,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=647544.0, ans=0.1 2023-06-20 21:41:32,876 INFO [train.py:996] (2/4) Epoch 4, batch 16450, loss[loss=0.2328, simple_loss=0.3032, pruned_loss=0.0812, over 21923.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3125, pruned_loss=0.08516, over 4265423.38 frames. 
], batch size: 124, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:41:49,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=647604.0, ans=0.1 2023-06-20 21:42:21,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=647724.0, ans=0.2 2023-06-20 21:42:30,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=647724.0, ans=0.0 2023-06-20 21:42:47,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=647784.0, ans=0.125 2023-06-20 21:43:11,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=647904.0, ans=0.125 2023-06-20 21:43:17,620 INFO [train.py:996] (2/4) Epoch 4, batch 16500, loss[loss=0.229, simple_loss=0.2991, pruned_loss=0.07944, over 21818.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3102, pruned_loss=0.08509, over 4275031.83 frames. ], batch size: 316, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:43:32,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2023-06-20 21:43:53,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=647964.0, ans=0.2 2023-06-20 21:44:04,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=648024.0, ans=0.125 2023-06-20 21:44:24,338 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.642e+02 3.003e+02 3.781e+02 6.239e+02, threshold=6.006e+02, percent-clipped=1.0 2023-06-20 21:44:44,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=648084.0, ans=0.1 2023-06-20 21:45:09,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=648144.0, ans=0.0 2023-06-20 21:45:10,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=648144.0, ans=0.0 2023-06-20 21:45:12,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=648144.0, ans=0.125 2023-06-20 21:45:32,253 INFO [train.py:996] (2/4) Epoch 4, batch 16550, loss[loss=0.2682, simple_loss=0.3469, pruned_loss=0.09473, over 19913.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3072, pruned_loss=0.08158, over 4274097.46 frames. 
], batch size: 702, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:45:57,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=648264.0, ans=0.125 2023-06-20 21:46:13,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=648324.0, ans=0.2 2023-06-20 21:46:38,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=648384.0, ans=0.125 2023-06-20 21:46:45,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=648384.0, ans=0.125 2023-06-20 21:46:47,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=648384.0, ans=0.125 2023-06-20 21:47:04,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=648384.0, ans=0.0 2023-06-20 21:47:25,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=648444.0, ans=0.125 2023-06-20 21:47:29,698 INFO [train.py:996] (2/4) Epoch 4, batch 16600, loss[loss=0.2724, simple_loss=0.3545, pruned_loss=0.09519, over 21156.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3179, pruned_loss=0.08608, over 4277884.43 frames. ], batch size: 143, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:48:01,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=648564.0, ans=0.0 2023-06-20 21:48:22,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.551e+02 2.918e+02 3.524e+02 5.305e+02, threshold=5.835e+02, percent-clipped=0.0 2023-06-20 21:48:44,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=648684.0, ans=0.125 2023-06-20 21:48:52,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=648744.0, ans=0.125 2023-06-20 21:49:13,944 INFO [train.py:996] (2/4) Epoch 4, batch 16650, loss[loss=0.2748, simple_loss=0.3457, pruned_loss=0.1019, over 21995.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3298, pruned_loss=0.09003, over 4277972.47 frames. ], batch size: 317, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:49:16,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=648804.0, ans=0.125 2023-06-20 21:49:19,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=648804.0, ans=0.0 2023-06-20 21:50:17,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=648924.0, ans=0.2 2023-06-20 21:50:42,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=649044.0, ans=0.125 2023-06-20 21:50:55,791 INFO [train.py:996] (2/4) Epoch 4, batch 16700, loss[loss=0.1875, simple_loss=0.2576, pruned_loss=0.05868, over 21453.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3305, pruned_loss=0.0908, over 4273237.40 frames. 
], batch size: 194, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 21:50:56,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.99 vs. limit=15.0 2023-06-20 21:52:00,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=649224.0, ans=0.125 2023-06-20 21:52:03,131 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.731e+02 3.045e+02 3.440e+02 5.036e+02, threshold=6.090e+02, percent-clipped=0.0 2023-06-20 21:52:03,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=649224.0, ans=0.1 2023-06-20 21:52:55,500 INFO [train.py:996] (2/4) Epoch 4, batch 16750, loss[loss=0.2731, simple_loss=0.3498, pruned_loss=0.09822, over 21931.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3324, pruned_loss=0.09243, over 4275776.59 frames. ], batch size: 317, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 21:53:14,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.46 vs. limit=10.0 2023-06-20 21:53:16,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=649404.0, ans=0.0 2023-06-20 21:53:39,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=649464.0, ans=0.04949747468305833 2023-06-20 21:53:48,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=649524.0, ans=0.0 2023-06-20 21:54:30,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=649644.0, ans=0.125 2023-06-20 21:54:32,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=649644.0, ans=0.1 2023-06-20 21:54:44,326 INFO [train.py:996] (2/4) Epoch 4, batch 16800, loss[loss=0.3125, simple_loss=0.3963, pruned_loss=0.1144, over 19874.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3371, pruned_loss=0.09289, over 4274243.78 frames. 
], batch size: 703, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 21:54:50,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=649704.0, ans=0.125 2023-06-20 21:55:44,092 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.879e+02 3.327e+02 4.049e+02 9.640e+02, threshold=6.654e+02, percent-clipped=7.0 2023-06-20 21:56:17,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=649884.0, ans=0.1 2023-06-20 21:56:21,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=649884.0, ans=0.125 2023-06-20 21:56:23,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=649884.0, ans=0.125 2023-06-20 21:56:33,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=649944.0, ans=0.125 2023-06-20 21:56:51,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=649944.0, ans=0.125 2023-06-20 21:56:54,187 INFO [train.py:996] (2/4) Epoch 4, batch 16850, loss[loss=0.2249, simple_loss=0.2898, pruned_loss=0.08, over 21809.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3335, pruned_loss=0.09294, over 4273251.20 frames. ], batch size: 247, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 21:57:18,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=650064.0, ans=0.125 2023-06-20 21:57:22,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=650064.0, ans=0.125 2023-06-20 21:57:26,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=650064.0, ans=22.5 2023-06-20 21:57:38,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-20 21:57:38,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=650124.0, ans=0.1 2023-06-20 21:57:54,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=650184.0, ans=0.0 2023-06-20 21:58:30,574 INFO [train.py:996] (2/4) Epoch 4, batch 16900, loss[loss=0.2057, simple_loss=0.2752, pruned_loss=0.06807, over 21568.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3278, pruned_loss=0.09099, over 4280872.98 frames. ], batch size: 230, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 21:58:37,167 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:59:12,233 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:59:16,042 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 2.471e+02 2.836e+02 3.403e+02 5.773e+02, threshold=5.671e+02, percent-clipped=0.0 2023-06-20 22:00:07,487 INFO [train.py:996] (2/4) Epoch 4, batch 16950, loss[loss=0.253, simple_loss=0.3104, pruned_loss=0.09779, over 21907.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3203, pruned_loss=0.08922, over 4278844.45 frames. 
], batch size: 316, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 22:01:08,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=650784.0, ans=0.07 2023-06-20 22:01:54,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=17.52 vs. limit=15.0 2023-06-20 22:02:01,824 INFO [train.py:996] (2/4) Epoch 4, batch 17000, loss[loss=0.2443, simple_loss=0.3086, pruned_loss=0.08997, over 21737.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3172, pruned_loss=0.08946, over 4287291.89 frames. ], batch size: 230, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 22:02:37,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651024.0, ans=0.1 2023-06-20 22:02:47,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.716e+02 3.273e+02 3.995e+02 5.622e+02, threshold=6.546e+02, percent-clipped=0.0 2023-06-20 22:03:28,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=651144.0, ans=0.125 2023-06-20 22:03:38,868 INFO [train.py:996] (2/4) Epoch 4, batch 17050, loss[loss=0.2741, simple_loss=0.3649, pruned_loss=0.09169, over 21802.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3252, pruned_loss=0.09271, over 4288762.88 frames. ], batch size: 332, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 22:03:58,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=651264.0, ans=0.125 2023-06-20 22:04:18,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=651324.0, ans=0.0 2023-06-20 22:04:18,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=651324.0, ans=0.1 2023-06-20 22:04:18,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=651324.0, ans=0.2 2023-06-20 22:04:45,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-20 22:04:47,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=651384.0, ans=0.0 2023-06-20 22:05:13,762 INFO [train.py:996] (2/4) Epoch 4, batch 17100, loss[loss=0.2652, simple_loss=0.3364, pruned_loss=0.09703, over 21776.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3248, pruned_loss=0.09334, over 4293078.19 frames. 
], batch size: 112, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 22:05:45,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=651564.0, ans=0.125 2023-06-20 22:05:59,547 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.611e+02 3.092e+02 3.570e+02 5.618e+02, threshold=6.184e+02, percent-clipped=0.0 2023-06-20 22:06:16,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=651684.0, ans=0.2 2023-06-20 22:06:20,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=651684.0, ans=0.0 2023-06-20 22:06:51,001 INFO [train.py:996] (2/4) Epoch 4, batch 17150, loss[loss=0.2127, simple_loss=0.2898, pruned_loss=0.06784, over 21667.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3191, pruned_loss=0.09191, over 4295962.80 frames. ], batch size: 389, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 22:07:33,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=651864.0, ans=0.5 2023-06-20 22:08:38,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-20 22:08:44,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=652044.0, ans=0.1 2023-06-20 22:08:45,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-06-20 22:08:48,887 INFO [train.py:996] (2/4) Epoch 4, batch 17200, loss[loss=0.2703, simple_loss=0.3345, pruned_loss=0.103, over 21768.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3189, pruned_loss=0.09108, over 4296440.57 frames. ], batch size: 441, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 22:09:32,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=652224.0, ans=0.125 2023-06-20 22:09:48,595 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.497e+02 2.913e+02 3.458e+02 6.747e+02, threshold=5.827e+02, percent-clipped=3.0 2023-06-20 22:10:26,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=652344.0, ans=0.125 2023-06-20 22:10:29,125 INFO [train.py:996] (2/4) Epoch 4, batch 17250, loss[loss=0.2491, simple_loss=0.3235, pruned_loss=0.08734, over 21466.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3222, pruned_loss=0.09266, over 4295373.99 frames. ], batch size: 211, lr: 7.89e-03, grad_scale: 32.0 2023-06-20 22:10:37,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=652404.0, ans=0.0 2023-06-20 22:11:15,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=652524.0, ans=0.125 2023-06-20 22:12:03,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-20 22:12:08,322 INFO [train.py:996] (2/4) Epoch 4, batch 17300, loss[loss=0.2948, simple_loss=0.3794, pruned_loss=0.1051, over 20801.00 frames. 
], tot_loss[loss=0.2633, simple_loss=0.3322, pruned_loss=0.09718, over 4296814.33 frames. ], batch size: 607, lr: 7.89e-03, grad_scale: 32.0 2023-06-20 22:12:52,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-06-20 22:13:00,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=652824.0, ans=0.0 2023-06-20 22:13:03,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=652824.0, ans=0.0 2023-06-20 22:13:13,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.676e+02 3.122e+02 3.587e+02 4.716e+02, threshold=6.244e+02, percent-clipped=0.0 2023-06-20 22:13:43,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2023-06-20 22:13:52,571 INFO [train.py:996] (2/4) Epoch 4, batch 17350, loss[loss=0.2845, simple_loss=0.3679, pruned_loss=0.1006, over 21486.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3321, pruned_loss=0.096, over 4288797.73 frames. ], batch size: 471, lr: 7.89e-03, grad_scale: 32.0 2023-06-20 22:14:17,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=653004.0, ans=0.125 2023-06-20 22:14:19,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-20 22:14:37,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-20 22:14:38,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=653064.0, ans=0.0 2023-06-20 22:15:01,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=653184.0, ans=0.0 2023-06-20 22:15:43,270 INFO [train.py:996] (2/4) Epoch 4, batch 17400, loss[loss=0.2629, simple_loss=0.3649, pruned_loss=0.08045, over 20699.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3293, pruned_loss=0.09159, over 4287828.00 frames. 
], batch size: 608, lr: 7.89e-03, grad_scale: 32.0 2023-06-20 22:16:11,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=653364.0, ans=10.0 2023-06-20 22:16:13,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=653364.0, ans=0.125 2023-06-20 22:16:38,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=653424.0, ans=0.125 2023-06-20 22:16:49,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.607e+02 3.173e+02 3.817e+02 7.670e+02, threshold=6.346e+02, percent-clipped=2.0 2023-06-20 22:17:20,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=653544.0, ans=0.07 2023-06-20 22:17:37,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=653544.0, ans=0.2 2023-06-20 22:17:40,512 INFO [train.py:996] (2/4) Epoch 4, batch 17450, loss[loss=0.2224, simple_loss=0.315, pruned_loss=0.06495, over 21588.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3223, pruned_loss=0.08787, over 4278364.19 frames. ], batch size: 441, lr: 7.89e-03, grad_scale: 32.0 2023-06-20 22:18:01,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=653664.0, ans=0.125 2023-06-20 22:19:11,783 INFO [train.py:996] (2/4) Epoch 4, batch 17500, loss[loss=0.2377, simple_loss=0.2954, pruned_loss=0.08998, over 21570.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3168, pruned_loss=0.08501, over 4282306.74 frames. ], batch size: 212, lr: 7.89e-03, grad_scale: 32.0 2023-06-20 22:20:00,467 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.227e+02 2.545e+02 2.907e+02 4.795e+02, threshold=5.089e+02, percent-clipped=0.0 2023-06-20 22:20:39,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-06-20 22:20:42,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=654144.0, ans=0.125 2023-06-20 22:20:46,786 INFO [train.py:996] (2/4) Epoch 4, batch 17550, loss[loss=0.2187, simple_loss=0.3067, pruned_loss=0.06542, over 21440.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3162, pruned_loss=0.08369, over 4291434.33 frames. ], batch size: 211, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 22:20:47,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.22 vs. 
limit=12.0 2023-06-20 22:20:51,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=654204.0, ans=0.0 2023-06-20 22:20:54,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=654204.0, ans=0.05 2023-06-20 22:21:08,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=654264.0, ans=0.125 2023-06-20 22:21:09,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=654264.0, ans=0.125 2023-06-20 22:21:11,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=654264.0, ans=0.2 2023-06-20 22:21:22,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-20 22:22:15,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=654444.0, ans=0.125 2023-06-20 22:22:22,681 INFO [train.py:996] (2/4) Epoch 4, batch 17600, loss[loss=0.2433, simple_loss=0.3192, pruned_loss=0.08377, over 21403.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3196, pruned_loss=0.08441, over 4270463.23 frames. ], batch size: 176, lr: 7.88e-03, grad_scale: 32.0 2023-06-20 22:22:37,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=654564.0, ans=0.2 2023-06-20 22:22:55,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=654564.0, ans=0.04949747468305833 2023-06-20 22:23:12,014 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.529e+02 3.022e+02 3.829e+02 6.673e+02, threshold=6.045e+02, percent-clipped=10.0 2023-06-20 22:23:19,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=654684.0, ans=0.125 2023-06-20 22:23:19,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=654684.0, ans=0.1 2023-06-20 22:23:20,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=654684.0, ans=0.0 2023-06-20 22:23:48,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-20 22:23:58,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=654804.0, ans=0.2 2023-06-20 22:23:59,704 INFO [train.py:996] (2/4) Epoch 4, batch 17650, loss[loss=0.2082, simple_loss=0.2851, pruned_loss=0.06567, over 21860.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3173, pruned_loss=0.08433, over 4262570.79 frames. ], batch size: 317, lr: 7.88e-03, grad_scale: 32.0 2023-06-20 22:24:03,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=654804.0, ans=0.0 2023-06-20 22:24:27,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. 
limit=12.0 2023-06-20 22:24:27,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=654864.0, ans=6.0 2023-06-20 22:24:37,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=654924.0, ans=0.0 2023-06-20 22:25:31,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-20 22:25:33,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=22.5 2023-06-20 22:25:36,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.57 vs. limit=15.0 2023-06-20 22:25:36,825 INFO [train.py:996] (2/4) Epoch 4, batch 17700, loss[loss=0.2418, simple_loss=0.3304, pruned_loss=0.0766, over 21677.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3122, pruned_loss=0.08196, over 4267563.53 frames. ], batch size: 332, lr: 7.88e-03, grad_scale: 32.0 2023-06-20 22:25:51,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-20 22:25:51,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=655104.0, ans=0.125 2023-06-20 22:25:52,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-20 22:26:12,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=655164.0, ans=0.0 2023-06-20 22:26:42,317 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.561e+02 3.035e+02 3.633e+02 7.633e+02, threshold=6.070e+02, percent-clipped=4.0 2023-06-20 22:27:25,850 INFO [train.py:996] (2/4) Epoch 4, batch 17750, loss[loss=0.2685, simple_loss=0.3416, pruned_loss=0.09765, over 21759.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3186, pruned_loss=0.08484, over 4263181.37 frames. ], batch size: 332, lr: 7.88e-03, grad_scale: 32.0 2023-06-20 22:27:40,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=655404.0, ans=0.125 2023-06-20 22:27:59,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=655464.0, ans=0.125 2023-06-20 22:28:34,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=655584.0, ans=0.0 2023-06-20 22:28:54,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=655584.0, ans=0.125 2023-06-20 22:29:23,617 INFO [train.py:996] (2/4) Epoch 4, batch 17800, loss[loss=0.2011, simple_loss=0.2797, pruned_loss=0.06126, over 21350.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3185, pruned_loss=0.08408, over 4261980.53 frames. 
], batch size: 211, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 22:29:31,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=655704.0, ans=0.95 2023-06-20 22:29:42,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.44 vs. limit=15.0 2023-06-20 22:29:50,379 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:30:08,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-20 22:30:30,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.510e+02 2.835e+02 3.285e+02 4.580e+02, threshold=5.670e+02, percent-clipped=0.0 2023-06-20 22:31:14,518 INFO [train.py:996] (2/4) Epoch 4, batch 17850, loss[loss=0.2494, simple_loss=0.3175, pruned_loss=0.09062, over 21625.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3186, pruned_loss=0.08392, over 4269796.65 frames. ], batch size: 230, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 22:31:45,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-20 22:31:56,498 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:32:44,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0 2023-06-20 22:32:44,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=656244.0, ans=0.125 2023-06-20 22:33:13,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.41 vs. limit=15.0 2023-06-20 22:33:25,136 INFO [train.py:996] (2/4) Epoch 4, batch 17900, loss[loss=0.283, simple_loss=0.3577, pruned_loss=0.1041, over 21595.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.325, pruned_loss=0.08705, over 4269252.14 frames. ], batch size: 389, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 22:33:25,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=656304.0, ans=0.125 2023-06-20 22:33:49,340 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:34:13,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. 
limit=15.0 2023-06-20 22:34:18,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=656424.0, ans=0.0 2023-06-20 22:34:19,837 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.504e+02 2.807e+02 3.225e+02 4.472e+02, threshold=5.613e+02, percent-clipped=0.0 2023-06-20 22:34:24,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=656484.0, ans=0.125 2023-06-20 22:34:52,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=656484.0, ans=0.0 2023-06-20 22:35:06,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=656544.0, ans=0.0 2023-06-20 22:35:20,775 INFO [train.py:996] (2/4) Epoch 4, batch 17950, loss[loss=0.2056, simple_loss=0.287, pruned_loss=0.06211, over 21497.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.324, pruned_loss=0.08367, over 4267473.31 frames. ], batch size: 195, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 22:35:47,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.21 vs. limit=12.0 2023-06-20 22:36:07,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-20 22:36:10,098 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.54 vs. limit=15.0 2023-06-20 22:36:15,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=656784.0, ans=0.125 2023-06-20 22:36:18,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=656784.0, ans=0.2 2023-06-20 22:36:58,651 INFO [train.py:996] (2/4) Epoch 4, batch 18000, loss[loss=0.2056, simple_loss=0.2627, pruned_loss=0.07422, over 21320.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.316, pruned_loss=0.08141, over 4263590.97 frames. ], batch size: 160, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 22:36:58,652 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 22:37:55,295 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.2266, 3.4380, 3.1673, 2.1530], device='cuda:2') 2023-06-20 22:37:57,973 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2692, simple_loss=0.3694, pruned_loss=0.08448, over 1796401.00 frames. 2023-06-20 22:37:57,974 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-20 22:38:40,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=657024.0, ans=0.125 2023-06-20 22:38:53,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 2.214e+02 2.674e+02 3.135e+02 4.981e+02, threshold=5.348e+02, percent-clipped=0.0 2023-06-20 22:39:25,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.01 vs. limit=6.0 2023-06-20 22:39:36,677 INFO [train.py:996] (2/4) Epoch 4, batch 18050, loss[loss=0.2493, simple_loss=0.328, pruned_loss=0.08531, over 21835.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3113, pruned_loss=0.08099, over 4255466.02 frames. 
], batch size: 124, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 22:39:40,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=657204.0, ans=0.0 2023-06-20 22:39:56,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=657264.0, ans=0.125 2023-06-20 22:39:56,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=657264.0, ans=0.0 2023-06-20 22:40:03,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.22 vs. limit=22.5 2023-06-20 22:40:16,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=657324.0, ans=0.125 2023-06-20 22:41:20,646 INFO [train.py:996] (2/4) Epoch 4, batch 18100, loss[loss=0.2487, simple_loss=0.3461, pruned_loss=0.07568, over 21679.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3163, pruned_loss=0.08388, over 4262064.94 frames. ], batch size: 351, lr: 7.86e-03, grad_scale: 32.0 2023-06-20 22:41:32,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=657504.0, ans=0.09899494936611666 2023-06-20 22:41:45,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=657564.0, ans=0.125 2023-06-20 22:41:48,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=657564.0, ans=0.0 2023-06-20 22:41:49,236 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-06-20 22:42:10,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=657624.0, ans=0.0 2023-06-20 22:42:21,098 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.529e+02 2.874e+02 3.352e+02 6.003e+02, threshold=5.748e+02, percent-clipped=2.0 2023-06-20 22:42:22,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=657684.0, ans=0.1 2023-06-20 22:42:33,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=657684.0, ans=0.1 2023-06-20 22:42:44,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=657744.0, ans=0.0 2023-06-20 22:42:58,558 INFO [train.py:996] (2/4) Epoch 4, batch 18150, loss[loss=0.258, simple_loss=0.3186, pruned_loss=0.09868, over 21498.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3186, pruned_loss=0.08436, over 4268187.98 frames. ], batch size: 441, lr: 7.86e-03, grad_scale: 32.0 2023-06-20 22:43:03,270 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:43:14,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.79 vs. 
limit=22.5 2023-06-20 22:43:14,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=657864.0, ans=15.0 2023-06-20 22:43:14,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-20 22:43:32,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657924.0, ans=0.1 2023-06-20 22:43:41,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=657924.0, ans=0.125 2023-06-20 22:43:45,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657924.0, ans=0.1 2023-06-20 22:44:36,738 INFO [train.py:996] (2/4) Epoch 4, batch 18200, loss[loss=0.2597, simple_loss=0.3119, pruned_loss=0.1037, over 21412.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3121, pruned_loss=0.0838, over 4271154.83 frames. ], batch size: 473, lr: 7.86e-03, grad_scale: 32.0 2023-06-20 22:44:52,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.76 vs. limit=15.0 2023-06-20 22:45:02,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-20 22:45:05,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=658224.0, ans=0.1 2023-06-20 22:45:32,362 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 2.336e+02 2.746e+02 3.220e+02 5.960e+02, threshold=5.492e+02, percent-clipped=1.0 2023-06-20 22:45:39,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-20 22:45:56,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-06-20 22:46:00,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=658344.0, ans=0.125 2023-06-20 22:46:08,035 INFO [train.py:996] (2/4) Epoch 4, batch 18250, loss[loss=0.1728, simple_loss=0.2474, pruned_loss=0.04911, over 21759.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3031, pruned_loss=0.07994, over 4276097.79 frames. ], batch size: 124, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 22:46:09,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=658404.0, ans=0.0 2023-06-20 22:46:12,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=658404.0, ans=0.0 2023-06-20 22:46:22,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=15.0 2023-06-20 22:46:49,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.80 vs. 
limit=15.0 2023-06-20 22:46:52,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-20 22:47:45,376 INFO [train.py:996] (2/4) Epoch 4, batch 18300, loss[loss=0.2647, simple_loss=0.3672, pruned_loss=0.08106, over 21876.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3042, pruned_loss=0.08043, over 4282004.63 frames. ], batch size: 316, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 22:47:49,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=658704.0, ans=0.5 2023-06-20 22:48:36,816 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.526e+02 2.841e+02 3.430e+02 6.700e+02, threshold=5.681e+02, percent-clipped=2.0 2023-06-20 22:49:20,935 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-20 22:49:22,775 INFO [train.py:996] (2/4) Epoch 4, batch 18350, loss[loss=0.2646, simple_loss=0.3688, pruned_loss=0.08016, over 20958.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3084, pruned_loss=0.08034, over 4269332.33 frames. ], batch size: 607, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 22:49:54,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=659064.0, ans=0.125 2023-06-20 22:50:04,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=659124.0, ans=0.5 2023-06-20 22:50:46,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=659244.0, ans=0.0 2023-06-20 22:50:52,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=659304.0, ans=0.125 2023-06-20 22:50:52,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=659304.0, ans=0.5 2023-06-20 22:50:53,554 INFO [train.py:996] (2/4) Epoch 4, batch 18400, loss[loss=0.2258, simple_loss=0.2906, pruned_loss=0.08053, over 21470.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3046, pruned_loss=0.07959, over 4274260.41 frames. ], batch size: 441, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 22:51:08,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-20 22:51:48,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=659424.0, ans=0.125 2023-06-20 22:52:08,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.306e+02 2.639e+02 3.073e+02 4.656e+02, threshold=5.278e+02, percent-clipped=0.0 2023-06-20 22:52:43,448 INFO [train.py:996] (2/4) Epoch 4, batch 18450, loss[loss=0.2139, simple_loss=0.2883, pruned_loss=0.0697, over 21597.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3001, pruned_loss=0.07521, over 4274241.21 frames. 
], batch size: 391, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 22:53:41,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=659784.0, ans=0.07 2023-06-20 22:54:19,821 INFO [train.py:996] (2/4) Epoch 4, batch 18500, loss[loss=0.2191, simple_loss=0.2977, pruned_loss=0.07026, over 21180.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2954, pruned_loss=0.07463, over 4267122.36 frames. ], batch size: 548, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 22:55:11,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=660024.0, ans=0.125 2023-06-20 22:55:22,934 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 2.377e+02 2.788e+02 3.386e+02 5.871e+02, threshold=5.576e+02, percent-clipped=1.0 2023-06-20 22:55:24,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=660084.0, ans=0.1 2023-06-20 22:55:52,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=660144.0, ans=0.125 2023-06-20 22:55:56,918 INFO [train.py:996] (2/4) Epoch 4, batch 18550, loss[loss=0.2391, simple_loss=0.2929, pruned_loss=0.09267, over 21758.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2951, pruned_loss=0.07403, over 4264637.05 frames. ], batch size: 102, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 22:56:10,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=660204.0, ans=0.125 2023-06-20 22:56:35,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=660264.0, ans=0.0 2023-06-20 22:57:45,991 INFO [train.py:996] (2/4) Epoch 4, batch 18600, loss[loss=0.2013, simple_loss=0.2681, pruned_loss=0.06724, over 21217.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2945, pruned_loss=0.07578, over 4263686.02 frames. ], batch size: 143, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 22:57:47,878 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:57:58,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=660504.0, ans=0.125 2023-06-20 22:58:39,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=660684.0, ans=0.1 2023-06-20 22:58:42,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.410e+02 2.851e+02 3.256e+02 5.084e+02, threshold=5.701e+02, percent-clipped=0.0 2023-06-20 22:59:00,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=660744.0, ans=0.125 2023-06-20 22:59:15,910 INFO [train.py:996] (2/4) Epoch 4, batch 18650, loss[loss=0.1982, simple_loss=0.267, pruned_loss=0.06472, over 21588.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2917, pruned_loss=0.07504, over 4262056.32 frames. ], batch size: 263, lr: 7.84e-03, grad_scale: 16.0 2023-06-20 22:59:31,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=660804.0, ans=0.2 2023-06-20 23:01:02,981 INFO [train.py:996] (2/4) Epoch 4, batch 18700, loss[loss=0.2513, simple_loss=0.3172, pruned_loss=0.09266, over 21865.00 frames. 
], tot_loss[loss=0.2234, simple_loss=0.292, pruned_loss=0.07742, over 4271840.87 frames. ], batch size: 118, lr: 7.84e-03, grad_scale: 16.0 2023-06-20 23:01:33,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=661164.0, ans=0.125 2023-06-20 23:02:00,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=661224.0, ans=0.1 2023-06-20 23:02:06,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=661224.0, ans=0.125 2023-06-20 23:02:25,371 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.377e+02 2.645e+02 3.014e+02 5.356e+02, threshold=5.290e+02, percent-clipped=0.0 2023-06-20 23:02:27,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=661284.0, ans=0.0 2023-06-20 23:03:03,473 INFO [train.py:996] (2/4) Epoch 4, batch 18750, loss[loss=0.257, simple_loss=0.3241, pruned_loss=0.09493, over 21394.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2959, pruned_loss=0.08095, over 4264437.58 frames. ], batch size: 211, lr: 7.84e-03, grad_scale: 16.0 2023-06-20 23:03:03,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=661404.0, ans=0.125 2023-06-20 23:03:20,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=661404.0, ans=0.02 2023-06-20 23:04:21,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=661584.0, ans=0.1 2023-06-20 23:04:21,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=661584.0, ans=0.1 2023-06-20 23:04:37,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-20 23:04:39,201 INFO [train.py:996] (2/4) Epoch 4, batch 18800, loss[loss=0.2329, simple_loss=0.3285, pruned_loss=0.06865, over 21339.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2994, pruned_loss=0.08165, over 4251175.09 frames. ], batch size: 548, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 23:04:41,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=661704.0, ans=0.2 2023-06-20 23:04:44,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=661704.0, ans=0.2 2023-06-20 23:04:50,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=661704.0, ans=0.125 2023-06-20 23:05:03,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=661764.0, ans=0.125 2023-06-20 23:05:11,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. 
limit=12.0 2023-06-20 23:05:45,986 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 2.344e+02 2.808e+02 3.404e+02 5.687e+02, threshold=5.616e+02, percent-clipped=4.0 2023-06-20 23:05:55,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=661884.0, ans=0.125 2023-06-20 23:05:55,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-20 23:06:02,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=661944.0, ans=0.125 2023-06-20 23:06:13,767 INFO [train.py:996] (2/4) Epoch 4, batch 18850, loss[loss=0.1743, simple_loss=0.2593, pruned_loss=0.04465, over 21356.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2956, pruned_loss=0.07664, over 4255985.11 frames. ], batch size: 211, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 23:06:34,949 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:06:42,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=662064.0, ans=0.125 2023-06-20 23:07:35,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=662244.0, ans=0.125 2023-06-20 23:07:51,166 INFO [train.py:996] (2/4) Epoch 4, batch 18900, loss[loss=0.2161, simple_loss=0.2733, pruned_loss=0.07949, over 14751.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2917, pruned_loss=0.07675, over 4245019.16 frames. ], batch size: 61, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 23:07:54,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=662304.0, ans=0.125 2023-06-20 23:08:35,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=12.0 2023-06-20 23:08:48,808 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 2.224e+02 2.639e+02 3.131e+02 5.534e+02, threshold=5.278e+02, percent-clipped=0.0 2023-06-20 23:09:07,639 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:09:23,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=662544.0, ans=0.125 2023-06-20 23:09:28,927 INFO [train.py:996] (2/4) Epoch 4, batch 18950, loss[loss=0.2622, simple_loss=0.351, pruned_loss=0.08667, over 21646.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2927, pruned_loss=0.07887, over 4248883.28 frames. ], batch size: 263, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 23:09:30,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-20 23:10:24,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-20 23:10:47,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-20 23:11:22,836 INFO [train.py:996] (2/4) Epoch 4, batch 19000, loss[loss=0.277, simple_loss=0.3432, pruned_loss=0.1054, over 21728.00 frames. 
], tot_loss[loss=0.2327, simple_loss=0.3027, pruned_loss=0.08134, over 4254237.09 frames. ], batch size: 298, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 23:12:21,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=663024.0, ans=0.0 2023-06-20 23:12:25,113 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.761e+02 3.068e+02 3.870e+02 6.410e+02, threshold=6.137e+02, percent-clipped=5.0 2023-06-20 23:12:33,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.05 vs. limit=22.5 2023-06-20 23:12:37,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=663144.0, ans=0.125 2023-06-20 23:12:58,988 INFO [train.py:996] (2/4) Epoch 4, batch 19050, loss[loss=0.3132, simple_loss=0.3508, pruned_loss=0.1378, over 21633.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3086, pruned_loss=0.08563, over 4267567.55 frames. ], batch size: 507, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 23:13:00,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=663204.0, ans=0.125 2023-06-20 23:13:14,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=663264.0, ans=0.0 2023-06-20 23:13:21,654 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:14:18,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=663444.0, ans=0.125 2023-06-20 23:14:35,607 INFO [train.py:996] (2/4) Epoch 4, batch 19100, loss[loss=0.2018, simple_loss=0.2723, pruned_loss=0.06566, over 21586.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3076, pruned_loss=0.08626, over 4278564.19 frames. ], batch size: 263, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 23:14:44,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.38 vs. limit=15.0 2023-06-20 23:15:08,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=663564.0, ans=0.2 2023-06-20 23:15:46,363 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.597e+02 2.867e+02 3.510e+02 4.906e+02, threshold=5.733e+02, percent-clipped=0.0 2023-06-20 23:16:19,236 INFO [train.py:996] (2/4) Epoch 4, batch 19150, loss[loss=0.3678, simple_loss=0.4417, pruned_loss=0.1469, over 21417.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3122, pruned_loss=0.08823, over 4277076.52 frames. 
], batch size: 507, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 23:16:46,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=663804.0, ans=0.0 2023-06-20 23:17:15,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=663864.0, ans=0.125 2023-06-20 23:17:25,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=663924.0, ans=0.125 2023-06-20 23:18:08,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=664104.0, ans=0.125 2023-06-20 23:18:09,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-20 23:18:09,658 INFO [train.py:996] (2/4) Epoch 4, batch 19200, loss[loss=0.2731, simple_loss=0.376, pruned_loss=0.08506, over 21653.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3203, pruned_loss=0.08792, over 4278372.21 frames. ], batch size: 389, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:18:23,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=664104.0, ans=0.125 2023-06-20 23:18:58,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.61 vs. limit=10.0 2023-06-20 23:19:08,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=664224.0, ans=0.125 2023-06-20 23:19:19,304 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 2.481e+02 2.914e+02 3.556e+02 7.085e+02, threshold=5.828e+02, percent-clipped=2.0 2023-06-20 23:19:22,598 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:19:45,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=664344.0, ans=0.125 2023-06-20 23:19:54,527 INFO [train.py:996] (2/4) Epoch 4, batch 19250, loss[loss=0.2838, simple_loss=0.4003, pruned_loss=0.08362, over 20758.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3207, pruned_loss=0.08247, over 4281541.25 frames. ], batch size: 607, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:20:28,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=664404.0, ans=0.125 2023-06-20 23:20:40,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=664464.0, ans=0.0 2023-06-20 23:21:05,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=664584.0, ans=0.2 2023-06-20 23:21:26,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-20 23:21:41,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=664704.0, ans=0.0 2023-06-20 23:21:42,208 INFO [train.py:996] (2/4) Epoch 4, batch 19300, loss[loss=0.1878, simple_loss=0.2756, pruned_loss=0.05005, over 21797.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3179, pruned_loss=0.08227, over 4292472.96 frames. 
], batch size: 102, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:21:42,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=664704.0, ans=0.0 2023-06-20 23:22:17,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=664764.0, ans=0.125 2023-06-20 23:22:22,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=664764.0, ans=0.0 2023-06-20 23:22:45,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=664884.0, ans=0.125 2023-06-20 23:22:49,106 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.366e+02 2.667e+02 3.289e+02 5.476e+02, threshold=5.333e+02, percent-clipped=0.0 2023-06-20 23:23:22,587 INFO [train.py:996] (2/4) Epoch 4, batch 19350, loss[loss=0.273, simple_loss=0.3455, pruned_loss=0.1002, over 21571.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.312, pruned_loss=0.07909, over 4288774.56 frames. ], batch size: 509, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:23:54,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=665064.0, ans=0.0 2023-06-20 23:24:45,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=665244.0, ans=0.125 2023-06-20 23:24:47,890 INFO [train.py:996] (2/4) Epoch 4, batch 19400, loss[loss=0.2267, simple_loss=0.2946, pruned_loss=0.07941, over 21843.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.308, pruned_loss=0.0775, over 4286927.55 frames. ], batch size: 247, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:25:07,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=665304.0, ans=0.125 2023-06-20 23:25:57,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 2.319e+02 2.845e+02 3.666e+02 6.009e+02, threshold=5.690e+02, percent-clipped=2.0 2023-06-20 23:26:14,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=665544.0, ans=0.125 2023-06-20 23:26:21,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=665544.0, ans=0.125 2023-06-20 23:26:29,721 INFO [train.py:996] (2/4) Epoch 4, batch 19450, loss[loss=0.2579, simple_loss=0.3014, pruned_loss=0.1072, over 21397.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3059, pruned_loss=0.07964, over 4287804.88 frames. ], batch size: 473, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 23:26:53,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=12.0 2023-06-20 23:27:25,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=665784.0, ans=0.1 2023-06-20 23:27:38,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. 
limit=10.0 2023-06-20 23:27:56,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=665844.0, ans=0.125 2023-06-20 23:28:07,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=665844.0, ans=0.2 2023-06-20 23:28:18,732 INFO [train.py:996] (2/4) Epoch 4, batch 19500, loss[loss=0.218, simple_loss=0.2859, pruned_loss=0.07504, over 21667.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3005, pruned_loss=0.08052, over 4280353.02 frames. ], batch size: 333, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 23:28:20,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=665904.0, ans=0.0 2023-06-20 23:28:49,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=665964.0, ans=0.1 2023-06-20 23:28:52,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=665964.0, ans=0.05 2023-06-20 23:28:58,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=666024.0, ans=0.125 2023-06-20 23:29:07,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=666024.0, ans=0.0 2023-06-20 23:29:09,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.58 vs. limit=15.0 2023-06-20 23:29:20,544 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.739e+02 3.335e+02 3.880e+02 8.921e+02, threshold=6.671e+02, percent-clipped=3.0 2023-06-20 23:29:34,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=666144.0, ans=0.0 2023-06-20 23:29:45,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=666144.0, ans=0.035 2023-06-20 23:29:47,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=666144.0, ans=0.0 2023-06-20 23:29:56,766 INFO [train.py:996] (2/4) Epoch 4, batch 19550, loss[loss=0.1731, simple_loss=0.2352, pruned_loss=0.05555, over 21855.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2963, pruned_loss=0.07887, over 4272244.75 frames. ], batch size: 107, lr: 7.81e-03, grad_scale: 16.0 2023-06-20 23:30:09,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=666204.0, ans=0.125 2023-06-20 23:30:14,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=666204.0, ans=0.125 2023-06-20 23:30:30,206 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:30:33,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-20 23:31:28,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. 
limit=15.0 2023-06-20 23:31:38,473 INFO [train.py:996] (2/4) Epoch 4, batch 19600, loss[loss=0.2404, simple_loss=0.3025, pruned_loss=0.08918, over 21937.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2997, pruned_loss=0.08019, over 4276732.98 frames. ], batch size: 316, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 23:31:53,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=666564.0, ans=0.125 2023-06-20 23:31:57,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=666564.0, ans=0.0 2023-06-20 23:32:11,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=666624.0, ans=0.2 2023-06-20 23:32:25,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=666624.0, ans=0.1 2023-06-20 23:32:32,600 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.820e+02 2.560e+02 2.903e+02 3.359e+02 5.140e+02, threshold=5.805e+02, percent-clipped=0.0 2023-06-20 23:32:37,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=666684.0, ans=0.125 2023-06-20 23:32:39,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=666684.0, ans=0.125 2023-06-20 23:32:42,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=666744.0, ans=0.125 2023-06-20 23:33:14,396 INFO [train.py:996] (2/4) Epoch 4, batch 19650, loss[loss=0.2712, simple_loss=0.3463, pruned_loss=0.09802, over 21783.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3063, pruned_loss=0.08499, over 4285419.75 frames. ], batch size: 124, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 23:33:17,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=666804.0, ans=0.1 2023-06-20 23:33:51,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=666924.0, ans=0.0 2023-06-20 23:34:51,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=667104.0, ans=0.0 2023-06-20 23:34:52,444 INFO [train.py:996] (2/4) Epoch 4, batch 19700, loss[loss=0.2225, simple_loss=0.275, pruned_loss=0.08501, over 21097.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3103, pruned_loss=0.08538, over 4287858.20 frames. ], batch size: 143, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 23:35:14,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=667164.0, ans=0.95 2023-06-20 23:36:21,833 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.712e+02 3.222e+02 3.780e+02 5.761e+02, threshold=6.445e+02, percent-clipped=0.0 2023-06-20 23:36:31,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=667284.0, ans=0.95 2023-06-20 23:36:33,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.36 vs. limit=5.0 2023-06-20 23:36:52,635 INFO [train.py:996] (2/4) Epoch 4, batch 19750, loss[loss=0.3988, simple_loss=0.4664, pruned_loss=0.1656, over 21457.00 frames. 
], tot_loss[loss=0.247, simple_loss=0.32, pruned_loss=0.08701, over 4285233.94 frames. ], batch size: 507, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 23:36:53,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=667404.0, ans=0.04949747468305833 2023-06-20 23:37:01,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=667404.0, ans=0.125 2023-06-20 23:37:01,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-20 23:37:02,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=667404.0, ans=0.2 2023-06-20 23:37:10,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=667404.0, ans=0.125 2023-06-20 23:37:18,006 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2023-06-20 23:38:12,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-20 23:38:39,847 INFO [train.py:996] (2/4) Epoch 4, batch 19800, loss[loss=0.2297, simple_loss=0.3117, pruned_loss=0.07384, over 21674.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3192, pruned_loss=0.08776, over 4283369.49 frames. ], batch size: 389, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:39:13,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=667764.0, ans=0.025 2023-06-20 23:39:28,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=667764.0, ans=0.125 2023-06-20 23:40:03,067 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.339e+02 2.763e+02 3.291e+02 5.461e+02, threshold=5.526e+02, percent-clipped=0.0 2023-06-20 23:40:34,090 INFO [train.py:996] (2/4) Epoch 4, batch 19850, loss[loss=0.2173, simple_loss=0.3044, pruned_loss=0.0651, over 21626.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3107, pruned_loss=0.08202, over 4277122.47 frames. ], batch size: 389, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:41:01,765 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-20 23:42:06,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=668244.0, ans=0.0 2023-06-20 23:42:11,833 INFO [train.py:996] (2/4) Epoch 4, batch 19900, loss[loss=0.2057, simple_loss=0.2664, pruned_loss=0.07249, over 15347.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3102, pruned_loss=0.07926, over 4262220.79 frames. ], batch size: 60, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:42:28,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-20 23:42:31,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. 
limit=15.0 2023-06-20 23:42:49,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=668364.0, ans=0.125 2023-06-20 23:43:18,253 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.481e+02 3.021e+02 3.710e+02 5.505e+02, threshold=6.042e+02, percent-clipped=0.0 2023-06-20 23:43:20,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=668484.0, ans=0.025 2023-06-20 23:43:46,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-20 23:43:49,818 INFO [train.py:996] (2/4) Epoch 4, batch 19950, loss[loss=0.2084, simple_loss=0.2768, pruned_loss=0.07003, over 21777.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3031, pruned_loss=0.07891, over 4270165.61 frames. ], batch size: 351, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:45:23,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=668844.0, ans=0.1 2023-06-20 23:45:32,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=668844.0, ans=0.0 2023-06-20 23:45:38,931 INFO [train.py:996] (2/4) Epoch 4, batch 20000, loss[loss=0.2201, simple_loss=0.3049, pruned_loss=0.0676, over 19818.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3053, pruned_loss=0.07897, over 4270367.13 frames. ], batch size: 702, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:45:43,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=668904.0, ans=0.2 2023-06-20 23:46:24,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=668964.0, ans=0.125 2023-06-20 23:46:52,070 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.484e+02 2.763e+02 3.251e+02 5.110e+02, threshold=5.527e+02, percent-clipped=0.0 2023-06-20 23:47:05,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=669084.0, ans=0.1 2023-06-20 23:47:22,970 INFO [train.py:996] (2/4) Epoch 4, batch 20050, loss[loss=0.2291, simple_loss=0.3011, pruned_loss=0.07859, over 21859.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3078, pruned_loss=0.08177, over 4281796.96 frames. 
], batch size: 282, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:47:45,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=669264.0, ans=0.1 2023-06-20 23:48:33,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=669384.0, ans=0.125 2023-06-20 23:48:38,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=669384.0, ans=0.125 2023-06-20 23:48:38,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=669384.0, ans=0.1 2023-06-20 23:48:46,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=669444.0, ans=0.125 2023-06-20 23:49:06,457 INFO [train.py:996] (2/4) Epoch 4, batch 20100, loss[loss=0.2633, simple_loss=0.3441, pruned_loss=0.09125, over 21385.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3111, pruned_loss=0.0845, over 4292200.16 frames. ], batch size: 194, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:49:07,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=12.0 2023-06-20 23:49:19,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-20 23:49:58,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=669624.0, ans=0.0 2023-06-20 23:50:12,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.516e+02 2.838e+02 3.321e+02 5.094e+02, threshold=5.677e+02, percent-clipped=0.0 2023-06-20 23:50:44,843 INFO [train.py:996] (2/4) Epoch 4, batch 20150, loss[loss=0.3023, simple_loss=0.368, pruned_loss=0.1183, over 21589.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3215, pruned_loss=0.08832, over 4293449.96 frames. ], batch size: 389, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:50:55,475 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:53:08,049 INFO [train.py:996] (2/4) Epoch 4, batch 20200, loss[loss=0.2425, simple_loss=0.3232, pruned_loss=0.08089, over 21781.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3275, pruned_loss=0.09144, over 4289018.46 frames. 
], batch size: 282, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:53:08,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=670104.0, ans=10.0 2023-06-20 23:53:17,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=670104.0, ans=0.04949747468305833 2023-06-20 23:53:50,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=670224.0, ans=0.1 2023-06-20 23:54:20,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.584e+02 3.030e+02 3.747e+02 5.271e+02, threshold=6.060e+02, percent-clipped=0.0 2023-06-20 23:54:53,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=670344.0, ans=0.0 2023-06-20 23:54:56,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=670344.0, ans=0.025 2023-06-20 23:55:04,485 INFO [train.py:996] (2/4) Epoch 4, batch 20250, loss[loss=0.2269, simple_loss=0.2999, pruned_loss=0.07695, over 21298.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3278, pruned_loss=0.08982, over 4279797.68 frames. ], batch size: 143, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:55:25,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=670464.0, ans=0.125 2023-06-20 23:55:50,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=670524.0, ans=0.1 2023-06-20 23:56:35,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=670644.0, ans=0.0 2023-06-20 23:56:58,674 INFO [train.py:996] (2/4) Epoch 4, batch 20300, loss[loss=0.2714, simple_loss=0.3546, pruned_loss=0.09406, over 21473.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3239, pruned_loss=0.08554, over 4279588.62 frames. ], batch size: 471, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:57:53,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.338e+02 2.634e+02 2.984e+02 5.802e+02, threshold=5.268e+02, percent-clipped=0.0 2023-06-20 23:58:11,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=670944.0, ans=0.125 2023-06-20 23:58:29,867 INFO [train.py:996] (2/4) Epoch 4, batch 20350, loss[loss=0.2626, simple_loss=0.329, pruned_loss=0.09809, over 21815.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3238, pruned_loss=0.08588, over 4265941.26 frames. ], batch size: 351, lr: 7.78e-03, grad_scale: 32.0 2023-06-20 23:58:56,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671064.0, ans=0.1 2023-06-20 23:59:01,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=671064.0, ans=0.0 2023-06-20 23:59:34,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. 
limit=12.0 2023-06-20 23:59:51,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671244.0, ans=0.1 2023-06-20 23:59:54,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=671244.0, ans=0.125 2023-06-21 00:00:05,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-21 00:00:05,660 INFO [train.py:996] (2/4) Epoch 4, batch 20400, loss[loss=0.2729, simple_loss=0.3323, pruned_loss=0.1068, over 21229.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3244, pruned_loss=0.08737, over 4254914.52 frames. ], batch size: 143, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:00:38,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=671364.0, ans=0.0 2023-06-21 00:00:43,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671424.0, ans=0.1 2023-06-21 00:00:54,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=671424.0, ans=0.125 2023-06-21 00:01:01,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=671484.0, ans=0.0 2023-06-21 00:01:05,908 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.821e+02 3.170e+02 3.737e+02 5.615e+02, threshold=6.339e+02, percent-clipped=2.0 2023-06-21 00:01:20,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=671544.0, ans=0.1 2023-06-21 00:01:47,155 INFO [train.py:996] (2/4) Epoch 4, batch 20450, loss[loss=0.2873, simple_loss=0.3409, pruned_loss=0.1168, over 21581.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3259, pruned_loss=0.09007, over 4245676.83 frames. ], batch size: 471, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:01:49,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=671604.0, ans=0.04949747468305833 2023-06-21 00:01:56,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=671604.0, ans=0.125 2023-06-21 00:02:15,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=671724.0, ans=0.125 2023-06-21 00:02:15,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=15.0 2023-06-21 00:02:16,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=671724.0, ans=0.125 2023-06-21 00:02:16,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=671724.0, ans=0.125 2023-06-21 00:02:56,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=671844.0, ans=0.125 2023-06-21 00:03:02,474 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:03:06,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=671844.0, ans=0.2 2023-06-21 00:03:16,203 INFO [train.py:996] (2/4) Epoch 4, batch 20500, loss[loss=0.2179, simple_loss=0.2842, pruned_loss=0.07578, over 21668.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.323, pruned_loss=0.09117, over 4262205.84 frames. ], batch size: 231, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:03:20,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-21 00:03:25,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=671904.0, ans=0.0 2023-06-21 00:04:07,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=672024.0, ans=0.1 2023-06-21 00:04:14,696 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.579e+02 2.936e+02 3.494e+02 5.643e+02, threshold=5.872e+02, percent-clipped=0.0 2023-06-21 00:04:56,151 INFO [train.py:996] (2/4) Epoch 4, batch 20550, loss[loss=0.2973, simple_loss=0.3933, pruned_loss=0.1007, over 19783.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3165, pruned_loss=0.08979, over 4249615.63 frames. ], batch size: 702, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:04:56,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=672204.0, ans=0.125 2023-06-21 00:05:01,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-21 00:05:04,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=672204.0, ans=0.0 2023-06-21 00:05:35,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=672324.0, ans=10.0 2023-06-21 00:05:37,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=672324.0, ans=0.2 2023-06-21 00:05:41,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=12.0 2023-06-21 00:05:56,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=672384.0, ans=0.125 2023-06-21 00:06:42,212 INFO [train.py:996] (2/4) Epoch 4, batch 20600, loss[loss=0.2054, simple_loss=0.2879, pruned_loss=0.06141, over 19986.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3195, pruned_loss=0.08856, over 4238594.64 frames. 
], batch size: 703, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:06:43,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=672504.0, ans=0.0 2023-06-21 00:06:47,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-21 00:07:03,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=672564.0, ans=10.0 2023-06-21 00:07:03,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=672564.0, ans=0.5 2023-06-21 00:07:07,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=672564.0, ans=0.09899494936611666 2023-06-21 00:07:09,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=672564.0, ans=0.0 2023-06-21 00:07:41,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=672684.0, ans=0.1 2023-06-21 00:07:42,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.517e+02 2.771e+02 3.210e+02 6.753e+02, threshold=5.541e+02, percent-clipped=2.0 2023-06-21 00:07:48,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=672684.0, ans=0.0 2023-06-21 00:08:18,035 INFO [train.py:996] (2/4) Epoch 4, batch 20650, loss[loss=0.2173, simple_loss=0.2773, pruned_loss=0.07867, over 21600.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3146, pruned_loss=0.08823, over 4239111.94 frames. ], batch size: 263, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:08:19,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=672804.0, ans=0.0 2023-06-21 00:08:35,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=672864.0, ans=0.125 2023-06-21 00:09:05,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=672924.0, ans=0.125 2023-06-21 00:09:34,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0 2023-06-21 00:09:42,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=673044.0, ans=0.0 2023-06-21 00:09:44,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=22.5 2023-06-21 00:09:54,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=673044.0, ans=0.125 2023-06-21 00:09:56,821 INFO [train.py:996] (2/4) Epoch 4, batch 20700, loss[loss=0.3171, simple_loss=0.3804, pruned_loss=0.1269, over 21502.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3072, pruned_loss=0.08467, over 4240363.60 frames. 
], batch size: 508, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:10:20,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=673164.0, ans=0.1 2023-06-21 00:10:32,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=673224.0, ans=0.5 2023-06-21 00:11:07,544 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.300e+02 2.582e+02 3.075e+02 5.238e+02, threshold=5.163e+02, percent-clipped=0.0 2023-06-21 00:11:09,534 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:11:20,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-21 00:11:27,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=673284.0, ans=0.1 2023-06-21 00:11:31,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=673344.0, ans=0.0 2023-06-21 00:11:42,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=673344.0, ans=0.0 2023-06-21 00:11:44,539 INFO [train.py:996] (2/4) Epoch 4, batch 20750, loss[loss=0.2419, simple_loss=0.3298, pruned_loss=0.07695, over 21464.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.308, pruned_loss=0.08361, over 4244201.88 frames. ], batch size: 211, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:12:36,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=673524.0, ans=0.2 2023-06-21 00:12:42,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=673524.0, ans=0.125 2023-06-21 00:13:02,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=673584.0, ans=0.0 2023-06-21 00:13:23,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=673644.0, ans=0.125 2023-06-21 00:13:28,257 INFO [train.py:996] (2/4) Epoch 4, batch 20800, loss[loss=0.2345, simple_loss=0.2949, pruned_loss=0.08699, over 21468.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3147, pruned_loss=0.08575, over 4245222.64 frames. ], batch size: 441, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:13:41,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.07 vs. limit=10.0 2023-06-21 00:14:08,006 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:14:12,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=673824.0, ans=0.04949747468305833 2023-06-21 00:14:19,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. 
limit=15.0 2023-06-21 00:14:39,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 2.643e+02 3.014e+02 3.715e+02 5.359e+02, threshold=6.029e+02, percent-clipped=3.0 2023-06-21 00:15:04,246 INFO [train.py:996] (2/4) Epoch 4, batch 20850, loss[loss=0.2102, simple_loss=0.2769, pruned_loss=0.07177, over 21620.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3064, pruned_loss=0.08292, over 4243847.82 frames. ], batch size: 230, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:15:05,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-21 00:15:38,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=674064.0, ans=0.125 2023-06-21 00:15:40,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=674064.0, ans=0.0 2023-06-21 00:16:10,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=674124.0, ans=0.125 2023-06-21 00:16:36,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=674184.0, ans=0.1 2023-06-21 00:16:36,709 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-21 00:16:54,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=674244.0, ans=0.125 2023-06-21 00:16:58,060 INFO [train.py:996] (2/4) Epoch 4, batch 20900, loss[loss=0.1874, simple_loss=0.2513, pruned_loss=0.06171, over 16281.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3061, pruned_loss=0.08323, over 4245002.44 frames. ], batch size: 62, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:17:02,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-21 00:17:52,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=674484.0, ans=0.125 2023-06-21 00:17:58,039 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.368e+02 2.710e+02 3.422e+02 5.410e+02, threshold=5.420e+02, percent-clipped=0.0 2023-06-21 00:18:33,408 INFO [train.py:996] (2/4) Epoch 4, batch 20950, loss[loss=0.1757, simple_loss=0.2574, pruned_loss=0.047, over 21574.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3036, pruned_loss=0.07994, over 4240528.97 frames. 
], batch size: 212, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:18:46,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=674664.0, ans=0.125 2023-06-21 00:19:11,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=674724.0, ans=0.125 2023-06-21 00:19:12,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=674724.0, ans=0.125 2023-06-21 00:19:39,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=674784.0, ans=0.2 2023-06-21 00:19:41,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=674844.0, ans=0.2 2023-06-21 00:20:07,637 INFO [train.py:996] (2/4) Epoch 4, batch 21000, loss[loss=0.2425, simple_loss=0.3104, pruned_loss=0.08727, over 21865.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3011, pruned_loss=0.08011, over 4248391.81 frames. ], batch size: 351, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:20:07,637 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 00:20:59,675 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2681, simple_loss=0.367, pruned_loss=0.0846, over 1796401.00 frames. 2023-06-21 00:20:59,676 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 00:21:25,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=674964.0, ans=0.125 2023-06-21 00:21:41,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=675024.0, ans=0.0 2023-06-21 00:22:00,270 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.300e+02 2.574e+02 2.981e+02 4.103e+02, threshold=5.148e+02, percent-clipped=0.0 2023-06-21 00:22:34,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=675204.0, ans=0.125 2023-06-21 00:22:35,782 INFO [train.py:996] (2/4) Epoch 4, batch 21050, loss[loss=0.1979, simple_loss=0.2585, pruned_loss=0.0686, over 21475.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.2996, pruned_loss=0.0804, over 4247628.58 frames. ], batch size: 212, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:22:46,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=675204.0, ans=0.1 2023-06-21 00:24:04,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=675444.0, ans=0.1 2023-06-21 00:24:14,091 INFO [train.py:996] (2/4) Epoch 4, batch 21100, loss[loss=0.2459, simple_loss=0.2974, pruned_loss=0.09719, over 21429.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2966, pruned_loss=0.07997, over 4248761.02 frames. 
], batch size: 441, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:25:28,113 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.453e+02 2.744e+02 3.185e+02 4.554e+02, threshold=5.489e+02, percent-clipped=0.0 2023-06-21 00:25:49,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=675744.0, ans=0.2 2023-06-21 00:25:54,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=675744.0, ans=0.05 2023-06-21 00:25:59,866 INFO [train.py:996] (2/4) Epoch 4, batch 21150, loss[loss=0.2248, simple_loss=0.2779, pruned_loss=0.08585, over 21876.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2925, pruned_loss=0.08014, over 4252117.63 frames. ], batch size: 107, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:26:02,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.23 vs. limit=15.0 2023-06-21 00:26:03,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=675804.0, ans=0.125 2023-06-21 00:26:33,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=675864.0, ans=0.125 2023-06-21 00:26:38,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-06-21 00:27:24,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=676044.0, ans=0.0 2023-06-21 00:27:25,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=676044.0, ans=0.0 2023-06-21 00:27:31,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=676044.0, ans=0.125 2023-06-21 00:27:40,831 INFO [train.py:996] (2/4) Epoch 4, batch 21200, loss[loss=0.1973, simple_loss=0.2661, pruned_loss=0.06421, over 21588.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2884, pruned_loss=0.07933, over 4248336.58 frames. ], batch size: 263, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:27:53,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=676104.0, ans=0.125 2023-06-21 00:28:12,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=676164.0, ans=0.125 2023-06-21 00:28:12,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=676164.0, ans=0.125 2023-06-21 00:28:41,445 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.500e+02 2.845e+02 3.451e+02 6.035e+02, threshold=5.690e+02, percent-clipped=1.0 2023-06-21 00:29:19,081 INFO [train.py:996] (2/4) Epoch 4, batch 21250, loss[loss=0.2134, simple_loss=0.2735, pruned_loss=0.07662, over 21435.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2867, pruned_loss=0.07864, over 4247862.40 frames. 
], batch size: 212, lr: 7.75e-03, grad_scale: 32.0 2023-06-21 00:29:33,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=676464.0, ans=0.125 2023-06-21 00:30:04,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=676524.0, ans=0.125 2023-06-21 00:30:07,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=676524.0, ans=0.2 2023-06-21 00:30:40,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-21 00:30:55,550 INFO [train.py:996] (2/4) Epoch 4, batch 21300, loss[loss=0.2647, simple_loss=0.3273, pruned_loss=0.101, over 21869.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2937, pruned_loss=0.08133, over 4257516.35 frames. ], batch size: 391, lr: 7.75e-03, grad_scale: 32.0 2023-06-21 00:31:06,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=15.0 2023-06-21 00:31:21,035 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:31:28,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=676764.0, ans=0.04949747468305833 2023-06-21 00:31:39,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=12.0 2023-06-21 00:31:41,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=676824.0, ans=0.05 2023-06-21 00:31:46,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=676824.0, ans=0.125 2023-06-21 00:32:01,496 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.763e+02 3.058e+02 3.413e+02 5.762e+02, threshold=6.115e+02, percent-clipped=1.0 2023-06-21 00:32:32,305 INFO [train.py:996] (2/4) Epoch 4, batch 21350, loss[loss=0.2224, simple_loss=0.3103, pruned_loss=0.06724, over 21363.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.2973, pruned_loss=0.08213, over 4263510.60 frames. ], batch size: 548, lr: 7.75e-03, grad_scale: 16.0 2023-06-21 00:32:54,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=677064.0, ans=0.0 2023-06-21 00:33:22,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-21 00:33:36,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=677184.0, ans=0.0 2023-06-21 00:33:39,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=677184.0, ans=0.1 2023-06-21 00:34:04,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-21 00:34:07,675 INFO [train.py:996] (2/4) Epoch 4, batch 21400, loss[loss=0.2422, simple_loss=0.3187, pruned_loss=0.0829, over 21745.00 frames. 
], tot_loss[loss=0.2338, simple_loss=0.302, pruned_loss=0.08283, over 4272245.45 frames. ], batch size: 247, lr: 7.75e-03, grad_scale: 16.0 2023-06-21 00:34:53,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=677364.0, ans=0.025 2023-06-21 00:35:01,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=677424.0, ans=0.125 2023-06-21 00:35:34,490 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.398e+02 2.726e+02 3.426e+02 6.163e+02, threshold=5.451e+02, percent-clipped=1.0 2023-06-21 00:36:03,775 INFO [train.py:996] (2/4) Epoch 4, batch 21450, loss[loss=0.2575, simple_loss=0.3179, pruned_loss=0.09853, over 21527.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3048, pruned_loss=0.08364, over 4277644.76 frames. ], batch size: 548, lr: 7.75e-03, grad_scale: 16.0 2023-06-21 00:36:04,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=677604.0, ans=0.125 2023-06-21 00:36:05,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=677604.0, ans=0.0 2023-06-21 00:36:10,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=677604.0, ans=0.1 2023-06-21 00:36:14,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=677604.0, ans=0.125 2023-06-21 00:36:18,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=677664.0, ans=0.125 2023-06-21 00:36:18,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=677664.0, ans=0.1 2023-06-21 00:36:43,973 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:36:53,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.10 vs. limit=15.0 2023-06-21 00:37:46,119 INFO [train.py:996] (2/4) Epoch 4, batch 21500, loss[loss=0.1969, simple_loss=0.2572, pruned_loss=0.06834, over 21654.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3039, pruned_loss=0.08475, over 4279273.07 frames. ], batch size: 247, lr: 7.74e-03, grad_scale: 16.0 2023-06-21 00:37:55,966 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:39:13,312 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 2.885e+02 3.448e+02 4.361e+02 7.505e+02, threshold=6.896e+02, percent-clipped=8.0 2023-06-21 00:39:23,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=678084.0, ans=0.125 2023-06-21 00:39:43,571 INFO [train.py:996] (2/4) Epoch 4, batch 21550, loss[loss=0.2181, simple_loss=0.26, pruned_loss=0.08813, over 20835.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2961, pruned_loss=0.08121, over 4280775.07 frames. 
], batch size: 608, lr: 7.74e-03, grad_scale: 16.0 2023-06-21 00:39:50,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=678204.0, ans=0.0 2023-06-21 00:39:55,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-21 00:40:31,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=678324.0, ans=10.0 2023-06-21 00:41:02,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=678384.0, ans=0.125 2023-06-21 00:41:17,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=678444.0, ans=0.125 2023-06-21 00:41:17,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=678444.0, ans=0.07 2023-06-21 00:41:18,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=678444.0, ans=0.0 2023-06-21 00:41:20,999 INFO [train.py:996] (2/4) Epoch 4, batch 21600, loss[loss=0.2315, simple_loss=0.3237, pruned_loss=0.06964, over 21258.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.296, pruned_loss=0.08032, over 4266245.38 frames. ], batch size: 549, lr: 7.74e-03, grad_scale: 32.0 2023-06-21 00:42:00,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-21 00:42:38,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=678624.0, ans=0.05 2023-06-21 00:42:48,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.277e+02 2.634e+02 3.164e+02 4.314e+02, threshold=5.268e+02, percent-clipped=0.0 2023-06-21 00:42:52,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=22.5 2023-06-21 00:42:52,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-21 00:42:59,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=678744.0, ans=0.0 2023-06-21 00:43:12,685 INFO [train.py:996] (2/4) Epoch 4, batch 21650, loss[loss=0.2008, simple_loss=0.2891, pruned_loss=0.05627, over 21869.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2983, pruned_loss=0.07809, over 4272341.68 frames. ], batch size: 118, lr: 7.74e-03, grad_scale: 32.0 2023-06-21 00:44:24,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. 
limit=15.0 2023-06-21 00:44:28,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=678984.0, ans=0.2 2023-06-21 00:44:59,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=679044.0, ans=0.125 2023-06-21 00:44:59,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=679044.0, ans=0.2 2023-06-21 00:45:01,845 INFO [train.py:996] (2/4) Epoch 4, batch 21700, loss[loss=0.2079, simple_loss=0.2772, pruned_loss=0.06928, over 21332.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2991, pruned_loss=0.07636, over 4273340.41 frames. ], batch size: 131, lr: 7.74e-03, grad_scale: 32.0 2023-06-21 00:45:02,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=679104.0, ans=0.125 2023-06-21 00:45:19,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=679104.0, ans=0.125 2023-06-21 00:45:25,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=679164.0, ans=0.125 2023-06-21 00:45:30,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=679164.0, ans=0.125 2023-06-21 00:45:30,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=679164.0, ans=0.125 2023-06-21 00:46:17,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=679284.0, ans=0.125 2023-06-21 00:46:24,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 2.270e+02 2.667e+02 3.310e+02 7.431e+02, threshold=5.334e+02, percent-clipped=8.0 2023-06-21 00:46:38,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=679344.0, ans=0.0 2023-06-21 00:46:46,221 INFO [train.py:996] (2/4) Epoch 4, batch 21750, loss[loss=0.2039, simple_loss=0.2658, pruned_loss=0.07101, over 21276.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2943, pruned_loss=0.07631, over 4266794.61 frames. ], batch size: 144, lr: 7.74e-03, grad_scale: 16.0 2023-06-21 00:48:44,938 INFO [train.py:996] (2/4) Epoch 4, batch 21800, loss[loss=0.2473, simple_loss=0.3183, pruned_loss=0.08812, over 21853.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.292, pruned_loss=0.0778, over 4257933.71 frames. ], batch size: 373, lr: 7.73e-03, grad_scale: 16.0 2023-06-21 00:49:34,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-21 00:50:04,854 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.568e+02 2.919e+02 3.829e+02 7.614e+02, threshold=5.838e+02, percent-clipped=7.0 2023-06-21 00:50:41,346 INFO [train.py:996] (2/4) Epoch 4, batch 21850, loss[loss=0.2023, simple_loss=0.2702, pruned_loss=0.06723, over 19929.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2983, pruned_loss=0.07889, over 4261201.22 frames. ], batch size: 702, lr: 7.73e-03, grad_scale: 16.0 2023-06-21 00:51:25,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.09 vs. 
limit=10.0 2023-06-21 00:51:37,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=680124.0, ans=0.0 2023-06-21 00:52:20,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=680244.0, ans=0.0 2023-06-21 00:52:33,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=680244.0, ans=10.0 2023-06-21 00:52:47,297 INFO [train.py:996] (2/4) Epoch 4, batch 21900, loss[loss=0.2117, simple_loss=0.279, pruned_loss=0.07221, over 21695.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2985, pruned_loss=0.07971, over 4269757.35 frames. ], batch size: 298, lr: 7.73e-03, grad_scale: 16.0 2023-06-21 00:53:29,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=680424.0, ans=0.05 2023-06-21 00:53:54,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2023-06-21 00:54:01,909 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.548e+02 2.891e+02 3.393e+02 5.559e+02, threshold=5.783e+02, percent-clipped=0.0 2023-06-21 00:54:02,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=680484.0, ans=0.125 2023-06-21 00:54:23,720 INFO [train.py:996] (2/4) Epoch 4, batch 21950, loss[loss=0.1925, simple_loss=0.2596, pruned_loss=0.06273, over 21271.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2934, pruned_loss=0.07868, over 4272400.71 frames. ], batch size: 144, lr: 7.73e-03, grad_scale: 16.0 2023-06-21 00:54:26,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=680604.0, ans=0.0 2023-06-21 00:55:09,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=680664.0, ans=0.125 2023-06-21 00:56:06,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=680844.0, ans=0.125 2023-06-21 00:56:13,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=680904.0, ans=0.1 2023-06-21 00:56:13,984 INFO [train.py:996] (2/4) Epoch 4, batch 22000, loss[loss=0.208, simple_loss=0.2712, pruned_loss=0.07237, over 21275.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2875, pruned_loss=0.07523, over 4259871.16 frames. ], batch size: 159, lr: 7.73e-03, grad_scale: 32.0 2023-06-21 00:57:31,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=681084.0, ans=0.0 2023-06-21 00:57:46,334 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 2.067e+02 2.288e+02 2.758e+02 6.986e+02, threshold=4.576e+02, percent-clipped=2.0 2023-06-21 00:58:03,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=681144.0, ans=0.125 2023-06-21 00:58:10,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=681144.0, ans=0.125 2023-06-21 00:58:35,044 INFO [train.py:996] (2/4) Epoch 4, batch 22050, loss[loss=0.2681, simple_loss=0.3448, pruned_loss=0.09573, over 21594.00 frames. 
], tot_loss[loss=0.2253, simple_loss=0.2935, pruned_loss=0.07854, over 4260539.84 frames. ], batch size: 263, lr: 7.73e-03, grad_scale: 32.0 2023-06-21 00:58:36,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.08 vs. limit=12.0 2023-06-21 00:59:03,438 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:59:20,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=681324.0, ans=0.125 2023-06-21 00:59:22,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=681324.0, ans=0.0 2023-06-21 00:59:39,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=681384.0, ans=0.125 2023-06-21 01:00:15,414 INFO [train.py:996] (2/4) Epoch 4, batch 22100, loss[loss=0.2527, simple_loss=0.3208, pruned_loss=0.09225, over 21788.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3071, pruned_loss=0.08456, over 4261181.63 frames. ], batch size: 351, lr: 7.72e-03, grad_scale: 32.0 2023-06-21 01:00:47,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=681564.0, ans=0.125 2023-06-21 01:01:22,379 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.888e+02 3.349e+02 4.124e+02 5.904e+02, threshold=6.697e+02, percent-clipped=15.0 2023-06-21 01:02:13,231 INFO [train.py:996] (2/4) Epoch 4, batch 22150, loss[loss=0.2345, simple_loss=0.3097, pruned_loss=0.07964, over 21859.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3093, pruned_loss=0.08541, over 4265609.69 frames. ], batch size: 332, lr: 7.72e-03, grad_scale: 32.0 2023-06-21 01:02:17,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=681804.0, ans=0.0 2023-06-21 01:03:18,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=681984.0, ans=0.125 2023-06-21 01:04:07,377 INFO [train.py:996] (2/4) Epoch 4, batch 22200, loss[loss=0.3534, simple_loss=0.4076, pruned_loss=0.1496, over 21646.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3109, pruned_loss=0.08652, over 4278597.50 frames. ], batch size: 508, lr: 7.72e-03, grad_scale: 16.0 2023-06-21 01:04:23,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=682104.0, ans=0.125 2023-06-21 01:04:54,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-21 01:05:00,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=682224.0, ans=0.125 2023-06-21 01:05:17,335 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.532e+02 2.842e+02 3.263e+02 4.792e+02, threshold=5.684e+02, percent-clipped=0.0 2023-06-21 01:06:03,317 INFO [train.py:996] (2/4) Epoch 4, batch 22250, loss[loss=0.2904, simple_loss=0.3742, pruned_loss=0.1033, over 21800.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3185, pruned_loss=0.08852, over 4282169.66 frames. 
], batch size: 118, lr: 7.72e-03, grad_scale: 16.0 2023-06-21 01:06:03,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=682404.0, ans=0.1 2023-06-21 01:06:40,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=682464.0, ans=0.0 2023-06-21 01:06:58,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.35 vs. limit=22.5 2023-06-21 01:07:14,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-21 01:07:43,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=682644.0, ans=15.0 2023-06-21 01:07:52,738 INFO [train.py:996] (2/4) Epoch 4, batch 22300, loss[loss=0.2708, simple_loss=0.3422, pruned_loss=0.09967, over 21861.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3192, pruned_loss=0.08967, over 4281891.01 frames. ], batch size: 107, lr: 7.72e-03, grad_scale: 16.0 2023-06-21 01:08:09,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=682704.0, ans=0.125 2023-06-21 01:08:10,644 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.54 vs. limit=10.0 2023-06-21 01:08:33,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=682764.0, ans=0.0 2023-06-21 01:09:16,188 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.961e+02 3.295e+02 3.929e+02 6.165e+02, threshold=6.589e+02, percent-clipped=1.0 2023-06-21 01:09:18,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=682884.0, ans=0.1 2023-06-21 01:09:57,911 INFO [train.py:996] (2/4) Epoch 4, batch 22350, loss[loss=0.2663, simple_loss=0.3224, pruned_loss=0.1051, over 21773.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3169, pruned_loss=0.0898, over 4295199.65 frames. ], batch size: 441, lr: 7.72e-03, grad_scale: 16.0 2023-06-21 01:10:36,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=683064.0, ans=0.125 2023-06-21 01:10:45,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=683064.0, ans=0.0 2023-06-21 01:11:09,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=683184.0, ans=0.2 2023-06-21 01:11:26,711 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:11:54,221 INFO [train.py:996] (2/4) Epoch 4, batch 22400, loss[loss=0.204, simple_loss=0.2739, pruned_loss=0.06703, over 21315.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3128, pruned_loss=0.08588, over 4292176.38 frames. 
], batch size: 211, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:11:57,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=683304.0, ans=0.1 2023-06-21 01:12:09,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-21 01:12:16,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=683364.0, ans=0.125 2023-06-21 01:12:46,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=683484.0, ans=0.125 2023-06-21 01:12:58,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 2.385e+02 2.684e+02 3.038e+02 4.758e+02, threshold=5.368e+02, percent-clipped=0.0 2023-06-21 01:13:33,329 INFO [train.py:996] (2/4) Epoch 4, batch 22450, loss[loss=0.2354, simple_loss=0.2802, pruned_loss=0.09534, over 21335.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3068, pruned_loss=0.08488, over 4274930.13 frames. ], batch size: 473, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:13:42,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=683604.0, ans=0.0 2023-06-21 01:13:56,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=683604.0, ans=0.07 2023-06-21 01:13:59,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=683664.0, ans=0.125 2023-06-21 01:15:39,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=683904.0, ans=0.0 2023-06-21 01:15:39,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=683904.0, ans=0.125 2023-06-21 01:15:40,179 INFO [train.py:996] (2/4) Epoch 4, batch 22500, loss[loss=0.2727, simple_loss=0.3613, pruned_loss=0.0921, over 21258.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3017, pruned_loss=0.08366, over 4276810.61 frames. ], batch size: 549, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:15:51,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=683904.0, ans=0.1 2023-06-21 01:16:27,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=684024.0, ans=0.125 2023-06-21 01:16:27,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=684024.0, ans=0.125 2023-06-21 01:16:37,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=684024.0, ans=0.0 2023-06-21 01:16:53,266 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.672e+02 3.079e+02 3.545e+02 7.228e+02, threshold=6.157e+02, percent-clipped=7.0 2023-06-21 01:16:53,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=684084.0, ans=0.5 2023-06-21 01:17:43,734 INFO [train.py:996] (2/4) Epoch 4, batch 22550, loss[loss=0.2219, simple_loss=0.2931, pruned_loss=0.07538, over 21843.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3026, pruned_loss=0.08383, over 4277341.78 frames. 
], batch size: 282, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:17:45,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=684204.0, ans=0.125 2023-06-21 01:18:34,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.08 vs. limit=10.0 2023-06-21 01:19:17,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-21 01:19:34,958 INFO [train.py:996] (2/4) Epoch 4, batch 22600, loss[loss=0.2266, simple_loss=0.3055, pruned_loss=0.07389, over 21076.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3054, pruned_loss=0.08385, over 4285139.95 frames. ], batch size: 607, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:20:55,352 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.560e+02 2.970e+02 3.484e+02 5.965e+02, threshold=5.939e+02, percent-clipped=0.0 2023-06-21 01:20:57,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=684684.0, ans=0.125 2023-06-21 01:21:04,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=684684.0, ans=0.2 2023-06-21 01:21:30,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=684804.0, ans=0.0 2023-06-21 01:21:31,528 INFO [train.py:996] (2/4) Epoch 4, batch 22650, loss[loss=0.2214, simple_loss=0.274, pruned_loss=0.08436, over 21260.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3034, pruned_loss=0.08352, over 4271658.60 frames. ], batch size: 548, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:21:33,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=684804.0, ans=0.2 2023-06-21 01:21:44,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-21 01:22:27,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=684984.0, ans=0.0 2023-06-21 01:22:53,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=685044.0, ans=0.2 2023-06-21 01:23:06,099 INFO [train.py:996] (2/4) Epoch 4, batch 22700, loss[loss=0.2108, simple_loss=0.2654, pruned_loss=0.07807, over 21879.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2978, pruned_loss=0.08342, over 4275308.71 frames. 
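
The grad_scale field in the loss lines (moving between values such as 16.0 and 32.0 in this stretch of the log) is the dynamic loss scale used for fp16 mixed-precision training: it is reduced when gradients overflow and is periodically grown again. Below is a minimal, self-contained sketch of the standard PyTorch loop that maintains such a scale; the tiny linear model and random data are stand-ins, and the real train.py wiring may differ.

    import torch
    from torch.cuda.amp import GradScaler, autocast

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(10, 1).to(device)          # stand-in for the acoustic model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = GradScaler(enabled=(device == "cuda"))    # maintains the value logged as grad_scale

    for _ in range(3):                                 # stand-in for the training loop
        x = torch.randn(8, 10, device=device)
        optimizer.zero_grad()
        with autocast(enabled=(device == "cuda")):     # fp16 forward pass when enabled
            loss = model(x).pow(2).mean()              # stand-in for the combined loss above
        scaler.scale(loss).backward()                  # scale the loss to avoid fp16 underflow
        scaler.step(optimizer)                         # unscales grads; skips the step on inf/nan
        scaler.update()                                # halves the scale on overflow, else grows it
    print(scaler.get_scale())                          # the number reported as grad_scale
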
], batch size: 373, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:23:17,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=685104.0, ans=0.0 2023-06-21 01:23:46,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=685224.0, ans=0.125 2023-06-21 01:23:48,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=685224.0, ans=15.0 2023-06-21 01:24:16,225 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.521e+02 2.946e+02 3.485e+02 5.279e+02, threshold=5.893e+02, percent-clipped=0.0 2023-06-21 01:24:28,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=685344.0, ans=0.125 2023-06-21 01:24:42,699 INFO [train.py:996] (2/4) Epoch 4, batch 22750, loss[loss=0.2395, simple_loss=0.2978, pruned_loss=0.09059, over 20761.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.298, pruned_loss=0.08479, over 4267447.74 frames. ], batch size: 607, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:24:44,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=685404.0, ans=0.125 2023-06-21 01:26:10,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=685584.0, ans=0.1 2023-06-21 01:26:29,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=685584.0, ans=0.125 2023-06-21 01:26:48,309 INFO [train.py:996] (2/4) Epoch 4, batch 22800, loss[loss=0.2351, simple_loss=0.3014, pruned_loss=0.0844, over 21862.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3027, pruned_loss=0.08687, over 4277484.53 frames. ], batch size: 333, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:27:00,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=685704.0, ans=0.125 2023-06-21 01:27:02,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=685764.0, ans=0.125 2023-06-21 01:27:30,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-21 01:28:02,024 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.664e+02 3.134e+02 3.732e+02 6.258e+02, threshold=6.268e+02, percent-clipped=2.0 2023-06-21 01:28:27,903 INFO [train.py:996] (2/4) Epoch 4, batch 22850, loss[loss=0.2104, simple_loss=0.2685, pruned_loss=0.07611, over 21697.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.2995, pruned_loss=0.08605, over 4276639.71 frames. ], batch size: 283, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:28:59,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=686064.0, ans=0.125 2023-06-21 01:29:42,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.46 vs. limit=22.5 2023-06-21 01:30:18,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.68 vs. 
limit=15.0 2023-06-21 01:30:22,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=686244.0, ans=0.125 2023-06-21 01:30:32,305 INFO [train.py:996] (2/4) Epoch 4, batch 22900, loss[loss=0.2265, simple_loss=0.3155, pruned_loss=0.06877, over 21422.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.302, pruned_loss=0.08553, over 4266077.43 frames. ], batch size: 211, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:31:13,380 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-21 01:31:16,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=686364.0, ans=0.125 2023-06-21 01:31:53,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=686424.0, ans=0.125 2023-06-21 01:32:28,758 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.538e+02 2.990e+02 3.640e+02 6.063e+02, threshold=5.980e+02, percent-clipped=0.0 2023-06-21 01:32:30,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=686484.0, ans=0.0 2023-06-21 01:32:53,726 INFO [train.py:996] (2/4) Epoch 4, batch 22950, loss[loss=0.2667, simple_loss=0.3924, pruned_loss=0.07051, over 21284.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3143, pruned_loss=0.08377, over 4263089.56 frames. ], batch size: 548, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:33:08,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=686604.0, ans=0.04949747468305833 2023-06-21 01:33:11,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=686604.0, ans=0.2 2023-06-21 01:33:13,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-21 01:33:13,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=686604.0, ans=0.125 2023-06-21 01:33:39,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=686664.0, ans=0.125 2023-06-21 01:34:00,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=686724.0, ans=0.125 2023-06-21 01:34:36,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=686844.0, ans=0.1 2023-06-21 01:35:05,436 INFO [train.py:996] (2/4) Epoch 4, batch 23000, loss[loss=0.2208, simple_loss=0.2869, pruned_loss=0.07732, over 21832.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3137, pruned_loss=0.08154, over 4270236.09 frames. 
], batch size: 282, lr: 7.69e-03, grad_scale: 16.0 2023-06-21 01:36:32,532 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 2.398e+02 2.749e+02 3.269e+02 6.833e+02, threshold=5.498e+02, percent-clipped=2.0 2023-06-21 01:36:40,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=687144.0, ans=0.2 2023-06-21 01:36:51,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=687144.0, ans=0.0 2023-06-21 01:37:10,576 INFO [train.py:996] (2/4) Epoch 4, batch 23050, loss[loss=0.2523, simple_loss=0.3236, pruned_loss=0.09049, over 21336.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3155, pruned_loss=0.08378, over 4272368.19 frames. ], batch size: 159, lr: 7.69e-03, grad_scale: 16.0 2023-06-21 01:37:14,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=687204.0, ans=15.0 2023-06-21 01:38:39,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=687444.0, ans=0.1 2023-06-21 01:39:05,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=687444.0, ans=0.2 2023-06-21 01:39:10,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=687444.0, ans=0.2 2023-06-21 01:39:15,335 INFO [train.py:996] (2/4) Epoch 4, batch 23100, loss[loss=0.1998, simple_loss=0.256, pruned_loss=0.07183, over 21324.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3104, pruned_loss=0.08412, over 4268381.56 frames. ], batch size: 194, lr: 7.69e-03, grad_scale: 16.0 2023-06-21 01:39:15,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=687504.0, ans=0.125 2023-06-21 01:39:25,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=687504.0, ans=0.125 2023-06-21 01:40:36,572 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.504e+02 2.792e+02 3.257e+02 4.712e+02, threshold=5.583e+02, percent-clipped=0.0 2023-06-21 01:41:08,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=687744.0, ans=0.125 2023-06-21 01:41:10,867 INFO [train.py:996] (2/4) Epoch 4, batch 23150, loss[loss=0.2615, simple_loss=0.3102, pruned_loss=0.1064, over 21593.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3047, pruned_loss=0.08358, over 4271436.03 frames. ], batch size: 548, lr: 7.69e-03, grad_scale: 16.0 2023-06-21 01:41:52,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=687864.0, ans=0.1 2023-06-21 01:43:02,803 INFO [train.py:996] (2/4) Epoch 4, batch 23200, loss[loss=0.1886, simple_loss=0.2708, pruned_loss=0.05324, over 19896.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3046, pruned_loss=0.08449, over 4284426.93 frames. 
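
The optim.py "Clipping_scale=2.0, grad-norm quartiles ..." lines list the min/25%/median/75%/max of recently observed gradient norms; note that the logged threshold consistently equals Clipping_scale times the median (e.g. above, 2.0 * 2.749e+02 = 5.498e+02), and percent-clipped is the fraction of recent batches whose norm exceeded it. The sketch below illustrates that kind of median-based clipping; it is not the exact optimizer code, and recent_norms stands in for whatever norm history the optimizer keeps.

    import torch

    def clip_with_median_threshold(params, recent_norms, clipping_scale=2.0):
        norms = torch.tensor(recent_norms, dtype=torch.float32)
        quartiles = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * quartiles[2]          # 2.0 x median, matching the log
        total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm=float(threshold))
        was_clipped = bool(total_norm > threshold)         # feeds the percent-clipped statistic
        return quartiles, float(threshold), was_clipped

    # Using the quartiles logged above as a fake norm history reproduces threshold=5.498e+02:
    p = torch.nn.Parameter(torch.randn(5))
    p.grad = torch.randn(5) * 100.0
    print(clip_with_median_threshold([p], recent_norms=[163.4, 239.8, 274.9, 326.9, 683.3]))
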
], batch size: 703, lr: 7.69e-03, grad_scale: 32.0 2023-06-21 01:43:04,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=688104.0, ans=0.0 2023-06-21 01:43:26,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=688164.0, ans=0.0 2023-06-21 01:44:22,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=688284.0, ans=0.0 2023-06-21 01:44:27,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.558e+02 3.014e+02 3.330e+02 5.283e+02, threshold=6.028e+02, percent-clipped=0.0 2023-06-21 01:44:28,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=688284.0, ans=0.0 2023-06-21 01:44:35,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=688344.0, ans=0.0 2023-06-21 01:45:06,672 INFO [train.py:996] (2/4) Epoch 4, batch 23250, loss[loss=0.2384, simple_loss=0.3069, pruned_loss=0.08492, over 21465.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3054, pruned_loss=0.08652, over 4287769.02 frames. ], batch size: 211, lr: 7.69e-03, grad_scale: 32.0 2023-06-21 01:46:06,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=688464.0, ans=0.125 2023-06-21 01:46:08,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=688464.0, ans=0.0 2023-06-21 01:46:21,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=688524.0, ans=0.02 2023-06-21 01:47:17,191 INFO [train.py:996] (2/4) Epoch 4, batch 23300, loss[loss=0.2628, simple_loss=0.3784, pruned_loss=0.0736, over 20815.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3144, pruned_loss=0.08912, over 4290497.43 frames. ], batch size: 607, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:47:57,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=688764.0, ans=0.0 2023-06-21 01:48:00,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=688764.0, ans=0.125 2023-06-21 01:48:15,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=688824.0, ans=0.125 2023-06-21 01:48:42,117 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:48:48,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-06-21 01:48:50,548 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.610e+02 2.927e+02 3.361e+02 4.958e+02, threshold=5.855e+02, percent-clipped=0.0 2023-06-21 01:49:20,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-21 01:49:30,665 INFO [train.py:996] (2/4) Epoch 4, batch 23350, loss[loss=0.1831, simple_loss=0.2679, pruned_loss=0.04919, over 21789.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3197, pruned_loss=0.08821, over 4282642.23 frames. 
], batch size: 316, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:49:58,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=689064.0, ans=0.125 2023-06-21 01:49:58,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=689064.0, ans=0.125 2023-06-21 01:50:24,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=689184.0, ans=0.1 2023-06-21 01:51:02,923 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:51:10,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=689244.0, ans=0.0 2023-06-21 01:51:21,100 INFO [train.py:996] (2/4) Epoch 4, batch 23400, loss[loss=0.2323, simple_loss=0.3037, pruned_loss=0.08039, over 15447.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3124, pruned_loss=0.08412, over 4283199.05 frames. ], batch size: 61, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:51:31,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=689304.0, ans=0.125 2023-06-21 01:51:43,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=689364.0, ans=0.125 2023-06-21 01:52:01,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=689364.0, ans=0.125 2023-06-21 01:52:11,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=689424.0, ans=0.1 2023-06-21 01:52:58,683 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.326e+02 2.655e+02 3.143e+02 5.119e+02, threshold=5.310e+02, percent-clipped=0.0 2023-06-21 01:52:59,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=689484.0, ans=0.125 2023-06-21 01:53:17,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=689544.0, ans=0.95 2023-06-21 01:53:33,732 INFO [train.py:996] (2/4) Epoch 4, batch 23450, loss[loss=0.2649, simple_loss=0.3294, pruned_loss=0.1002, over 21949.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3134, pruned_loss=0.08666, over 4286914.61 frames. ], batch size: 316, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:53:49,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=689664.0, ans=0.125 2023-06-21 01:54:22,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=689724.0, ans=0.125 2023-06-21 01:55:20,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=689844.0, ans=0.125 2023-06-21 01:55:27,871 INFO [train.py:996] (2/4) Epoch 4, batch 23500, loss[loss=0.2075, simple_loss=0.2735, pruned_loss=0.0707, over 21630.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3127, pruned_loss=0.08836, over 4291678.53 frames. 
], batch size: 263, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:55:31,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=689904.0, ans=0.0 2023-06-21 01:55:38,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=689904.0, ans=0.2 2023-06-21 01:55:42,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=689964.0, ans=0.2 2023-06-21 01:55:48,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=689964.0, ans=0.125 2023-06-21 01:56:23,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=690084.0, ans=0.125 2023-06-21 01:56:28,879 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.673e+02 3.044e+02 3.448e+02 4.722e+02, threshold=6.088e+02, percent-clipped=0.0 2023-06-21 01:56:30,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=690084.0, ans=0.0 2023-06-21 01:57:04,421 INFO [train.py:996] (2/4) Epoch 4, batch 23550, loss[loss=0.2277, simple_loss=0.296, pruned_loss=0.07972, over 21374.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3069, pruned_loss=0.08756, over 4285234.29 frames. ], batch size: 131, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:57:04,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=690204.0, ans=0.025 2023-06-21 01:57:07,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=690204.0, ans=0.035 2023-06-21 01:57:07,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=690204.0, ans=0.0 2023-06-21 01:57:12,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=690204.0, ans=0.0 2023-06-21 01:57:15,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=690204.0, ans=0.07 2023-06-21 01:58:09,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=690324.0, ans=0.0 2023-06-21 01:59:09,019 INFO [train.py:996] (2/4) Epoch 4, batch 23600, loss[loss=0.2652, simple_loss=0.3337, pruned_loss=0.09831, over 21706.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3083, pruned_loss=0.08667, over 4273809.39 frames. ], batch size: 351, lr: 7.67e-03, grad_scale: 32.0 2023-06-21 01:59:46,442 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:00:01,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=22.5 2023-06-21 02:00:52,714 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.585e+02 3.175e+02 3.942e+02 6.182e+02, threshold=6.349e+02, percent-clipped=1.0 2023-06-21 02:01:13,087 INFO [train.py:996] (2/4) Epoch 4, batch 23650, loss[loss=0.2099, simple_loss=0.2866, pruned_loss=0.06664, over 21458.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3078, pruned_loss=0.08496, over 4265703.86 frames. 
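
The scaling.py ScheduledFloat lines record hyper-parameters (dropout probabilities, skip rates, balancer probabilities, whitening limits, ...) whose reported value "ans" is a function of batch_count rather than a constant. A generic piecewise-linear schedule of that kind is sketched below; the breakpoints are made up for illustration and are not the ones defined in scaling.py.

    def scheduled_float(batch_count, points):
        """points: [(batch_count, value), ...] in increasing batch_count order."""
        x0, y0 = points[0]
        if batch_count <= x0:
            return y0
        for x1, y1 in points[1:]:
            if batch_count <= x1:
                # linear interpolation between neighbouring breakpoints
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
            x0, y0 = x1, y1
        return y0                          # past the last breakpoint: hold the final value

    # e.g. a dropout that decays from 0.3 to 0.1 over the first 20k batches would already
    # sit at its final value (ans=0.1) at the batch counts seen in this part of the log:
    print(scheduled_float(690000.0, [(0.0, 0.3), (20000.0, 0.1)]))   # -> 0.1
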
], batch size: 211, lr: 7.67e-03, grad_scale: 32.0 2023-06-21 02:01:22,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=690804.0, ans=0.125 2023-06-21 02:02:05,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=690924.0, ans=0.1 2023-06-21 02:02:23,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-21 02:02:27,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-21 02:02:30,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=690984.0, ans=0.04949747468305833 2023-06-21 02:02:51,052 INFO [train.py:996] (2/4) Epoch 4, batch 23700, loss[loss=0.2165, simple_loss=0.2855, pruned_loss=0.0737, over 21307.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.31, pruned_loss=0.08437, over 4273683.06 frames. ], batch size: 159, lr: 7.67e-03, grad_scale: 32.0 2023-06-21 02:04:09,241 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.529e+02 2.918e+02 3.500e+02 6.134e+02, threshold=5.836e+02, percent-clipped=0.0 2023-06-21 02:04:36,873 INFO [train.py:996] (2/4) Epoch 4, batch 23750, loss[loss=0.2463, simple_loss=0.3179, pruned_loss=0.08735, over 21701.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3139, pruned_loss=0.08514, over 4271691.36 frames. ], batch size: 351, lr: 7.67e-03, grad_scale: 16.0 2023-06-21 02:06:39,762 INFO [train.py:996] (2/4) Epoch 4, batch 23800, loss[loss=0.309, simple_loss=0.3917, pruned_loss=0.1131, over 21626.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3123, pruned_loss=0.08319, over 4270472.78 frames. ], batch size: 414, lr: 7.67e-03, grad_scale: 16.0 2023-06-21 02:06:58,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=691704.0, ans=0.125 2023-06-21 02:07:26,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=691764.0, ans=0.125 2023-06-21 02:07:47,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=691824.0, ans=0.2 2023-06-21 02:07:58,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-21 02:08:17,846 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.570e+02 3.095e+02 3.507e+02 5.751e+02, threshold=6.189e+02, percent-clipped=0.0 2023-06-21 02:08:24,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=691944.0, ans=0.05 2023-06-21 02:08:24,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=691944.0, ans=0.125 2023-06-21 02:08:28,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.98 vs. limit=12.0 2023-06-21 02:09:00,380 INFO [train.py:996] (2/4) Epoch 4, batch 23850, loss[loss=0.2466, simple_loss=0.3246, pruned_loss=0.0843, over 21673.00 frames. 
], tot_loss[loss=0.2458, simple_loss=0.3207, pruned_loss=0.08543, over 4260827.14 frames. ], batch size: 351, lr: 7.67e-03, grad_scale: 16.0 2023-06-21 02:09:12,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=692004.0, ans=0.125 2023-06-21 02:09:18,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=692004.0, ans=0.2 2023-06-21 02:09:25,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=692064.0, ans=0.0 2023-06-21 02:10:00,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=692124.0, ans=0.05 2023-06-21 02:10:41,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=692304.0, ans=0.2 2023-06-21 02:10:42,687 INFO [train.py:996] (2/4) Epoch 4, batch 23900, loss[loss=0.253, simple_loss=0.3218, pruned_loss=0.09213, over 21282.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3281, pruned_loss=0.08815, over 4262851.74 frames. ], batch size: 159, lr: 7.66e-03, grad_scale: 16.0 2023-06-21 02:11:31,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=692424.0, ans=0.125 2023-06-21 02:12:06,799 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.492e+02 2.802e+02 3.240e+02 5.431e+02, threshold=5.603e+02, percent-clipped=0.0 2023-06-21 02:12:35,444 INFO [train.py:996] (2/4) Epoch 4, batch 23950, loss[loss=0.2354, simple_loss=0.3054, pruned_loss=0.08267, over 21673.00 frames. ], tot_loss[loss=0.249, simple_loss=0.322, pruned_loss=0.08803, over 4261823.65 frames. ], batch size: 298, lr: 7.66e-03, grad_scale: 16.0 2023-06-21 02:13:03,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=692664.0, ans=0.125 2023-06-21 02:13:17,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-21 02:14:03,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-06-21 02:14:43,735 INFO [train.py:996] (2/4) Epoch 4, batch 24000, loss[loss=0.2785, simple_loss=0.3403, pruned_loss=0.1083, over 21417.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3239, pruned_loss=0.09142, over 4266304.50 frames. ], batch size: 549, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:14:43,736 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 02:15:40,420 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.268, simple_loss=0.3653, pruned_loss=0.08536, over 1796401.00 frames. 2023-06-21 02:15:40,422 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 02:16:19,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=693024.0, ans=0.0 2023-06-21 02:16:56,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.654e+02 3.041e+02 3.636e+02 5.941e+02, threshold=6.083e+02, percent-clipped=2.0 2023-06-21 02:17:26,610 INFO [train.py:996] (2/4) Epoch 4, batch 24050, loss[loss=0.22, simple_loss=0.3169, pruned_loss=0.06156, over 20845.00 frames. 
], tot_loss[loss=0.2526, simple_loss=0.324, pruned_loss=0.09062, over 4271511.02 frames. ], batch size: 607, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:18:05,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=693264.0, ans=0.0 2023-06-21 02:18:14,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=693264.0, ans=0.125 2023-06-21 02:18:39,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=693384.0, ans=0.07 2023-06-21 02:19:01,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=693384.0, ans=0.125 2023-06-21 02:19:07,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=693444.0, ans=0.0 2023-06-21 02:19:19,533 INFO [train.py:996] (2/4) Epoch 4, batch 24100, loss[loss=0.3144, simple_loss=0.3686, pruned_loss=0.1301, over 21770.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3224, pruned_loss=0.08803, over 4281150.05 frames. ], batch size: 441, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:20:40,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=693624.0, ans=0.1 2023-06-21 02:20:41,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=693624.0, ans=0.125 2023-06-21 02:21:09,278 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.445e+02 2.792e+02 3.280e+02 5.618e+02, threshold=5.584e+02, percent-clipped=0.0 2023-06-21 02:21:19,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=693744.0, ans=0.125 2023-06-21 02:21:27,718 INFO [train.py:996] (2/4) Epoch 4, batch 24150, loss[loss=0.2449, simple_loss=0.3048, pruned_loss=0.09248, over 21454.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3228, pruned_loss=0.09033, over 4288461.40 frames. 
], batch size: 194, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:21:28,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=693804.0, ans=0.1 2023-06-21 02:21:39,805 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:21:44,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=693804.0, ans=0.1 2023-06-21 02:21:47,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=693864.0, ans=0.05 2023-06-21 02:22:20,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=693924.0, ans=0.125 2023-06-21 02:22:21,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=693924.0, ans=0.2 2023-06-21 02:23:26,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=694044.0, ans=0.0 2023-06-21 02:23:36,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=694044.0, ans=0.1 2023-06-21 02:23:39,055 INFO [train.py:996] (2/4) Epoch 4, batch 24200, loss[loss=0.2497, simple_loss=0.3255, pruned_loss=0.08699, over 21594.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3263, pruned_loss=0.09215, over 4292405.50 frames. ], batch size: 230, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:23:44,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=694104.0, ans=0.025 2023-06-21 02:25:13,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=694284.0, ans=0.0 2023-06-21 02:25:14,760 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.339e+02 2.793e+02 3.596e+02 6.031e+02, threshold=5.587e+02, percent-clipped=1.0 2023-06-21 02:25:38,084 INFO [train.py:996] (2/4) Epoch 4, batch 24250, loss[loss=0.1923, simple_loss=0.2752, pruned_loss=0.05467, over 21224.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3226, pruned_loss=0.08512, over 4287451.36 frames. ], batch size: 143, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:25:49,242 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:27:14,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=694584.0, ans=0.2 2023-06-21 02:27:29,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.18 vs. limit=15.0 2023-06-21 02:27:40,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.05 vs. limit=6.0 2023-06-21 02:28:01,384 INFO [train.py:996] (2/4) Epoch 4, batch 24300, loss[loss=0.1827, simple_loss=0.2556, pruned_loss=0.05493, over 21881.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3153, pruned_loss=0.0795, over 4281160.12 frames. 
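
The scaling.py Whitening lines compare a per-module "metric" against a "limit": a metric of this kind measures how far the channel covariance of a module's output is from a scaled identity, reaching 1.0 when the features are fully whitened and growing as the channels become more correlated, with the limit being the value the module is encouraged to stay under. The sketch below shows one such covariance-flatness metric; the exact formula and the penalty mechanism in scaling.py may differ, and the random input is purely illustrative.

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels); returns >= 1.0, with 1.0 meaning
        # the channel covariance is proportional to the identity.
        x = x - x.mean(dim=0)
        cov = (x.t() @ x) / x.shape[0]                    # channel covariance C
        num_channels = cov.shape[0]
        mean_eig = torch.diagonal(cov).mean()             # trace(C) / num_channels
        mean_sq_eig = (cov * cov).sum() / num_channels    # trace(C^2) / num_channels
        return mean_sq_eig / (mean_eig * mean_eig + 1e-20)

    x = torch.randn(1000, 256)                            # nearly white features
    print(whitening_metric(x))                            # ~1.0, well under a limit like 15.0
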
], batch size: 107, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:28:12,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=694704.0, ans=0.125 2023-06-21 02:28:16,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=694704.0, ans=0.2 2023-06-21 02:28:22,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=694764.0, ans=0.125 2023-06-21 02:29:19,389 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 2.168e+02 3.072e+02 4.276e+02 8.509e+02, threshold=6.143e+02, percent-clipped=10.0 2023-06-21 02:29:38,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=694944.0, ans=0.5 2023-06-21 02:29:56,143 INFO [train.py:996] (2/4) Epoch 4, batch 24350, loss[loss=0.2199, simple_loss=0.2699, pruned_loss=0.085, over 20185.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3118, pruned_loss=0.07982, over 4286741.45 frames. ], batch size: 702, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:30:01,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=695004.0, ans=0.0 2023-06-21 02:30:59,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.55 vs. limit=15.0 2023-06-21 02:32:00,370 INFO [train.py:996] (2/4) Epoch 4, batch 24400, loss[loss=0.2549, simple_loss=0.3285, pruned_loss=0.09067, over 21721.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3151, pruned_loss=0.08324, over 4285940.07 frames. ], batch size: 333, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:32:15,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=695304.0, ans=0.125 2023-06-21 02:32:45,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=695424.0, ans=0.1 2023-06-21 02:32:45,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=695424.0, ans=0.125 2023-06-21 02:33:14,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-21 02:33:25,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=695484.0, ans=0.125 2023-06-21 02:33:26,744 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.702e+02 3.028e+02 3.520e+02 5.844e+02, threshold=6.057e+02, percent-clipped=0.0 2023-06-21 02:33:54,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=695544.0, ans=0.1 2023-06-21 02:33:58,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=695604.0, ans=0.2 2023-06-21 02:33:59,837 INFO [train.py:996] (2/4) Epoch 4, batch 24450, loss[loss=0.2416, simple_loss=0.3309, pruned_loss=0.07619, over 21753.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3199, pruned_loss=0.08567, over 4279545.48 frames. 
], batch size: 332, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:35:50,227 INFO [train.py:996] (2/4) Epoch 4, batch 24500, loss[loss=0.2463, simple_loss=0.3148, pruned_loss=0.08892, over 21850.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.319, pruned_loss=0.08579, over 4275008.96 frames. ], batch size: 107, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:36:49,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=696024.0, ans=0.125 2023-06-21 02:37:00,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=696084.0, ans=0.125 2023-06-21 02:37:18,795 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.487e+02 2.699e+02 3.128e+02 4.356e+02, threshold=5.399e+02, percent-clipped=0.0 2023-06-21 02:37:19,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=696144.0, ans=0.0 2023-06-21 02:38:03,339 INFO [train.py:996] (2/4) Epoch 4, batch 24550, loss[loss=0.2187, simple_loss=0.2684, pruned_loss=0.08452, over 20198.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3209, pruned_loss=0.0882, over 4273759.57 frames. ], batch size: 703, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:38:12,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=696204.0, ans=0.1 2023-06-21 02:38:24,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=696264.0, ans=0.2 2023-06-21 02:38:26,276 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.71 vs. limit=15.0 2023-06-21 02:38:46,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=696324.0, ans=0.125 2023-06-21 02:38:48,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-21 02:39:20,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=696384.0, ans=0.0 2023-06-21 02:39:33,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=696444.0, ans=0.125 2023-06-21 02:39:57,896 INFO [train.py:996] (2/4) Epoch 4, batch 24600, loss[loss=0.2328, simple_loss=0.3012, pruned_loss=0.08217, over 21814.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.316, pruned_loss=0.08878, over 4266341.12 frames. ], batch size: 372, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:40:03,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=696504.0, ans=0.0 2023-06-21 02:40:10,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=696504.0, ans=0.125 2023-06-21 02:40:13,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. 
limit=22.5 2023-06-21 02:40:30,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=696564.0, ans=0.0 2023-06-21 02:40:53,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=696624.0, ans=0.2 2023-06-21 02:41:15,584 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.636e+02 3.077e+02 3.640e+02 5.149e+02, threshold=6.153e+02, percent-clipped=0.0 2023-06-21 02:41:49,319 INFO [train.py:996] (2/4) Epoch 4, batch 24650, loss[loss=0.1856, simple_loss=0.2544, pruned_loss=0.05837, over 21668.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3108, pruned_loss=0.08785, over 4269567.98 frames. ], batch size: 282, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:42:26,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=696864.0, ans=0.125 2023-06-21 02:42:30,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=696924.0, ans=0.0 2023-06-21 02:42:48,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=696924.0, ans=0.1 2023-06-21 02:43:24,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=697044.0, ans=0.125 2023-06-21 02:43:42,984 INFO [train.py:996] (2/4) Epoch 4, batch 24700, loss[loss=0.2102, simple_loss=0.2837, pruned_loss=0.06836, over 21137.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3086, pruned_loss=0.08579, over 4257896.72 frames. ], batch size: 176, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:44:46,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=697224.0, ans=0.125 2023-06-21 02:44:46,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=697224.0, ans=0.2 2023-06-21 02:45:02,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=697284.0, ans=0.0 2023-06-21 02:45:05,998 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.595e+02 3.081e+02 3.646e+02 7.591e+02, threshold=6.163e+02, percent-clipped=1.0 2023-06-21 02:45:37,774 INFO [train.py:996] (2/4) Epoch 4, batch 24750, loss[loss=0.2238, simple_loss=0.2811, pruned_loss=0.08322, over 21911.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3016, pruned_loss=0.08324, over 4269772.92 frames. ], batch size: 373, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:47:00,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=697584.0, ans=0.0 2023-06-21 02:47:16,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=697644.0, ans=0.5 2023-06-21 02:47:49,842 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:47:52,375 INFO [train.py:996] (2/4) Epoch 4, batch 24800, loss[loss=0.1969, simple_loss=0.2618, pruned_loss=0.066, over 21559.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2961, pruned_loss=0.08304, over 4278916.02 frames. 
], batch size: 132, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:48:01,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=697704.0, ans=0.125 2023-06-21 02:48:08,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-21 02:48:25,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=697764.0, ans=0.125 2023-06-21 02:48:47,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=697884.0, ans=0.125 2023-06-21 02:48:50,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=697884.0, ans=0.125 2023-06-21 02:49:02,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.444e+02 2.860e+02 3.241e+02 4.627e+02, threshold=5.721e+02, percent-clipped=0.0 2023-06-21 02:49:17,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=697944.0, ans=0.125 2023-06-21 02:49:18,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=697944.0, ans=12.0 2023-06-21 02:49:29,179 INFO [train.py:996] (2/4) Epoch 4, batch 24850, loss[loss=0.2806, simple_loss=0.3405, pruned_loss=0.1103, over 21591.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2965, pruned_loss=0.08408, over 4280160.99 frames. ], batch size: 471, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:49:48,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=698004.0, ans=0.125 2023-06-21 02:50:47,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=698184.0, ans=0.0 2023-06-21 02:51:03,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=698184.0, ans=0.125 2023-06-21 02:51:28,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=698244.0, ans=0.125 2023-06-21 02:51:28,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=698244.0, ans=0.125 2023-06-21 02:51:32,391 INFO [train.py:996] (2/4) Epoch 4, batch 24900, loss[loss=0.2668, simple_loss=0.3392, pruned_loss=0.09724, over 21701.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2974, pruned_loss=0.08363, over 4284452.44 frames. ], batch size: 351, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:51:41,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=698304.0, ans=0.035 2023-06-21 02:52:03,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=698364.0, ans=22.5 2023-06-21 02:52:44,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=12.0 2023-06-21 02:53:09,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=698484.0, ans=0.125 2023-06-21 02:53:10,794 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.670e+02 3.202e+02 3.948e+02 7.205e+02, threshold=6.404e+02, percent-clipped=5.0 2023-06-21 02:53:13,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=698544.0, ans=12.0 2023-06-21 02:53:38,984 INFO [train.py:996] (2/4) Epoch 4, batch 24950, loss[loss=0.3292, simple_loss=0.3788, pruned_loss=0.1398, over 21455.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3065, pruned_loss=0.0885, over 4280366.46 frames. ], batch size: 471, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:53:56,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=698604.0, ans=0.125 2023-06-21 02:55:04,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-21 02:55:25,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=698844.0, ans=0.125 2023-06-21 02:55:32,504 INFO [train.py:996] (2/4) Epoch 4, batch 25000, loss[loss=0.2178, simple_loss=0.288, pruned_loss=0.07377, over 21628.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3128, pruned_loss=0.08964, over 4275392.40 frames. ], batch size: 298, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:56:18,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-21 02:57:02,184 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.552e+02 2.956e+02 3.917e+02 9.098e+02, threshold=5.911e+02, percent-clipped=2.0 2023-06-21 02:57:19,773 INFO [train.py:996] (2/4) Epoch 4, batch 25050, loss[loss=0.2163, simple_loss=0.2829, pruned_loss=0.0749, over 21794.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3058, pruned_loss=0.08714, over 4259671.45 frames. ], batch size: 352, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:57:27,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=699204.0, ans=10.0 2023-06-21 02:57:30,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=699204.0, ans=0.0 2023-06-21 02:58:50,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=22.5 2023-06-21 02:59:25,201 INFO [train.py:996] (2/4) Epoch 4, batch 25100, loss[loss=0.1964, simple_loss=0.2506, pruned_loss=0.07107, over 20754.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3001, pruned_loss=0.0862, over 4251802.14 frames. ], batch size: 608, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 02:59:45,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=699564.0, ans=0.125 2023-06-21 02:59:52,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.28 vs. 
limit=22.5 2023-06-21 03:00:46,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.444e+02 2.736e+02 3.224e+02 6.054e+02, threshold=5.473e+02, percent-clipped=1.0 2023-06-21 03:00:58,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=699744.0, ans=10.0 2023-06-21 03:01:08,941 INFO [train.py:996] (2/4) Epoch 4, batch 25150, loss[loss=0.2384, simple_loss=0.3129, pruned_loss=0.08199, over 21454.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3043, pruned_loss=0.08402, over 4253739.02 frames. ], batch size: 131, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:01:22,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=699864.0, ans=0.09899494936611666 2023-06-21 03:01:46,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=699924.0, ans=0.1 2023-06-21 03:02:23,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.30 vs. limit=5.0 2023-06-21 03:02:42,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=700044.0, ans=0.2 2023-06-21 03:02:45,531 INFO [train.py:996] (2/4) Epoch 4, batch 25200, loss[loss=0.242, simple_loss=0.3244, pruned_loss=0.07981, over 21692.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3035, pruned_loss=0.08156, over 4250802.23 frames. ], batch size: 389, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:02:46,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=700104.0, ans=0.125 2023-06-21 03:02:57,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.03 vs. limit=15.0 2023-06-21 03:03:19,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-21 03:03:23,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=15.0 2023-06-21 03:04:01,010 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 2.161e+02 2.477e+02 2.831e+02 4.094e+02, threshold=4.954e+02, percent-clipped=0.0 2023-06-21 03:04:23,340 INFO [train.py:996] (2/4) Epoch 4, batch 25250, loss[loss=0.2317, simple_loss=0.2968, pruned_loss=0.08328, over 21768.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3016, pruned_loss=0.07967, over 4263502.12 frames. 
], batch size: 102, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:04:31,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=700404.0, ans=0.0 2023-06-21 03:05:01,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=700524.0, ans=0.125 2023-06-21 03:05:14,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=700524.0, ans=0.125 2023-06-21 03:05:36,404 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:06:01,003 INFO [train.py:996] (2/4) Epoch 4, batch 25300, loss[loss=0.1942, simple_loss=0.2759, pruned_loss=0.05628, over 20796.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.299, pruned_loss=0.07866, over 4259425.73 frames. ], batch size: 608, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:06:16,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-21 03:06:39,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-21 03:06:42,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-21 03:07:02,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=700824.0, ans=0.0 2023-06-21 03:07:14,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-21 03:07:35,702 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.516e+02 3.011e+02 3.707e+02 6.335e+02, threshold=6.023e+02, percent-clipped=10.0 2023-06-21 03:07:39,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=700944.0, ans=0.125 2023-06-21 03:08:02,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=700944.0, ans=0.2 2023-06-21 03:08:04,603 INFO [train.py:996] (2/4) Epoch 4, batch 25350, loss[loss=0.1789, simple_loss=0.259, pruned_loss=0.04937, over 21522.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3027, pruned_loss=0.07953, over 4257625.56 frames. 
], batch size: 230, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:08:31,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=701064.0, ans=0.125 2023-06-21 03:08:32,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=701064.0, ans=0.125 2023-06-21 03:08:32,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=701064.0, ans=0.0 2023-06-21 03:08:47,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=701064.0, ans=0.05 2023-06-21 03:08:47,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=701064.0, ans=0.0 2023-06-21 03:09:04,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=701124.0, ans=0.125 2023-06-21 03:09:15,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=701184.0, ans=0.125 2023-06-21 03:09:46,644 INFO [train.py:996] (2/4) Epoch 4, batch 25400, loss[loss=0.2024, simple_loss=0.2693, pruned_loss=0.06773, over 21551.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3001, pruned_loss=0.07949, over 4261961.59 frames. ], batch size: 263, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:09:59,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=701364.0, ans=0.125 2023-06-21 03:11:09,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=701484.0, ans=0.025 2023-06-21 03:11:13,892 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.377e+02 2.648e+02 3.077e+02 4.908e+02, threshold=5.297e+02, percent-clipped=0.0 2023-06-21 03:11:30,295 INFO [train.py:996] (2/4) Epoch 4, batch 25450, loss[loss=0.2125, simple_loss=0.3036, pruned_loss=0.06068, over 21827.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3003, pruned_loss=0.08107, over 4271800.93 frames. ], batch size: 282, lr: 7.61e-03, grad_scale: 32.0 2023-06-21 03:11:39,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=701604.0, ans=0.0 2023-06-21 03:12:25,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=701724.0, ans=0.125 2023-06-21 03:12:52,677 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:12:52,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=701784.0, ans=0.0 2023-06-21 03:13:31,377 INFO [train.py:996] (2/4) Epoch 4, batch 25500, loss[loss=0.2117, simple_loss=0.3014, pruned_loss=0.061, over 21691.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2993, pruned_loss=0.07756, over 4253461.53 frames. 
], batch size: 263, lr: 7.61e-03, grad_scale: 32.0 2023-06-21 03:13:31,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=701904.0, ans=0.04949747468305833 2023-06-21 03:14:09,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=701964.0, ans=0.125 2023-06-21 03:14:28,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-21 03:15:00,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=702084.0, ans=0.125 2023-06-21 03:15:04,303 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.421e+02 2.742e+02 3.267e+02 5.395e+02, threshold=5.484e+02, percent-clipped=1.0 2023-06-21 03:15:41,949 INFO [train.py:996] (2/4) Epoch 4, batch 25550, loss[loss=0.2095, simple_loss=0.303, pruned_loss=0.05799, over 21419.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.305, pruned_loss=0.07701, over 4252863.69 frames. ], batch size: 211, lr: 7.61e-03, grad_scale: 16.0 2023-06-21 03:16:22,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=702264.0, ans=0.1 2023-06-21 03:17:02,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=702384.0, ans=0.125 2023-06-21 03:17:31,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=702444.0, ans=0.2 2023-06-21 03:17:36,138 INFO [train.py:996] (2/4) Epoch 4, batch 25600, loss[loss=0.2706, simple_loss=0.3385, pruned_loss=0.1013, over 21737.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3095, pruned_loss=0.07818, over 4255090.38 frames. ], batch size: 351, lr: 7.61e-03, grad_scale: 32.0 2023-06-21 03:17:49,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-21 03:18:28,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=702564.0, ans=0.1 2023-06-21 03:18:48,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=702624.0, ans=0.0 2023-06-21 03:18:58,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=702684.0, ans=0.125 2023-06-21 03:19:16,518 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.481e+02 2.894e+02 3.647e+02 7.641e+02, threshold=5.788e+02, percent-clipped=5.0 2023-06-21 03:19:31,805 INFO [train.py:996] (2/4) Epoch 4, batch 25650, loss[loss=0.2256, simple_loss=0.2846, pruned_loss=0.08332, over 21450.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3101, pruned_loss=0.08118, over 4259808.19 frames. ], batch size: 211, lr: 7.61e-03, grad_scale: 32.0 2023-06-21 03:20:31,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=702924.0, ans=0.125 2023-06-21 03:21:21,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.46 vs. 
limit=15.0 2023-06-21 03:21:21,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-21 03:21:22,354 INFO [train.py:996] (2/4) Epoch 4, batch 25700, loss[loss=0.2653, simple_loss=0.3187, pruned_loss=0.1059, over 21728.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3087, pruned_loss=0.0831, over 4258657.90 frames. ], batch size: 441, lr: 7.61e-03, grad_scale: 32.0 2023-06-21 03:22:01,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.18 vs. limit=12.0 2023-06-21 03:22:01,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-21 03:22:36,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=703284.0, ans=0.0 2023-06-21 03:22:43,847 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.692e+02 3.025e+02 3.444e+02 5.063e+02, threshold=6.050e+02, percent-clipped=0.0 2023-06-21 03:23:08,321 INFO [train.py:996] (2/4) Epoch 4, batch 25750, loss[loss=0.3134, simple_loss=0.4037, pruned_loss=0.1115, over 21323.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3139, pruned_loss=0.08597, over 4267638.80 frames. ], batch size: 548, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:23:50,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-21 03:24:06,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-21 03:24:08,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=703464.0, ans=0.125 2023-06-21 03:24:29,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=703524.0, ans=0.2 2023-06-21 03:24:51,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=703584.0, ans=0.125 2023-06-21 03:25:44,701 INFO [train.py:996] (2/4) Epoch 4, batch 25800, loss[loss=0.2658, simple_loss=0.3363, pruned_loss=0.09762, over 21705.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3265, pruned_loss=0.09129, over 4262551.87 frames. ], batch size: 332, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:26:12,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=703704.0, ans=0.0 2023-06-21 03:27:04,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=703824.0, ans=0.0 2023-06-21 03:27:18,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-21 03:27:26,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-21 03:27:33,242 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.719e+02 3.110e+02 3.607e+02 5.723e+02, threshold=6.221e+02, percent-clipped=0.0 2023-06-21 03:27:34,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.57 vs. limit=15.0 2023-06-21 03:27:42,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=12.0 2023-06-21 03:27:45,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-21 03:27:53,810 INFO [train.py:996] (2/4) Epoch 4, batch 25850, loss[loss=0.1811, simple_loss=0.2435, pruned_loss=0.05937, over 16796.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3282, pruned_loss=0.09098, over 4259666.69 frames. ], batch size: 62, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:29:06,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=704124.0, ans=0.125 2023-06-21 03:29:59,621 INFO [train.py:996] (2/4) Epoch 4, batch 25900, loss[loss=0.2481, simple_loss=0.3244, pruned_loss=0.08592, over 21798.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3283, pruned_loss=0.09074, over 4267169.92 frames. ], batch size: 112, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:30:05,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=704304.0, ans=0.0 2023-06-21 03:31:18,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=704424.0, ans=0.125 2023-06-21 03:31:21,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=704424.0, ans=0.125 2023-06-21 03:31:23,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-21 03:31:31,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=704484.0, ans=0.04949747468305833 2023-06-21 03:31:51,623 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.727e+02 3.001e+02 3.716e+02 5.320e+02, threshold=6.003e+02, percent-clipped=0.0 2023-06-21 03:32:06,565 INFO [train.py:996] (2/4) Epoch 4, batch 25950, loss[loss=0.2645, simple_loss=0.3404, pruned_loss=0.09428, over 21282.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3349, pruned_loss=0.09352, over 4278166.22 frames. ], batch size: 143, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:32:29,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.44 vs. limit=15.0 2023-06-21 03:32:31,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=704604.0, ans=0.125 2023-06-21 03:32:58,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.38 vs. 
limit=22.5 2023-06-21 03:33:32,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=704784.0, ans=0.025 2023-06-21 03:33:32,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=704784.0, ans=10.0 2023-06-21 03:33:47,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=704844.0, ans=0.2 2023-06-21 03:34:13,443 INFO [train.py:996] (2/4) Epoch 4, batch 26000, loss[loss=0.2594, simple_loss=0.3376, pruned_loss=0.09059, over 21774.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3362, pruned_loss=0.09298, over 4278529.27 frames. ], batch size: 247, lr: 7.60e-03, grad_scale: 32.0 2023-06-21 03:34:21,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=704904.0, ans=0.0 2023-06-21 03:34:22,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=704904.0, ans=0.2 2023-06-21 03:34:34,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=704904.0, ans=0.125 2023-06-21 03:34:54,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=15.0 2023-06-21 03:35:45,256 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.487e+02 2.984e+02 3.738e+02 5.035e+02, threshold=5.968e+02, percent-clipped=0.0 2023-06-21 03:36:20,365 INFO [train.py:996] (2/4) Epoch 4, batch 26050, loss[loss=0.2393, simple_loss=0.3061, pruned_loss=0.08627, over 21861.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3347, pruned_loss=0.09303, over 4284232.34 frames. ], batch size: 371, lr: 7.59e-03, grad_scale: 32.0 2023-06-21 03:36:25,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=22.5 2023-06-21 03:36:27,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=705204.0, ans=0.125 2023-06-21 03:36:54,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=705264.0, ans=0.2 2023-06-21 03:36:57,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=705264.0, ans=0.0 2023-06-21 03:37:25,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=705384.0, ans=0.1 2023-06-21 03:37:31,353 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-21 03:37:37,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=705384.0, ans=0.125 2023-06-21 03:37:52,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=705444.0, ans=0.0 2023-06-21 03:38:13,463 INFO [train.py:996] (2/4) Epoch 4, batch 26100, loss[loss=0.2532, simple_loss=0.3079, pruned_loss=0.09931, over 21843.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3289, pruned_loss=0.09213, over 4277416.90 frames. 
], batch size: 441, lr: 7.59e-03, grad_scale: 32.0 2023-06-21 03:39:45,699 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.666e+02 3.098e+02 3.751e+02 6.264e+02, threshold=6.196e+02, percent-clipped=2.0 2023-06-21 03:39:50,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=705744.0, ans=0.95 2023-06-21 03:40:00,818 INFO [train.py:996] (2/4) Epoch 4, batch 26150, loss[loss=0.2891, simple_loss=0.3493, pruned_loss=0.1144, over 21346.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3244, pruned_loss=0.0919, over 4281182.60 frames. ], batch size: 159, lr: 7.59e-03, grad_scale: 32.0 2023-06-21 03:40:34,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=705864.0, ans=0.0 2023-06-21 03:41:17,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=705924.0, ans=0.125 2023-06-21 03:42:12,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=706044.0, ans=0.0 2023-06-21 03:42:31,087 INFO [train.py:996] (2/4) Epoch 4, batch 26200, loss[loss=0.3341, simple_loss=0.4165, pruned_loss=0.1259, over 21509.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3257, pruned_loss=0.09014, over 4278952.55 frames. ], batch size: 508, lr: 7.59e-03, grad_scale: 16.0 2023-06-21 03:42:45,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=706164.0, ans=0.1 2023-06-21 03:43:06,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=706224.0, ans=0.125 2023-06-21 03:43:13,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=706224.0, ans=15.0 2023-06-21 03:44:00,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=706344.0, ans=0.125 2023-06-21 03:44:04,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.476e+02 2.776e+02 3.346e+02 6.662e+02, threshold=5.552e+02, percent-clipped=1.0 2023-06-21 03:44:38,042 INFO [train.py:996] (2/4) Epoch 4, batch 26250, loss[loss=0.2716, simple_loss=0.3383, pruned_loss=0.1025, over 21747.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.328, pruned_loss=0.08816, over 4280604.99 frames. ], batch size: 389, lr: 7.59e-03, grad_scale: 16.0 2023-06-21 03:44:59,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=706464.0, ans=0.05 2023-06-21 03:46:35,594 INFO [train.py:996] (2/4) Epoch 4, batch 26300, loss[loss=0.2317, simple_loss=0.304, pruned_loss=0.07971, over 21871.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3249, pruned_loss=0.08846, over 4286861.26 frames. ], batch size: 124, lr: 7.59e-03, grad_scale: 16.0 2023-06-21 03:48:19,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.68 vs. 
limit=22.5 2023-06-21 03:48:39,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.559e+02 2.845e+02 3.128e+02 5.313e+02, threshold=5.690e+02, percent-clipped=0.0 2023-06-21 03:48:46,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=706944.0, ans=0.2 2023-06-21 03:48:53,022 INFO [train.py:996] (2/4) Epoch 4, batch 26350, loss[loss=0.2569, simple_loss=0.3302, pruned_loss=0.09181, over 21658.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3232, pruned_loss=0.08907, over 4291866.65 frames. ], batch size: 112, lr: 7.58e-03, grad_scale: 16.0 2023-06-21 03:49:07,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=707004.0, ans=0.125 2023-06-21 03:49:09,403 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-21 03:49:29,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=707064.0, ans=10.0 2023-06-21 03:50:46,312 INFO [train.py:996] (2/4) Epoch 4, batch 26400, loss[loss=0.2472, simple_loss=0.3036, pruned_loss=0.0954, over 21865.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3179, pruned_loss=0.08906, over 4290730.26 frames. ], batch size: 102, lr: 7.58e-03, grad_scale: 32.0 2023-06-21 03:51:09,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=707364.0, ans=0.0 2023-06-21 03:51:23,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=707424.0, ans=0.0 2023-06-21 03:51:52,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=707484.0, ans=0.1 2023-06-21 03:52:09,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=707544.0, ans=0.2 2023-06-21 03:52:11,728 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 2.879e+02 3.292e+02 3.769e+02 5.955e+02, threshold=6.584e+02, percent-clipped=1.0 2023-06-21 03:52:13,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=707544.0, ans=0.125 2023-06-21 03:52:13,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=707544.0, ans=0.09899494936611666 2023-06-21 03:52:33,049 INFO [train.py:996] (2/4) Epoch 4, batch 26450, loss[loss=0.2948, simple_loss=0.3909, pruned_loss=0.09941, over 21654.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3189, pruned_loss=0.08926, over 4284716.55 frames. ], batch size: 414, lr: 7.58e-03, grad_scale: 32.0 2023-06-21 03:54:14,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.34 vs. limit=15.0 2023-06-21 03:54:38,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-21 03:54:54,324 INFO [train.py:996] (2/4) Epoch 4, batch 26500, loss[loss=0.2443, simple_loss=0.3282, pruned_loss=0.08021, over 21633.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3207, pruned_loss=0.08762, over 4270954.91 frames. 
], batch size: 389, lr: 7.58e-03, grad_scale: 16.0 2023-06-21 03:55:00,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=707904.0, ans=0.0 2023-06-21 03:55:20,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=707904.0, ans=0.025 2023-06-21 03:55:43,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=707964.0, ans=0.0 2023-06-21 03:55:45,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=707964.0, ans=0.0 2023-06-21 03:56:02,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=708024.0, ans=0.125 2023-06-21 03:56:27,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-21 03:56:29,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=708084.0, ans=0.2 2023-06-21 03:56:35,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.27 vs. limit=15.0 2023-06-21 03:56:37,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=708084.0, ans=0.0 2023-06-21 03:56:53,848 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.479e+02 2.895e+02 3.453e+02 8.083e+02, threshold=5.789e+02, percent-clipped=2.0 2023-06-21 03:57:25,085 INFO [train.py:996] (2/4) Epoch 4, batch 26550, loss[loss=0.1994, simple_loss=0.3125, pruned_loss=0.04317, over 20800.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3158, pruned_loss=0.08473, over 4257471.73 frames. ], batch size: 609, lr: 7.58e-03, grad_scale: 16.0 2023-06-21 03:57:44,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=708204.0, ans=0.125 2023-06-21 03:58:13,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=708264.0, ans=0.125 2023-06-21 03:58:40,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=708324.0, ans=0.0 2023-06-21 03:59:37,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=708444.0, ans=0.125 2023-06-21 03:59:43,289 INFO [train.py:996] (2/4) Epoch 4, batch 26600, loss[loss=0.2381, simple_loss=0.3028, pruned_loss=0.08665, over 21182.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3151, pruned_loss=0.0821, over 4260788.03 frames. ], batch size: 159, lr: 7.58e-03, grad_scale: 16.0 2023-06-21 04:00:12,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=708564.0, ans=0.125 2023-06-21 04:00:34,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.44 vs. 
limit=15.0 2023-06-21 04:00:46,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=708684.0, ans=0.125 2023-06-21 04:01:15,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=708684.0, ans=0.0 2023-06-21 04:01:27,989 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.406e+02 2.882e+02 3.356e+02 5.632e+02, threshold=5.763e+02, percent-clipped=0.0 2023-06-21 04:01:39,519 INFO [train.py:996] (2/4) Epoch 4, batch 26650, loss[loss=0.2004, simple_loss=0.2794, pruned_loss=0.06066, over 21769.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3086, pruned_loss=0.08102, over 4255728.69 frames. ], batch size: 351, lr: 7.57e-03, grad_scale: 16.0 2023-06-21 04:01:46,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=708804.0, ans=0.0 2023-06-21 04:01:46,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=708804.0, ans=0.2 2023-06-21 04:01:59,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=708804.0, ans=0.125 2023-06-21 04:02:26,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=708864.0, ans=0.125 2023-06-21 04:03:02,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=708984.0, ans=0.0 2023-06-21 04:03:38,416 INFO [train.py:996] (2/4) Epoch 4, batch 26700, loss[loss=0.2188, simple_loss=0.2829, pruned_loss=0.07734, over 21263.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3013, pruned_loss=0.07781, over 4263702.47 frames. ], batch size: 176, lr: 7.57e-03, grad_scale: 16.0 2023-06-21 04:03:45,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=709104.0, ans=0.1 2023-06-21 04:04:21,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=709224.0, ans=0.125 2023-06-21 04:04:28,245 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.48 vs. limit=15.0 2023-06-21 04:04:30,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=709224.0, ans=0.035 2023-06-21 04:05:03,602 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 2.086e+02 2.352e+02 2.691e+02 3.815e+02, threshold=4.705e+02, percent-clipped=0.0 2023-06-21 04:05:04,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=709344.0, ans=0.02 2023-06-21 04:05:05,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=709344.0, ans=0.125 2023-06-21 04:05:20,969 INFO [train.py:996] (2/4) Epoch 4, batch 26750, loss[loss=0.2384, simple_loss=0.3259, pruned_loss=0.07538, over 21719.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3021, pruned_loss=0.07766, over 4267506.01 frames. 
], batch size: 351, lr: 7.57e-03, grad_scale: 16.0 2023-06-21 04:05:21,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=709404.0, ans=0.125 2023-06-21 04:05:46,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=709464.0, ans=0.125 2023-06-21 04:05:53,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=709464.0, ans=0.125 2023-06-21 04:06:24,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=709584.0, ans=0.1 2023-06-21 04:06:31,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=709584.0, ans=0.0 2023-06-21 04:06:59,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=709644.0, ans=0.04949747468305833 2023-06-21 04:07:04,503 INFO [train.py:996] (2/4) Epoch 4, batch 26800, loss[loss=0.3312, simple_loss=0.3794, pruned_loss=0.1415, over 21300.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3113, pruned_loss=0.08326, over 4273866.71 frames. ], batch size: 507, lr: 7.57e-03, grad_scale: 32.0 2023-06-21 04:07:25,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=709764.0, ans=0.125 2023-06-21 04:08:10,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=709884.0, ans=0.125 2023-06-21 04:08:19,058 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.660e+02 3.035e+02 3.389e+02 6.268e+02, threshold=6.069e+02, percent-clipped=8.0 2023-06-21 04:08:36,510 INFO [train.py:996] (2/4) Epoch 4, batch 26850, loss[loss=0.2235, simple_loss=0.2798, pruned_loss=0.08363, over 21652.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3122, pruned_loss=0.0854, over 4278349.37 frames. ], batch size: 282, lr: 7.57e-03, grad_scale: 32.0 2023-06-21 04:08:40,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-21 04:08:41,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2023-06-21 04:08:45,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.66 vs. 
limit=6.0 2023-06-21 04:08:48,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=710004.0, ans=0.125 2023-06-21 04:08:52,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=710064.0, ans=0.2 2023-06-21 04:09:01,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=710064.0, ans=0.1 2023-06-21 04:09:32,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=710184.0, ans=0.125 2023-06-21 04:10:09,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=710244.0, ans=0.125 2023-06-21 04:10:12,275 INFO [train.py:996] (2/4) Epoch 4, batch 26900, loss[loss=0.2082, simple_loss=0.2755, pruned_loss=0.07042, over 15762.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.304, pruned_loss=0.08382, over 4263102.56 frames. ], batch size: 63, lr: 7.57e-03, grad_scale: 32.0 2023-06-21 04:10:28,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=710364.0, ans=0.0 2023-06-21 04:11:10,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=710484.0, ans=0.125 2023-06-21 04:11:20,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=710484.0, ans=0.125 2023-06-21 04:11:21,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=710484.0, ans=0.125 2023-06-21 04:11:25,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.05 vs. limit=22.5 2023-06-21 04:11:29,914 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.421e+02 2.683e+02 3.097e+02 4.785e+02, threshold=5.366e+02, percent-clipped=0.0 2023-06-21 04:11:47,567 INFO [train.py:996] (2/4) Epoch 4, batch 26950, loss[loss=0.217, simple_loss=0.2779, pruned_loss=0.07804, over 21678.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3023, pruned_loss=0.08373, over 4268298.14 frames. ], batch size: 299, lr: 7.57e-03, grad_scale: 32.0 2023-06-21 04:12:10,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=710664.0, ans=0.125 2023-06-21 04:12:22,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=710724.0, ans=0.02 2023-06-21 04:12:51,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-21 04:13:24,351 INFO [train.py:996] (2/4) Epoch 4, batch 27000, loss[loss=0.1933, simple_loss=0.2826, pruned_loss=0.05201, over 21608.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3017, pruned_loss=0.08029, over 4270124.85 frames. 
], batch size: 263, lr: 7.56e-03, grad_scale: 32.0 2023-06-21 04:13:24,352 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 04:14:20,350 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.9900, 1.8262, 3.2941, 2.0381], device='cuda:2') 2023-06-21 04:14:23,379 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2574, simple_loss=0.3499, pruned_loss=0.08242, over 1796401.00 frames. 2023-06-21 04:14:23,383 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 04:14:33,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=710904.0, ans=0.0 2023-06-21 04:14:55,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-21 04:15:10,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=711024.0, ans=0.2 2023-06-21 04:15:42,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=711084.0, ans=0.0 2023-06-21 04:15:48,123 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 2.341e+02 2.653e+02 3.174e+02 5.780e+02, threshold=5.306e+02, percent-clipped=1.0 2023-06-21 04:15:59,802 INFO [train.py:996] (2/4) Epoch 4, batch 27050, loss[loss=0.3004, simple_loss=0.351, pruned_loss=0.1249, over 21651.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3046, pruned_loss=0.07722, over 4269571.32 frames. ], batch size: 507, lr: 7.56e-03, grad_scale: 32.0 2023-06-21 04:16:11,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=711204.0, ans=0.015 2023-06-21 04:16:44,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=711324.0, ans=0.2 2023-06-21 04:17:07,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=711384.0, ans=0.125 2023-06-21 04:17:36,459 INFO [train.py:996] (2/4) Epoch 4, batch 27100, loss[loss=0.2474, simple_loss=0.3265, pruned_loss=0.08418, over 21344.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3076, pruned_loss=0.07874, over 4273047.74 frames. ], batch size: 159, lr: 7.56e-03, grad_scale: 32.0 2023-06-21 04:18:08,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=711564.0, ans=0.125 2023-06-21 04:18:28,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=711564.0, ans=0.125 2023-06-21 04:18:28,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=711564.0, ans=0.0 2023-06-21 04:19:13,957 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.531e+02 3.029e+02 3.575e+02 6.566e+02, threshold=6.059e+02, percent-clipped=4.0 2023-06-21 04:19:22,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=711744.0, ans=0.125 2023-06-21 04:19:25,872 INFO [train.py:996] (2/4) Epoch 4, batch 27150, loss[loss=0.2974, simple_loss=0.3888, pruned_loss=0.103, over 21316.00 frames. 
], tot_loss[loss=0.2418, simple_loss=0.3193, pruned_loss=0.08214, over 4277411.59 frames. ], batch size: 548, lr: 7.56e-03, grad_scale: 32.0 2023-06-21 04:19:26,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=711804.0, ans=0.125 2023-06-21 04:19:34,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=711804.0, ans=0.0 2023-06-21 04:20:10,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-21 04:20:35,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=711984.0, ans=0.0 2023-06-21 04:20:44,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=711984.0, ans=0.125 2023-06-21 04:20:53,832 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:21:13,171 INFO [train.py:996] (2/4) Epoch 4, batch 27200, loss[loss=0.3142, simple_loss=0.4096, pruned_loss=0.1094, over 21276.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3268, pruned_loss=0.08494, over 4271820.57 frames. ], batch size: 548, lr: 7.56e-03, grad_scale: 32.0 2023-06-21 04:21:13,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=712104.0, ans=0.125 2023-06-21 04:21:31,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-21 04:21:36,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=712164.0, ans=0.2 2023-06-21 04:21:53,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=712224.0, ans=0.125 2023-06-21 04:23:02,070 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.916e+02 3.596e+02 4.478e+02 6.197e+02, threshold=7.191e+02, percent-clipped=3.0 2023-06-21 04:23:20,364 INFO [train.py:996] (2/4) Epoch 4, batch 27250, loss[loss=0.2844, simple_loss=0.3481, pruned_loss=0.1104, over 21255.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3299, pruned_loss=0.08913, over 4268057.85 frames. ], batch size: 143, lr: 7.56e-03, grad_scale: 32.0 2023-06-21 04:23:40,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=712464.0, ans=0.125 2023-06-21 04:23:45,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=712464.0, ans=0.2 2023-06-21 04:24:03,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=712524.0, ans=0.125 2023-06-21 04:24:11,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=712584.0, ans=0.0 2023-06-21 04:25:04,909 INFO [train.py:996] (2/4) Epoch 4, batch 27300, loss[loss=0.2368, simple_loss=0.2914, pruned_loss=0.09112, over 20089.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3314, pruned_loss=0.09039, over 4271615.25 frames. 
], batch size: 703, lr: 7.55e-03, grad_scale: 32.0 2023-06-21 04:26:17,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=712824.0, ans=0.125 2023-06-21 04:26:36,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=712884.0, ans=0.1 2023-06-21 04:26:37,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.86 vs. limit=15.0 2023-06-21 04:26:50,684 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.547e+02 2.887e+02 3.256e+02 5.962e+02, threshold=5.774e+02, percent-clipped=0.0 2023-06-21 04:26:51,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=712944.0, ans=0.0 2023-06-21 04:27:01,876 INFO [train.py:996] (2/4) Epoch 4, batch 27350, loss[loss=0.268, simple_loss=0.352, pruned_loss=0.09199, over 21665.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3343, pruned_loss=0.09105, over 4277567.44 frames. ], batch size: 414, lr: 7.55e-03, grad_scale: 32.0 2023-06-21 04:27:02,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=713004.0, ans=0.0 2023-06-21 04:27:19,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=713004.0, ans=0.125 2023-06-21 04:28:01,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=713064.0, ans=0.125 2023-06-21 04:29:05,752 INFO [train.py:996] (2/4) Epoch 4, batch 27400, loss[loss=0.2207, simple_loss=0.2819, pruned_loss=0.07972, over 21837.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3293, pruned_loss=0.09014, over 4276451.54 frames. ], batch size: 98, lr: 7.55e-03, grad_scale: 32.0 2023-06-21 04:29:11,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-21 04:29:24,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-06-21 04:29:56,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=713424.0, ans=0.0 2023-06-21 04:30:13,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=22.5 2023-06-21 04:30:42,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.364e+02 2.653e+02 3.007e+02 3.565e+02, threshold=5.305e+02, percent-clipped=0.0 2023-06-21 04:30:51,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=713544.0, ans=0.125 2023-06-21 04:31:01,233 INFO [train.py:996] (2/4) Epoch 4, batch 27450, loss[loss=0.2521, simple_loss=0.3384, pruned_loss=0.08296, over 21869.00 frames. ], tot_loss[loss=0.25, simple_loss=0.323, pruned_loss=0.08856, over 4281223.12 frames. 
], batch size: 317, lr: 7.55e-03, grad_scale: 32.0 2023-06-21 04:31:03,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=713604.0, ans=0.125 2023-06-21 04:32:41,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=713844.0, ans=0.125 2023-06-21 04:32:46,517 INFO [train.py:996] (2/4) Epoch 4, batch 27500, loss[loss=0.2461, simple_loss=0.3208, pruned_loss=0.0857, over 21765.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3217, pruned_loss=0.08887, over 4287422.43 frames. ], batch size: 332, lr: 7.55e-03, grad_scale: 32.0 2023-06-21 04:34:03,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=714084.0, ans=0.125 2023-06-21 04:34:15,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.580e+02 3.045e+02 3.674e+02 6.292e+02, threshold=6.090e+02, percent-clipped=1.0 2023-06-21 04:34:27,559 INFO [train.py:996] (2/4) Epoch 4, batch 27550, loss[loss=0.2854, simple_loss=0.3581, pruned_loss=0.1064, over 19995.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3169, pruned_loss=0.08622, over 4287789.83 frames. ], batch size: 702, lr: 7.55e-03, grad_scale: 32.0 2023-06-21 04:35:59,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=714444.0, ans=0.125 2023-06-21 04:36:03,177 INFO [train.py:996] (2/4) Epoch 4, batch 27600, loss[loss=0.2168, simple_loss=0.2787, pruned_loss=0.07741, over 21687.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.309, pruned_loss=0.08441, over 4291468.91 frames. ], batch size: 282, lr: 7.54e-03, grad_scale: 32.0 2023-06-21 04:36:09,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-21 04:36:12,284 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:36:24,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-21 04:36:25,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=714564.0, ans=0.125 2023-06-21 04:36:29,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-21 04:37:27,292 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.399e+02 2.649e+02 3.187e+02 5.092e+02, threshold=5.298e+02, percent-clipped=0.0 2023-06-21 04:37:38,966 INFO [train.py:996] (2/4) Epoch 4, batch 27650, loss[loss=0.2263, simple_loss=0.3017, pruned_loss=0.07543, over 21283.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3049, pruned_loss=0.08394, over 4282505.83 frames. ], batch size: 159, lr: 7.54e-03, grad_scale: 32.0 2023-06-21 04:37:45,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=714804.0, ans=0.125 2023-06-21 04:37:45,685 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.12 vs. 
limit=10.0 2023-06-21 04:38:26,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=714924.0, ans=0.125 2023-06-21 04:39:31,716 INFO [train.py:996] (2/4) Epoch 4, batch 27700, loss[loss=0.2197, simple_loss=0.3012, pruned_loss=0.06915, over 21614.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3043, pruned_loss=0.08161, over 4279010.76 frames. ], batch size: 263, lr: 7.54e-03, grad_scale: 32.0 2023-06-21 04:39:39,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=715104.0, ans=0.015 2023-06-21 04:40:14,616 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:40:16,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2023-06-21 04:40:48,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=715284.0, ans=0.2 2023-06-21 04:40:56,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0 2023-06-21 04:40:56,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.480e+02 2.917e+02 3.384e+02 5.795e+02, threshold=5.834e+02, percent-clipped=2.0 2023-06-21 04:41:05,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=715344.0, ans=0.0 2023-06-21 04:41:07,996 INFO [train.py:996] (2/4) Epoch 4, batch 27750, loss[loss=0.219, simple_loss=0.2847, pruned_loss=0.07668, over 21370.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3068, pruned_loss=0.0812, over 4282049.09 frames. ], batch size: 131, lr: 7.54e-03, grad_scale: 32.0 2023-06-21 04:41:08,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=715404.0, ans=0.0 2023-06-21 04:42:45,358 INFO [train.py:996] (2/4) Epoch 4, batch 27800, loss[loss=0.242, simple_loss=0.3018, pruned_loss=0.09115, over 21365.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3048, pruned_loss=0.08128, over 4280351.70 frames. ], batch size: 159, lr: 7.54e-03, grad_scale: 32.0 2023-06-21 04:44:10,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0 2023-06-21 04:44:21,152 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.548e+02 3.003e+02 3.795e+02 7.044e+02, threshold=6.005e+02, percent-clipped=2.0 2023-06-21 04:44:33,677 INFO [train.py:996] (2/4) Epoch 4, batch 27850, loss[loss=0.2569, simple_loss=0.3367, pruned_loss=0.08854, over 21026.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3046, pruned_loss=0.08235, over 4288473.89 frames. ], batch size: 607, lr: 7.54e-03, grad_scale: 32.0 2023-06-21 04:45:26,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.64 vs. 
limit=15.0 2023-06-21 04:45:32,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=716124.0, ans=0.0 2023-06-21 04:45:56,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=716244.0, ans=0.125 2023-06-21 04:46:00,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=716244.0, ans=0.0 2023-06-21 04:46:06,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=716244.0, ans=0.1 2023-06-21 04:46:17,452 INFO [train.py:996] (2/4) Epoch 4, batch 27900, loss[loss=0.2026, simple_loss=0.2888, pruned_loss=0.0582, over 21370.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3138, pruned_loss=0.08306, over 4281497.07 frames. ], batch size: 194, lr: 7.54e-03, grad_scale: 32.0 2023-06-21 04:46:19,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=716304.0, ans=0.2 2023-06-21 04:46:58,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-06-21 04:47:20,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=716484.0, ans=0.1 2023-06-21 04:47:27,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=716484.0, ans=0.0 2023-06-21 04:47:30,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=716484.0, ans=0.125 2023-06-21 04:48:01,886 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.532e+02 3.006e+02 3.855e+02 8.106e+02, threshold=6.012e+02, percent-clipped=3.0 2023-06-21 04:48:19,855 INFO [train.py:996] (2/4) Epoch 4, batch 27950, loss[loss=0.1913, simple_loss=0.2818, pruned_loss=0.05047, over 21585.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3138, pruned_loss=0.08009, over 4277846.48 frames. ], batch size: 230, lr: 7.53e-03, grad_scale: 32.0 2023-06-21 04:50:18,746 INFO [train.py:996] (2/4) Epoch 4, batch 28000, loss[loss=0.2196, simple_loss=0.3004, pruned_loss=0.06941, over 21855.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3117, pruned_loss=0.07873, over 4282327.36 frames. ], batch size: 351, lr: 7.53e-03, grad_scale: 32.0 2023-06-21 04:51:42,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=717084.0, ans=0.125 2023-06-21 04:51:59,102 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.391e+02 2.762e+02 3.354e+02 5.546e+02, threshold=5.523e+02, percent-clipped=0.0 2023-06-21 04:52:22,060 INFO [train.py:996] (2/4) Epoch 4, batch 28050, loss[loss=0.3173, simple_loss=0.3787, pruned_loss=0.128, over 21604.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3096, pruned_loss=0.08052, over 4274422.51 frames. 
], batch size: 508, lr: 7.53e-03, grad_scale: 32.0 2023-06-21 04:52:22,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=717204.0, ans=0.125 2023-06-21 04:52:51,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=717264.0, ans=0.2 2023-06-21 04:53:05,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=717324.0, ans=0.2 2023-06-21 04:53:29,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=717384.0, ans=0.0 2023-06-21 04:53:55,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=717384.0, ans=0.0 2023-06-21 04:54:05,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=717444.0, ans=0.125 2023-06-21 04:54:21,883 INFO [train.py:996] (2/4) Epoch 4, batch 28100, loss[loss=0.2266, simple_loss=0.2919, pruned_loss=0.0806, over 21759.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3084, pruned_loss=0.0807, over 4272542.93 frames. ], batch size: 351, lr: 7.53e-03, grad_scale: 32.0 2023-06-21 04:55:04,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=717564.0, ans=0.04949747468305833 2023-06-21 04:55:27,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=717624.0, ans=0.2 2023-06-21 04:55:43,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=717624.0, ans=0.1 2023-06-21 04:56:05,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=717684.0, ans=0.0 2023-06-21 04:56:20,286 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.718e+02 3.349e+02 4.404e+02 9.791e+02, threshold=6.698e+02, percent-clipped=12.0 2023-06-21 04:56:39,054 INFO [train.py:996] (2/4) Epoch 4, batch 28150, loss[loss=0.231, simple_loss=0.292, pruned_loss=0.08505, over 21948.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3028, pruned_loss=0.08023, over 4273476.98 frames. ], batch size: 113, lr: 7.53e-03, grad_scale: 32.0 2023-06-21 04:56:40,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-21 04:58:17,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=717984.0, ans=0.1 2023-06-21 04:58:30,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=717984.0, ans=0.125 2023-06-21 04:58:41,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=717984.0, ans=0.0 2023-06-21 04:58:45,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=718044.0, ans=0.125 2023-06-21 04:59:05,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.96 vs. 
limit=15.0 2023-06-21 04:59:16,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=718044.0, ans=0.0 2023-06-21 04:59:19,393 INFO [train.py:996] (2/4) Epoch 4, batch 28200, loss[loss=0.2745, simple_loss=0.3438, pruned_loss=0.1026, over 21765.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3031, pruned_loss=0.0819, over 4274896.23 frames. ], batch size: 124, lr: 7.53e-03, grad_scale: 32.0 2023-06-21 05:00:00,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=718164.0, ans=0.2 2023-06-21 05:00:08,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-21 05:00:11,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=718224.0, ans=0.125 2023-06-21 05:00:11,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=718224.0, ans=0.125 2023-06-21 05:00:35,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=718224.0, ans=0.0 2023-06-21 05:01:01,116 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.643e+02 3.167e+02 3.877e+02 7.013e+02, threshold=6.334e+02, percent-clipped=1.0 2023-06-21 05:01:27,689 INFO [train.py:996] (2/4) Epoch 4, batch 28250, loss[loss=0.2297, simple_loss=0.2887, pruned_loss=0.0854, over 21650.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3057, pruned_loss=0.08509, over 4271073.77 frames. ], batch size: 298, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:01:57,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-21 05:03:39,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. limit=10.0 2023-06-21 05:04:06,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-21 05:04:15,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=718644.0, ans=0.0 2023-06-21 05:04:18,248 INFO [train.py:996] (2/4) Epoch 4, batch 28300, loss[loss=0.1765, simple_loss=0.2521, pruned_loss=0.05045, over 21237.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3016, pruned_loss=0.08203, over 4269187.75 frames. ], batch size: 159, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:04:30,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. 
limit=6.0 2023-06-21 05:04:36,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=718764.0, ans=0.0 2023-06-21 05:04:50,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=718764.0, ans=0.0 2023-06-21 05:05:45,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=718884.0, ans=0.2 2023-06-21 05:05:46,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. limit=10.0 2023-06-21 05:06:06,027 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:06:09,701 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 2.185e+02 2.470e+02 3.133e+02 5.042e+02, threshold=4.941e+02, percent-clipped=0.0 2023-06-21 05:06:39,497 INFO [train.py:996] (2/4) Epoch 4, batch 28350, loss[loss=0.2351, simple_loss=0.2994, pruned_loss=0.08537, over 21326.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2999, pruned_loss=0.0768, over 4263470.93 frames. ], batch size: 471, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:06:48,256 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:07:04,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=719004.0, ans=0.2 2023-06-21 05:07:24,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=719064.0, ans=0.125 2023-06-21 05:07:55,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=719124.0, ans=0.125 2023-06-21 05:08:01,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=719124.0, ans=0.0 2023-06-21 05:08:45,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=719184.0, ans=0.0 2023-06-21 05:08:52,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=719244.0, ans=0.125 2023-06-21 05:08:53,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=719244.0, ans=0.125 2023-06-21 05:09:07,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=719244.0, ans=0.0 2023-06-21 05:09:31,702 INFO [train.py:996] (2/4) Epoch 4, batch 28400, loss[loss=0.2685, simple_loss=0.3334, pruned_loss=0.1018, over 21183.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2961, pruned_loss=0.07661, over 4271183.14 frames. ], batch size: 143, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:10:51,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=719424.0, ans=0.05 2023-06-21 05:11:42,792 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.715e+02 3.259e+02 3.998e+02 7.508e+02, threshold=6.518e+02, percent-clipped=8.0 2023-06-21 05:12:07,595 INFO [train.py:996] (2/4) Epoch 4, batch 28450, loss[loss=0.2602, simple_loss=0.3278, pruned_loss=0.09625, over 21799.00 frames. 
], tot_loss[loss=0.2316, simple_loss=0.3018, pruned_loss=0.08065, over 4278541.99 frames. ], batch size: 112, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:12:12,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=719604.0, ans=0.0 2023-06-21 05:13:13,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.10 vs. limit=15.0 2023-06-21 05:13:20,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=719724.0, ans=0.125 2023-06-21 05:14:01,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=719784.0, ans=0.125 2023-06-21 05:14:54,377 INFO [train.py:996] (2/4) Epoch 4, batch 28500, loss[loss=0.2761, simple_loss=0.3391, pruned_loss=0.1065, over 21225.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3048, pruned_loss=0.08395, over 4286508.80 frames. ], batch size: 143, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:15:32,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=719964.0, ans=0.125 2023-06-21 05:16:51,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=720084.0, ans=0.125 2023-06-21 05:16:51,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=720084.0, ans=0.0 2023-06-21 05:16:53,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=720084.0, ans=0.2 2023-06-21 05:17:05,831 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.689e+02 3.008e+02 3.521e+02 7.488e+02, threshold=6.015e+02, percent-clipped=1.0 2023-06-21 05:17:06,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=720144.0, ans=0.125 2023-06-21 05:17:37,835 INFO [train.py:996] (2/4) Epoch 4, batch 28550, loss[loss=0.2695, simple_loss=0.3651, pruned_loss=0.08695, over 21870.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3129, pruned_loss=0.08582, over 4285663.35 frames. ], batch size: 372, lr: 7.51e-03, grad_scale: 32.0 2023-06-21 05:18:33,024 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-21 05:18:45,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=720324.0, ans=0.125 2023-06-21 05:19:39,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=720384.0, ans=0.07 2023-06-21 05:19:56,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=720444.0, ans=0.125 2023-06-21 05:20:19,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=720444.0, ans=0.0 2023-06-21 05:20:28,532 INFO [train.py:996] (2/4) Epoch 4, batch 28600, loss[loss=0.2592, simple_loss=0.339, pruned_loss=0.08973, over 21664.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3193, pruned_loss=0.08784, over 4278662.51 frames. 
], batch size: 351, lr: 7.51e-03, grad_scale: 16.0 2023-06-21 05:20:35,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=720504.0, ans=0.125 2023-06-21 05:20:37,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=720504.0, ans=0.0 2023-06-21 05:21:40,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=720624.0, ans=0.0 2023-06-21 05:22:37,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.703e+02 3.009e+02 3.352e+02 5.683e+02, threshold=6.017e+02, percent-clipped=0.0 2023-06-21 05:22:51,419 INFO [train.py:996] (2/4) Epoch 4, batch 28650, loss[loss=0.1984, simple_loss=0.2611, pruned_loss=0.06788, over 21438.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3128, pruned_loss=0.08676, over 4275916.61 frames. ], batch size: 195, lr: 7.51e-03, grad_scale: 16.0 2023-06-21 05:23:53,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=720864.0, ans=0.05 2023-06-21 05:24:00,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=720924.0, ans=0.125 2023-06-21 05:24:01,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-21 05:24:22,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=720924.0, ans=0.125 2023-06-21 05:25:30,148 INFO [train.py:996] (2/4) Epoch 4, batch 28700, loss[loss=0.2694, simple_loss=0.3378, pruned_loss=0.1005, over 21923.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3129, pruned_loss=0.08858, over 4278127.31 frames. ], batch size: 372, lr: 7.51e-03, grad_scale: 16.0 2023-06-21 05:26:21,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=721164.0, ans=0.2 2023-06-21 05:27:52,696 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.736e+02 3.172e+02 3.821e+02 7.853e+02, threshold=6.343e+02, percent-clipped=6.0 2023-06-21 05:28:04,609 INFO [train.py:996] (2/4) Epoch 4, batch 28750, loss[loss=0.2267, simple_loss=0.2924, pruned_loss=0.08049, over 21353.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3124, pruned_loss=0.089, over 4266753.84 frames. ], batch size: 176, lr: 7.51e-03, grad_scale: 16.0 2023-06-21 05:28:34,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=721464.0, ans=0.125 2023-06-21 05:29:00,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. 
limit=15.0 2023-06-21 05:29:09,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=721464.0, ans=0.0 2023-06-21 05:29:14,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=721524.0, ans=0.0 2023-06-21 05:30:26,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=721584.0, ans=0.0 2023-06-21 05:30:56,309 INFO [train.py:996] (2/4) Epoch 4, batch 28800, loss[loss=0.2718, simple_loss=0.335, pruned_loss=0.1043, over 21924.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3163, pruned_loss=0.08943, over 4275385.66 frames. ], batch size: 316, lr: 7.51e-03, grad_scale: 32.0 2023-06-21 05:31:55,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=721764.0, ans=0.125 2023-06-21 05:31:57,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=721764.0, ans=0.125 2023-06-21 05:32:07,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=721824.0, ans=0.1 2023-06-21 05:32:51,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=721884.0, ans=0.125 2023-06-21 05:33:18,597 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.546e+02 2.847e+02 3.234e+02 4.509e+02, threshold=5.694e+02, percent-clipped=0.0 2023-06-21 05:33:30,194 INFO [train.py:996] (2/4) Epoch 4, batch 28850, loss[loss=0.2438, simple_loss=0.3148, pruned_loss=0.0864, over 21466.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3174, pruned_loss=0.091, over 4284720.71 frames. ], batch size: 131, lr: 7.51e-03, grad_scale: 32.0 2023-06-21 05:34:34,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=722064.0, ans=0.0 2023-06-21 05:35:32,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=722184.0, ans=0.125 2023-06-21 05:36:42,439 INFO [train.py:996] (2/4) Epoch 4, batch 28900, loss[loss=0.2324, simple_loss=0.2941, pruned_loss=0.08533, over 21451.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3192, pruned_loss=0.09213, over 4280924.40 frames. ], batch size: 194, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:36:52,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=722304.0, ans=22.5 2023-06-21 05:37:05,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=722364.0, ans=0.0 2023-06-21 05:37:56,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=722424.0, ans=0.125 2023-06-21 05:38:03,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=722424.0, ans=0.1 2023-06-21 05:38:08,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=722424.0, ans=0.125 2023-06-21 05:38:15,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. 
limit=6.0 2023-06-21 05:38:34,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=722484.0, ans=0.125 2023-06-21 05:39:05,967 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.757e+02 3.326e+02 3.999e+02 7.317e+02, threshold=6.653e+02, percent-clipped=7.0 2023-06-21 05:39:44,238 INFO [train.py:996] (2/4) Epoch 4, batch 28950, loss[loss=0.2515, simple_loss=0.3622, pruned_loss=0.07035, over 21234.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3194, pruned_loss=0.09125, over 4276018.22 frames. ], batch size: 548, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:41:41,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=722784.0, ans=0.125 2023-06-21 05:41:51,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=722784.0, ans=0.5 2023-06-21 05:42:10,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=722844.0, ans=0.0 2023-06-21 05:42:35,362 INFO [train.py:996] (2/4) Epoch 4, batch 29000, loss[loss=0.2553, simple_loss=0.3334, pruned_loss=0.08856, over 20781.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3221, pruned_loss=0.08983, over 4269835.24 frames. ], batch size: 607, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:43:59,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=723024.0, ans=0.125 2023-06-21 05:44:37,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=723084.0, ans=0.0 2023-06-21 05:44:48,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=723144.0, ans=0.1 2023-06-21 05:44:54,070 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.818e+02 3.202e+02 3.885e+02 5.481e+02, threshold=6.403e+02, percent-clipped=0.0 2023-06-21 05:45:34,364 INFO [train.py:996] (2/4) Epoch 4, batch 29050, loss[loss=0.2318, simple_loss=0.3036, pruned_loss=0.07999, over 21767.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3206, pruned_loss=0.09075, over 4272066.99 frames. ], batch size: 112, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:47:50,945 INFO [train.py:996] (2/4) Epoch 4, batch 29100, loss[loss=0.1933, simple_loss=0.2534, pruned_loss=0.06663, over 21560.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3127, pruned_loss=0.08758, over 4274891.74 frames. ], batch size: 196, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:48:53,121 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:49:07,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.78 vs. 
limit=15.0 2023-06-21 05:49:12,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=723624.0, ans=0.125 2023-06-21 05:49:14,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=723624.0, ans=0.0 2023-06-21 05:50:06,142 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.494e+02 2.914e+02 3.696e+02 5.946e+02, threshold=5.828e+02, percent-clipped=0.0 2023-06-21 05:50:13,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=723744.0, ans=0.0 2023-06-21 05:50:36,697 INFO [train.py:996] (2/4) Epoch 4, batch 29150, loss[loss=0.2237, simple_loss=0.2897, pruned_loss=0.07882, over 21209.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3105, pruned_loss=0.08587, over 4275689.10 frames. ], batch size: 176, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:51:00,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=723804.0, ans=0.0 2023-06-21 05:51:06,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=723864.0, ans=0.125 2023-06-21 05:51:56,376 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-21 05:52:26,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=723984.0, ans=0.125 2023-06-21 05:53:18,586 INFO [train.py:996] (2/4) Epoch 4, batch 29200, loss[loss=0.1975, simple_loss=0.2496, pruned_loss=0.0727, over 20706.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3073, pruned_loss=0.08535, over 4267827.49 frames. ], batch size: 608, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 05:53:25,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=724104.0, ans=0.1 2023-06-21 05:54:14,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=724224.0, ans=0.0 2023-06-21 05:54:25,337 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:55:34,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=724344.0, ans=0.125 2023-06-21 05:55:44,457 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.500e+02 2.791e+02 3.395e+02 6.739e+02, threshold=5.582e+02, percent-clipped=1.0 2023-06-21 05:55:51,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=724344.0, ans=0.0 2023-06-21 05:55:53,792 INFO [train.py:996] (2/4) Epoch 4, batch 29250, loss[loss=0.2264, simple_loss=0.3161, pruned_loss=0.06837, over 21622.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3057, pruned_loss=0.0829, over 4264796.55 frames. 
], batch size: 263, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 05:55:54,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=724404.0, ans=0.1 2023-06-21 05:56:02,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=724404.0, ans=0.5 2023-06-21 05:57:03,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=724524.0, ans=0.0 2023-06-21 05:57:13,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=724524.0, ans=0.0 2023-06-21 05:57:44,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=724584.0, ans=0.2 2023-06-21 05:58:12,173 INFO [train.py:996] (2/4) Epoch 4, batch 29300, loss[loss=0.2106, simple_loss=0.2695, pruned_loss=0.07586, over 21293.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3072, pruned_loss=0.08182, over 4273631.42 frames. ], batch size: 176, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 05:58:12,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=724704.0, ans=0.05 2023-06-21 05:59:33,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=724824.0, ans=0.125 2023-06-21 06:00:10,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=724884.0, ans=0.125 2023-06-21 06:00:12,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=724884.0, ans=0.0 2023-06-21 06:00:47,272 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.479e+02 2.818e+02 3.454e+02 4.910e+02, threshold=5.636e+02, percent-clipped=0.0 2023-06-21 06:00:58,264 INFO [train.py:996] (2/4) Epoch 4, batch 29350, loss[loss=0.2451, simple_loss=0.3328, pruned_loss=0.07866, over 21681.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3041, pruned_loss=0.08096, over 4274400.78 frames. ], batch size: 391, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 06:01:25,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=725004.0, ans=0.2 2023-06-21 06:01:32,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=725064.0, ans=0.07 2023-06-21 06:02:45,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=725184.0, ans=0.0 2023-06-21 06:03:34,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=725304.0, ans=0.125 2023-06-21 06:03:35,560 INFO [train.py:996] (2/4) Epoch 4, batch 29400, loss[loss=0.2103, simple_loss=0.2949, pruned_loss=0.06283, over 21727.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3018, pruned_loss=0.07884, over 4265456.72 frames. ], batch size: 391, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 06:03:37,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=725304.0, ans=0.125 2023-06-21 06:04:42,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.30 vs. 
limit=22.5 2023-06-21 06:05:22,190 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5 2023-06-21 06:05:23,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=725484.0, ans=0.125 2023-06-21 06:06:06,852 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.554e+02 2.866e+02 3.298e+02 4.992e+02, threshold=5.731e+02, percent-clipped=0.0 2023-06-21 06:06:14,120 INFO [train.py:996] (2/4) Epoch 4, batch 29450, loss[loss=0.3158, simple_loss=0.3993, pruned_loss=0.1161, over 21805.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3016, pruned_loss=0.07864, over 4264144.16 frames. ], batch size: 124, lr: 7.49e-03, grad_scale: 16.0 2023-06-21 06:06:15,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.89 vs. limit=10.0 2023-06-21 06:06:21,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=725604.0, ans=0.125 2023-06-21 06:06:28,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=15.0 2023-06-21 06:08:52,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=725844.0, ans=0.1 2023-06-21 06:08:59,114 INFO [train.py:996] (2/4) Epoch 4, batch 29500, loss[loss=0.313, simple_loss=0.3526, pruned_loss=0.1367, over 21762.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3056, pruned_loss=0.08205, over 4272335.94 frames. ], batch size: 508, lr: 7.49e-03, grad_scale: 16.0 2023-06-21 06:09:00,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-21 06:10:37,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=726024.0, ans=0.125 2023-06-21 06:11:27,322 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.546e+02 2.828e+02 3.561e+02 7.164e+02, threshold=5.655e+02, percent-clipped=2.0 2023-06-21 06:11:36,389 INFO [train.py:996] (2/4) Epoch 4, batch 29550, loss[loss=0.2296, simple_loss=0.2877, pruned_loss=0.08579, over 21552.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.305, pruned_loss=0.08362, over 4282198.67 frames. ], batch size: 548, lr: 7.48e-03, grad_scale: 16.0 2023-06-21 06:13:56,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=726384.0, ans=0.125 2023-06-21 06:13:59,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=726384.0, ans=0.125 2023-06-21 06:14:36,513 INFO [train.py:996] (2/4) Epoch 4, batch 29600, loss[loss=0.2461, simple_loss=0.3271, pruned_loss=0.08257, over 21442.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3123, pruned_loss=0.08611, over 4285070.56 frames. ], batch size: 211, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:15:33,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.48 vs. 
limit=15.0 2023-06-21 06:15:40,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=726564.0, ans=0.125 2023-06-21 06:15:41,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-21 06:16:59,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0 2023-06-21 06:17:17,835 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 2.703e+02 2.966e+02 3.927e+02 7.887e+02, threshold=5.932e+02, percent-clipped=5.0 2023-06-21 06:17:18,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=726744.0, ans=0.125 2023-06-21 06:17:19,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=726744.0, ans=0.0 2023-06-21 06:17:21,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=726744.0, ans=0.125 2023-06-21 06:17:25,413 INFO [train.py:996] (2/4) Epoch 4, batch 29650, loss[loss=0.2032, simple_loss=0.2765, pruned_loss=0.06489, over 21879.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3105, pruned_loss=0.08335, over 4275637.45 frames. ], batch size: 316, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:18:20,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726924.0, ans=0.1 2023-06-21 06:19:48,158 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:20:05,844 INFO [train.py:996] (2/4) Epoch 4, batch 29700, loss[loss=0.2435, simple_loss=0.3274, pruned_loss=0.07984, over 21129.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3118, pruned_loss=0.08296, over 4275847.46 frames. ], batch size: 143, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:21:17,173 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=15.0 2023-06-21 06:21:32,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=727224.0, ans=0.2 2023-06-21 06:21:52,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=727224.0, ans=0.0 2023-06-21 06:22:32,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.514e+02 2.951e+02 3.470e+02 6.624e+02, threshold=5.902e+02, percent-clipped=3.0 2023-06-21 06:22:49,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=22.5 2023-06-21 06:22:49,521 INFO [train.py:996] (2/4) Epoch 4, batch 29750, loss[loss=0.2119, simple_loss=0.2883, pruned_loss=0.06779, over 21458.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3181, pruned_loss=0.08352, over 4276317.26 frames. 
], batch size: 131, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:24:01,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=727464.0, ans=0.125 2023-06-21 06:24:01,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=727464.0, ans=0.125 2023-06-21 06:25:06,955 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:25:28,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-21 06:25:34,525 INFO [train.py:996] (2/4) Epoch 4, batch 29800, loss[loss=0.2345, simple_loss=0.3043, pruned_loss=0.08232, over 21856.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.319, pruned_loss=0.08356, over 4275765.57 frames. ], batch size: 298, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:26:09,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=727704.0, ans=0.125 2023-06-21 06:26:10,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-21 06:28:12,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.465e+02 2.740e+02 3.189e+02 4.611e+02, threshold=5.479e+02, percent-clipped=0.0 2023-06-21 06:28:19,323 INFO [train.py:996] (2/4) Epoch 4, batch 29850, loss[loss=0.2272, simple_loss=0.2874, pruned_loss=0.0835, over 21251.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3143, pruned_loss=0.08128, over 4280044.03 frames. ], batch size: 608, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:28:38,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=728004.0, ans=0.125 2023-06-21 06:29:42,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=728124.0, ans=0.0 2023-06-21 06:31:05,309 INFO [train.py:996] (2/4) Epoch 4, batch 29900, loss[loss=0.2676, simple_loss=0.332, pruned_loss=0.1016, over 21374.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.311, pruned_loss=0.08212, over 4275482.83 frames. ], batch size: 143, lr: 7.47e-03, grad_scale: 16.0 2023-06-21 06:31:08,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=728304.0, ans=0.2 2023-06-21 06:31:18,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.04 vs. limit=5.0 2023-06-21 06:31:50,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.35 vs. limit=15.0 2023-06-21 06:33:31,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.587e+02 2.985e+02 3.529e+02 5.856e+02, threshold=5.969e+02, percent-clipped=3.0 2023-06-21 06:33:48,940 INFO [train.py:996] (2/4) Epoch 4, batch 29950, loss[loss=0.2583, simple_loss=0.3352, pruned_loss=0.09072, over 21833.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3157, pruned_loss=0.08683, over 4270508.68 frames. 
], batch size: 124, lr: 7.47e-03, grad_scale: 16.0 2023-06-21 06:33:52,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=728604.0, ans=0.125 2023-06-21 06:33:54,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=728604.0, ans=0.0 2023-06-21 06:34:35,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=728664.0, ans=0.2 2023-06-21 06:35:38,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=728784.0, ans=0.125 2023-06-21 06:36:14,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=728844.0, ans=0.0 2023-06-21 06:36:22,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=728844.0, ans=0.07 2023-06-21 06:36:36,776 INFO [train.py:996] (2/4) Epoch 4, batch 30000, loss[loss=0.232, simple_loss=0.32, pruned_loss=0.072, over 21715.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3189, pruned_loss=0.08737, over 4271865.41 frames. ], batch size: 247, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:36:36,777 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 06:37:40,812 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2514, simple_loss=0.3484, pruned_loss=0.07722, over 1796401.00 frames. 2023-06-21 06:37:40,813 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 06:37:52,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=728904.0, ans=0.125 2023-06-21 06:38:05,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-21 06:39:36,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=729084.0, ans=0.125 2023-06-21 06:39:45,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=729144.0, ans=0.1 2023-06-21 06:39:53,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=729144.0, ans=0.1 2023-06-21 06:40:00,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.416e+02 2.798e+02 3.507e+02 5.014e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-21 06:40:26,619 INFO [train.py:996] (2/4) Epoch 4, batch 30050, loss[loss=0.2409, simple_loss=0.3369, pruned_loss=0.07244, over 21270.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3226, pruned_loss=0.08563, over 4265066.23 frames. 
], batch size: 548, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:40:57,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=729204.0, ans=0.2 2023-06-21 06:41:20,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=729324.0, ans=0.125 2023-06-21 06:41:20,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=729324.0, ans=0.1 2023-06-21 06:41:34,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=729324.0, ans=0.0 2023-06-21 06:42:32,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-06-21 06:42:42,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=729504.0, ans=0.1 2023-06-21 06:42:43,072 INFO [train.py:996] (2/4) Epoch 4, batch 30100, loss[loss=0.2845, simple_loss=0.3971, pruned_loss=0.08597, over 21229.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3212, pruned_loss=0.08492, over 4260342.15 frames. ], batch size: 549, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:43:42,366 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-06-21 06:44:25,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=729684.0, ans=0.125 2023-06-21 06:45:00,414 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.306e+02 2.761e+02 3.179e+02 3.830e+02 7.077e+02, threshold=6.357e+02, percent-clipped=3.0 2023-06-21 06:45:02,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=729744.0, ans=0.125 2023-06-21 06:45:17,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=729804.0, ans=0.0 2023-06-21 06:45:32,450 INFO [train.py:996] (2/4) Epoch 4, batch 30150, loss[loss=0.3333, simple_loss=0.3729, pruned_loss=0.1468, over 21423.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3172, pruned_loss=0.08613, over 4269624.83 frames. ], batch size: 510, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:45:48,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=729804.0, ans=0.95 2023-06-21 06:46:09,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=729864.0, ans=0.1 2023-06-21 06:46:44,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-21 06:48:29,914 INFO [train.py:996] (2/4) Epoch 4, batch 30200, loss[loss=0.3146, simple_loss=0.3805, pruned_loss=0.1243, over 21458.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3212, pruned_loss=0.08585, over 4271630.95 frames. ], batch size: 471, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:49:16,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.79 vs. 
limit=5.0 2023-06-21 06:49:47,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=730224.0, ans=0.0 2023-06-21 06:51:08,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-21 06:51:08,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.530e+02 2.908e+02 3.490e+02 5.232e+02, threshold=5.817e+02, percent-clipped=0.0 2023-06-21 06:51:14,501 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=15.0 2023-06-21 06:51:14,935 INFO [train.py:996] (2/4) Epoch 4, batch 30250, loss[loss=0.2682, simple_loss=0.3152, pruned_loss=0.1106, over 20129.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3263, pruned_loss=0.08782, over 4273114.70 frames. ], batch size: 707, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:52:17,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=730524.0, ans=0.1 2023-06-21 06:53:05,008 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-21 06:53:47,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=730704.0, ans=0.0 2023-06-21 06:53:48,470 INFO [train.py:996] (2/4) Epoch 4, batch 30300, loss[loss=0.236, simple_loss=0.2934, pruned_loss=0.08925, over 21737.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3242, pruned_loss=0.08783, over 4269840.52 frames. ], batch size: 351, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:54:42,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=730764.0, ans=0.0 2023-06-21 06:54:54,549 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:55:30,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-21 06:55:37,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=730884.0, ans=0.125 2023-06-21 06:56:35,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.610e+02 3.012e+02 3.806e+02 5.738e+02, threshold=6.023e+02, percent-clipped=0.0 2023-06-21 06:56:42,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=730944.0, ans=0.125 2023-06-21 06:56:46,355 INFO [train.py:996] (2/4) Epoch 4, batch 30350, loss[loss=0.3636, simple_loss=0.435, pruned_loss=0.1461, over 21524.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3252, pruned_loss=0.08883, over 4270333.54 frames. ], batch size: 509, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:56:51,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.30 vs. 
limit=10.0 2023-06-21 06:57:00,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=731004.0, ans=0.1 2023-06-21 06:58:08,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=731124.0, ans=0.125 2023-06-21 06:58:15,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=731184.0, ans=0.0 2023-06-21 06:58:32,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=731184.0, ans=0.035 2023-06-21 06:58:54,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-21 06:58:54,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=731184.0, ans=0.125 2023-06-21 07:00:07,932 INFO [train.py:996] (2/4) Epoch 4, batch 30400, loss[loss=0.2519, simple_loss=0.3377, pruned_loss=0.08303, over 21208.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3201, pruned_loss=0.08666, over 4259992.69 frames. ], batch size: 549, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 07:01:40,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-21 07:01:45,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=731364.0, ans=10.0 2023-06-21 07:02:33,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=731424.0, ans=0.1 2023-06-21 07:03:01,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=731424.0, ans=0.125 2023-06-21 07:03:52,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.55 vs. limit=10.0 2023-06-21 07:04:16,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=731484.0, ans=0.125 2023-06-21 07:05:08,745 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 3.449e+02 4.209e+02 5.452e+02 1.525e+03, threshold=8.417e+02, percent-clipped=19.0 2023-06-21 07:05:24,258 INFO [train.py:996] (2/4) Epoch 4, batch 30450, loss[loss=0.308, simple_loss=0.422, pruned_loss=0.097, over 19862.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3211, pruned_loss=0.08688, over 4202588.61 frames. ], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 07:06:28,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=731664.0, ans=12.0 2023-06-21 07:09:26,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=731844.0, ans=0.125 2023-06-21 07:12:03,774 INFO [train.py:996] (2/4) Epoch 5, batch 0, loss[loss=0.2867, simple_loss=0.3241, pruned_loss=0.1246, over 21377.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3241, pruned_loss=0.1246, over 21377.00 frames. 
], batch size: 509, lr: 6.61e-03, grad_scale: 32.0 2023-06-21 07:12:03,775 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 07:12:43,347 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2379, simple_loss=0.3479, pruned_loss=0.06395, over 1796401.00 frames. 2023-06-21 07:12:43,348 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 07:13:00,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=731934.0, ans=0.0 2023-06-21 07:13:39,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=731994.0, ans=0.0 2023-06-21 07:13:41,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.65 vs. limit=22.5 2023-06-21 07:14:23,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=732114.0, ans=0.125 2023-06-21 07:15:06,054 INFO [train.py:996] (2/4) Epoch 5, batch 50, loss[loss=0.2238, simple_loss=0.3106, pruned_loss=0.06848, over 21433.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3252, pruned_loss=0.08659, over 961720.41 frames. ], batch size: 194, lr: 6.60e-03, grad_scale: 32.0 2023-06-21 07:15:13,371 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 3.133e+02 4.909e+02 7.707e+02 2.246e+03, threshold=9.818e+02, percent-clipped=21.0 2023-06-21 07:15:22,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-06-21 07:15:44,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=732294.0, ans=0.0 2023-06-21 07:17:01,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=732414.0, ans=0.1 2023-06-21 07:17:09,049 INFO [train.py:996] (2/4) Epoch 5, batch 100, loss[loss=0.3085, simple_loss=0.3832, pruned_loss=0.1169, over 21763.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3368, pruned_loss=0.08882, over 1693975.46 frames. ], batch size: 441, lr: 6.60e-03, grad_scale: 32.0 2023-06-21 07:18:05,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=732594.0, ans=0.125 2023-06-21 07:18:49,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=732654.0, ans=0.2 2023-06-21 07:19:30,625 INFO [train.py:996] (2/4) Epoch 5, batch 150, loss[loss=0.2765, simple_loss=0.3736, pruned_loss=0.08975, over 21646.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3387, pruned_loss=0.08695, over 2267665.31 frames. 
], batch size: 389, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:19:42,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=732774.0, ans=0.125 2023-06-21 07:19:43,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.868e+02 2.471e+02 2.754e+02 3.178e+02 4.719e+02, threshold=5.509e+02, percent-clipped=0.0 2023-06-21 07:19:50,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=732774.0, ans=0.2 2023-06-21 07:20:31,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=732834.0, ans=0.2 2023-06-21 07:21:57,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-21 07:22:07,532 INFO [train.py:996] (2/4) Epoch 5, batch 200, loss[loss=0.2462, simple_loss=0.3011, pruned_loss=0.09562, over 21752.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.334, pruned_loss=0.08587, over 2714133.60 frames. ], batch size: 112, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:22:27,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=733134.0, ans=0.0 2023-06-21 07:22:46,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=733134.0, ans=0.1 2023-06-21 07:23:15,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=733194.0, ans=0.1 2023-06-21 07:23:18,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=733194.0, ans=0.125 2023-06-21 07:23:19,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=733194.0, ans=0.0 2023-06-21 07:24:22,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=733314.0, ans=0.05 2023-06-21 07:24:31,201 INFO [train.py:996] (2/4) Epoch 5, batch 250, loss[loss=0.3294, simple_loss=0.3791, pruned_loss=0.1398, over 21379.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3303, pruned_loss=0.08625, over 3059905.29 frames. ], batch size: 507, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:24:34,088 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.530e+02 2.881e+02 3.593e+02 5.629e+02, threshold=5.761e+02, percent-clipped=1.0 2023-06-21 07:24:34,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=733374.0, ans=0.0 2023-06-21 07:25:35,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=733494.0, ans=0.1 2023-06-21 07:25:55,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. 
limit=15.0 2023-06-21 07:25:56,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=733554.0, ans=10.0 2023-06-21 07:26:09,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=733554.0, ans=0.2 2023-06-21 07:26:40,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=733614.0, ans=0.125 2023-06-21 07:26:42,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=733614.0, ans=0.125 2023-06-21 07:27:07,460 INFO [train.py:996] (2/4) Epoch 5, batch 300, loss[loss=0.2258, simple_loss=0.2935, pruned_loss=0.07902, over 21443.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3233, pruned_loss=0.0856, over 3321908.79 frames. ], batch size: 211, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:27:41,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=733734.0, ans=0.0 2023-06-21 07:28:55,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-21 07:29:39,575 INFO [train.py:996] (2/4) Epoch 5, batch 350, loss[loss=0.2461, simple_loss=0.3091, pruned_loss=0.09153, over 21434.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3164, pruned_loss=0.08415, over 3541607.63 frames. ], batch size: 194, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:29:51,532 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.569e+02 2.912e+02 3.547e+02 5.180e+02, threshold=5.824e+02, percent-clipped=0.0 2023-06-21 07:31:51,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=734214.0, ans=0.125 2023-06-21 07:31:52,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-21 07:32:07,380 INFO [train.py:996] (2/4) Epoch 5, batch 400, loss[loss=0.1882, simple_loss=0.2498, pruned_loss=0.06328, over 21571.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3125, pruned_loss=0.08303, over 3706716.85 frames. ], batch size: 231, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:32:44,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=734334.0, ans=0.1 2023-06-21 07:34:39,339 INFO [train.py:996] (2/4) Epoch 5, batch 450, loss[loss=0.2108, simple_loss=0.2878, pruned_loss=0.06687, over 21574.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.311, pruned_loss=0.08196, over 3838421.70 frames. ], batch size: 263, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:34:47,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.593e+02 3.148e+02 3.879e+02 6.028e+02, threshold=6.296e+02, percent-clipped=1.0 2023-06-21 07:34:59,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.07 vs. 
limit=15.0 2023-06-21 07:35:52,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=734694.0, ans=0.025 2023-06-21 07:36:39,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=734754.0, ans=0.125 2023-06-21 07:36:39,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=734754.0, ans=0.0 2023-06-21 07:37:05,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=734814.0, ans=0.1 2023-06-21 07:37:07,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=734814.0, ans=0.125 2023-06-21 07:37:22,098 INFO [train.py:996] (2/4) Epoch 5, batch 500, loss[loss=0.2104, simple_loss=0.2706, pruned_loss=0.07508, over 21386.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3085, pruned_loss=0.08054, over 3944278.16 frames. ], batch size: 212, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:37:36,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-21 07:37:53,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=734934.0, ans=0.0 2023-06-21 07:39:34,426 INFO [train.py:996] (2/4) Epoch 5, batch 550, loss[loss=0.248, simple_loss=0.3162, pruned_loss=0.08991, over 21794.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3123, pruned_loss=0.08037, over 4018980.85 frames. ], batch size: 112, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:39:56,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.656e+02 3.238e+02 3.999e+02 7.986e+02, threshold=6.476e+02, percent-clipped=2.0 2023-06-21 07:40:13,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-21 07:40:18,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=735234.0, ans=0.125 2023-06-21 07:42:15,952 INFO [train.py:996] (2/4) Epoch 5, batch 600, loss[loss=0.1966, simple_loss=0.2654, pruned_loss=0.06393, over 21186.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3157, pruned_loss=0.08086, over 4078944.13 frames. ], batch size: 143, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:42:35,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=735534.0, ans=0.125 2023-06-21 07:42:45,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-21 07:43:17,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=735594.0, ans=0.125 2023-06-21 07:44:04,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=735714.0, ans=0.125 2023-06-21 07:44:12,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. 
limit=12.0 2023-06-21 07:44:21,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=735714.0, ans=0.0 2023-06-21 07:44:23,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=735714.0, ans=0.125 2023-06-21 07:44:38,925 INFO [train.py:996] (2/4) Epoch 5, batch 650, loss[loss=0.2447, simple_loss=0.3326, pruned_loss=0.07843, over 21843.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3161, pruned_loss=0.08141, over 4123746.86 frames. ], batch size: 298, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:44:43,360 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 2.568e+02 2.858e+02 3.474e+02 5.611e+02, threshold=5.715e+02, percent-clipped=0.0 2023-06-21 07:45:21,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=15.0 2023-06-21 07:45:52,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=735894.0, ans=0.2 2023-06-21 07:46:08,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=735954.0, ans=0.1 2023-06-21 07:46:34,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=736014.0, ans=0.125 2023-06-21 07:47:06,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=736074.0, ans=0.035 2023-06-21 07:47:07,521 INFO [train.py:996] (2/4) Epoch 5, batch 700, loss[loss=0.256, simple_loss=0.3287, pruned_loss=0.09163, over 21399.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3183, pruned_loss=0.08306, over 4156958.55 frames. ], batch size: 194, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:47:26,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-21 07:48:39,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=736254.0, ans=0.125 2023-06-21 07:48:51,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=736254.0, ans=0.2 2023-06-21 07:48:52,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=736314.0, ans=0.09899494936611666 2023-06-21 07:48:55,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=736314.0, ans=0.5 2023-06-21 07:49:41,156 INFO [train.py:996] (2/4) Epoch 5, batch 750, loss[loss=0.2445, simple_loss=0.3345, pruned_loss=0.07727, over 21436.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3159, pruned_loss=0.08309, over 4195100.32 frames. ], batch size: 211, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:49:43,941 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.823e+02 3.263e+02 3.934e+02 5.736e+02, threshold=6.525e+02, percent-clipped=1.0 2023-06-21 07:49:57,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.87 vs. 
limit=10.0 2023-06-21 07:50:14,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=736494.0, ans=0.125 2023-06-21 07:51:26,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=736614.0, ans=0.0 2023-06-21 07:52:01,040 INFO [train.py:996] (2/4) Epoch 5, batch 800, loss[loss=0.2632, simple_loss=0.319, pruned_loss=0.1037, over 21769.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3124, pruned_loss=0.08306, over 4214550.45 frames. ], batch size: 414, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:53:04,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=736794.0, ans=0.125 2023-06-21 07:53:07,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=736794.0, ans=0.0 2023-06-21 07:54:08,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=736914.0, ans=0.05 2023-06-21 07:54:09,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-21 07:54:31,376 INFO [train.py:996] (2/4) Epoch 5, batch 850, loss[loss=0.2499, simple_loss=0.3153, pruned_loss=0.09218, over 21873.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3107, pruned_loss=0.08271, over 4231096.73 frames. ], batch size: 414, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:54:34,150 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.467e+02 2.771e+02 3.268e+02 5.744e+02, threshold=5.542e+02, percent-clipped=0.0 2023-06-21 07:55:42,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=737154.0, ans=0.125 2023-06-21 07:56:41,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=737214.0, ans=0.125 2023-06-21 07:56:54,556 INFO [train.py:996] (2/4) Epoch 5, batch 900, loss[loss=0.2314, simple_loss=0.2958, pruned_loss=0.08352, over 21652.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3087, pruned_loss=0.0828, over 4246744.24 frames. ], batch size: 230, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:57:45,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.03 vs. limit=10.0 2023-06-21 07:57:59,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0 2023-06-21 07:58:05,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-21 07:58:24,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=8.0 2023-06-21 07:58:25,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=737454.0, ans=0.125 2023-06-21 07:59:28,521 INFO [train.py:996] (2/4) Epoch 5, batch 950, loss[loss=0.1788, simple_loss=0.2701, pruned_loss=0.04372, over 21639.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3065, pruned_loss=0.08214, over 4254478.36 frames. 
], batch size: 230, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:59:31,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.538e+02 2.883e+02 3.307e+02 5.189e+02, threshold=5.766e+02, percent-clipped=0.0 2023-06-21 07:59:33,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=737574.0, ans=0.0 2023-06-21 08:01:28,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=737814.0, ans=0.125 2023-06-21 08:01:38,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-06-21 08:02:02,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=737874.0, ans=0.2 2023-06-21 08:02:02,910 INFO [train.py:996] (2/4) Epoch 5, batch 1000, loss[loss=0.2142, simple_loss=0.273, pruned_loss=0.07773, over 21449.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3066, pruned_loss=0.08097, over 4264054.58 frames. ], batch size: 212, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 08:02:14,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-21 08:02:14,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-21 08:03:16,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=737994.0, ans=0.125 2023-06-21 08:04:25,212 INFO [train.py:996] (2/4) Epoch 5, batch 1050, loss[loss=0.211, simple_loss=0.2927, pruned_loss=0.06468, over 21363.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3063, pruned_loss=0.08069, over 4273720.38 frames. ], batch size: 176, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 08:04:28,166 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.450e+02 2.796e+02 3.213e+02 4.581e+02, threshold=5.591e+02, percent-clipped=0.0 2023-06-21 08:04:28,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=738174.0, ans=0.2 2023-06-21 08:05:35,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=738294.0, ans=0.125 2023-06-21 08:06:19,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=738354.0, ans=0.125 2023-06-21 08:07:09,340 INFO [train.py:996] (2/4) Epoch 5, batch 1100, loss[loss=0.2204, simple_loss=0.2928, pruned_loss=0.07406, over 21508.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3064, pruned_loss=0.08039, over 4276657.58 frames. ], batch size: 211, lr: 6.58e-03, grad_scale: 16.0 2023-06-21 08:07:09,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=738474.0, ans=0.125 2023-06-21 08:08:36,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-21 08:09:28,098 INFO [train.py:996] (2/4) Epoch 5, batch 1150, loss[loss=0.2045, simple_loss=0.2959, pruned_loss=0.0566, over 21318.00 frames. 
], tot_loss[loss=0.2352, simple_loss=0.3086, pruned_loss=0.08091, over 4279478.22 frames. ], batch size: 548, lr: 6.57e-03, grad_scale: 16.0 2023-06-21 08:09:37,095 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.476e+02 2.814e+02 3.322e+02 5.569e+02, threshold=5.628e+02, percent-clipped=0.0 2023-06-21 08:11:44,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=739014.0, ans=0.035 2023-06-21 08:12:23,024 INFO [train.py:996] (2/4) Epoch 5, batch 1200, loss[loss=0.2761, simple_loss=0.3349, pruned_loss=0.1086, over 21619.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3107, pruned_loss=0.08199, over 4278090.51 frames. ], batch size: 471, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:13:47,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-21 08:13:51,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=739254.0, ans=0.1 2023-06-21 08:14:32,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=739314.0, ans=0.125 2023-06-21 08:14:40,378 INFO [train.py:996] (2/4) Epoch 5, batch 1250, loss[loss=0.252, simple_loss=0.3221, pruned_loss=0.09096, over 21353.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.312, pruned_loss=0.08137, over 4280369.79 frames. ], batch size: 548, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:14:48,799 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.637e+02 3.101e+02 3.888e+02 6.560e+02, threshold=6.202e+02, percent-clipped=3.0 2023-06-21 08:15:17,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=739434.0, ans=0.125 2023-06-21 08:15:55,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=739494.0, ans=0.125 2023-06-21 08:16:02,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=739554.0, ans=0.0 2023-06-21 08:16:25,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=739554.0, ans=0.04949747468305833 2023-06-21 08:17:07,372 INFO [train.py:996] (2/4) Epoch 5, batch 1300, loss[loss=0.216, simple_loss=0.2975, pruned_loss=0.06728, over 21443.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3139, pruned_loss=0.08183, over 4289731.94 frames. ], batch size: 195, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:17:52,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=739734.0, ans=0.125 2023-06-21 08:18:34,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=739794.0, ans=0.125 2023-06-21 08:18:36,723 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0 2023-06-21 08:18:48,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=739854.0, ans=0.2 2023-06-21 08:19:41,996 INFO [train.py:996] (2/4) Epoch 5, batch 1350, loss[loss=0.2519, simple_loss=0.3291, pruned_loss=0.08736, over 21811.00 frames. 
], tot_loss[loss=0.2398, simple_loss=0.3148, pruned_loss=0.0824, over 4283354.09 frames. ], batch size: 351, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:19:51,969 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.635e+02 2.966e+02 3.709e+02 5.719e+02, threshold=5.932e+02, percent-clipped=0.0 2023-06-21 08:20:25,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=740034.0, ans=0.04949747468305833 2023-06-21 08:20:26,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=740034.0, ans=0.95 2023-06-21 08:21:32,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-21 08:22:00,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=740214.0, ans=0.125 2023-06-21 08:22:02,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=15.0 2023-06-21 08:22:06,201 INFO [train.py:996] (2/4) Epoch 5, batch 1400, loss[loss=0.2237, simple_loss=0.2984, pruned_loss=0.07451, over 21842.00 frames. ], tot_loss[loss=0.238, simple_loss=0.312, pruned_loss=0.08201, over 4290343.02 frames. ], batch size: 372, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:23:16,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.07 vs. limit=15.0 2023-06-21 08:24:29,371 INFO [train.py:996] (2/4) Epoch 5, batch 1450, loss[loss=0.2565, simple_loss=0.3251, pruned_loss=0.09388, over 21724.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3121, pruned_loss=0.08315, over 4292772.49 frames. ], batch size: 332, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:24:40,586 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.440e+02 2.892e+02 3.416e+02 5.937e+02, threshold=5.784e+02, percent-clipped=1.0 2023-06-21 08:26:05,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=740754.0, ans=0.0 2023-06-21 08:26:05,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-21 08:26:34,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-21 08:26:42,925 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=22.5 2023-06-21 08:26:55,896 INFO [train.py:996] (2/4) Epoch 5, batch 1500, loss[loss=0.2752, simple_loss=0.3698, pruned_loss=0.0903, over 21515.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3136, pruned_loss=0.08425, over 4298442.00 frames. 
], batch size: 471, lr: 6.57e-03, grad_scale: 16.0 2023-06-21 08:27:07,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=740874.0, ans=0.5 2023-06-21 08:27:19,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=740874.0, ans=0.125 2023-06-21 08:27:55,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=740994.0, ans=0.125 2023-06-21 08:28:00,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-21 08:28:41,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=741054.0, ans=0.2 2023-06-21 08:28:51,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=741114.0, ans=0.125 2023-06-21 08:28:58,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=741114.0, ans=0.5 2023-06-21 08:29:22,635 INFO [train.py:996] (2/4) Epoch 5, batch 1550, loss[loss=0.2494, simple_loss=0.3181, pruned_loss=0.09038, over 21229.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3119, pruned_loss=0.08379, over 4300402.78 frames. ], batch size: 143, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:29:34,020 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.613e+02 3.171e+02 3.955e+02 6.837e+02, threshold=6.341e+02, percent-clipped=1.0 2023-06-21 08:29:39,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-21 08:29:53,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=741234.0, ans=0.125 2023-06-21 08:30:37,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=741294.0, ans=0.0 2023-06-21 08:31:19,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-21 08:31:39,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=741414.0, ans=0.1 2023-06-21 08:32:00,162 INFO [train.py:996] (2/4) Epoch 5, batch 1600, loss[loss=0.2316, simple_loss=0.335, pruned_loss=0.06412, over 20890.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3108, pruned_loss=0.08191, over 4298465.84 frames. ], batch size: 607, lr: 6.56e-03, grad_scale: 32.0 2023-06-21 08:32:00,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=741474.0, ans=0.1 2023-06-21 08:32:42,476 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:34:26,754 INFO [train.py:996] (2/4) Epoch 5, batch 1650, loss[loss=0.2036, simple_loss=0.3056, pruned_loss=0.05081, over 21770.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3109, pruned_loss=0.08169, over 4291715.36 frames. 
], batch size: 351, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:34:39,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.504e+02 2.926e+02 3.545e+02 5.904e+02, threshold=5.852e+02, percent-clipped=0.0 2023-06-21 08:34:55,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=741834.0, ans=0.125 2023-06-21 08:35:38,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-21 08:36:53,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=742014.0, ans=0.2 2023-06-21 08:37:08,119 INFO [train.py:996] (2/4) Epoch 5, batch 1700, loss[loss=0.2496, simple_loss=0.3171, pruned_loss=0.09107, over 21840.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3133, pruned_loss=0.08282, over 4291676.51 frames. ], batch size: 298, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:37:13,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=742074.0, ans=0.1 2023-06-21 08:39:33,371 INFO [train.py:996] (2/4) Epoch 5, batch 1750, loss[loss=0.1822, simple_loss=0.271, pruned_loss=0.04669, over 21785.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3124, pruned_loss=0.08146, over 4286759.94 frames. ], batch size: 316, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:39:51,714 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.601e+02 3.019e+02 3.659e+02 6.555e+02, threshold=6.038e+02, percent-clipped=1.0 2023-06-21 08:40:10,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=742374.0, ans=0.0 2023-06-21 08:40:13,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=742434.0, ans=0.125 2023-06-21 08:40:16,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=742434.0, ans=0.0 2023-06-21 08:40:26,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=742434.0, ans=0.0 2023-06-21 08:41:45,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=742554.0, ans=0.0 2023-06-21 08:42:27,567 INFO [train.py:996] (2/4) Epoch 5, batch 1800, loss[loss=0.2182, simple_loss=0.2835, pruned_loss=0.0764, over 21181.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3089, pruned_loss=0.07913, over 4278924.26 frames. ], batch size: 607, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:42:29,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=742674.0, ans=0.1 2023-06-21 08:43:40,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=742794.0, ans=0.0 2023-06-21 08:44:19,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=742854.0, ans=0.04949747468305833 2023-06-21 08:44:52,832 INFO [train.py:996] (2/4) Epoch 5, batch 1850, loss[loss=0.1834, simple_loss=0.273, pruned_loss=0.04694, over 21742.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3089, pruned_loss=0.07624, over 4274241.83 frames. 
], batch size: 298, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:45:21,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.680e+02 2.344e+02 2.714e+02 3.170e+02 5.790e+02, threshold=5.429e+02, percent-clipped=0.0 2023-06-21 08:45:32,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=743034.0, ans=0.0 2023-06-21 08:46:00,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5 2023-06-21 08:46:04,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=743094.0, ans=0.1 2023-06-21 08:46:30,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-21 08:47:06,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=743214.0, ans=0.2 2023-06-21 08:47:40,691 INFO [train.py:996] (2/4) Epoch 5, batch 1900, loss[loss=0.2187, simple_loss=0.28, pruned_loss=0.0787, over 21861.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.308, pruned_loss=0.07628, over 4278121.88 frames. ], batch size: 118, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:47:52,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=743274.0, ans=0.0 2023-06-21 08:48:14,846 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-21 08:48:38,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=743394.0, ans=0.125 2023-06-21 08:49:48,612 INFO [train.py:996] (2/4) Epoch 5, batch 1950, loss[loss=0.1988, simple_loss=0.2764, pruned_loss=0.06056, over 21559.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3049, pruned_loss=0.07653, over 4276812.30 frames. ], batch size: 263, lr: 6.55e-03, grad_scale: 16.0 2023-06-21 08:50:13,868 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.637e+02 3.082e+02 3.738e+02 5.890e+02, threshold=6.165e+02, percent-clipped=2.0 2023-06-21 08:50:17,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-21 08:50:30,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=743634.0, ans=0.0 2023-06-21 08:50:30,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=743634.0, ans=0.1 2023-06-21 08:51:22,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-21 08:52:28,568 INFO [train.py:996] (2/4) Epoch 5, batch 2000, loss[loss=0.2121, simple_loss=0.2666, pruned_loss=0.07878, over 20793.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3011, pruned_loss=0.07586, over 4276512.71 frames. ], batch size: 607, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 08:52:35,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.61 vs. 
limit=10.0 2023-06-21 08:53:22,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=743994.0, ans=15.0 2023-06-21 08:54:14,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=744054.0, ans=0.2 2023-06-21 08:54:27,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=8.0 2023-06-21 08:54:44,410 INFO [train.py:996] (2/4) Epoch 5, batch 2050, loss[loss=0.2348, simple_loss=0.3169, pruned_loss=0.07631, over 21875.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3033, pruned_loss=0.07698, over 4281929.12 frames. ], batch size: 107, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 08:55:04,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.606e+02 2.968e+02 3.488e+02 6.810e+02, threshold=5.937e+02, percent-clipped=2.0 2023-06-21 08:56:41,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=744354.0, ans=0.125 2023-06-21 08:56:56,111 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-21 08:57:04,044 INFO [train.py:996] (2/4) Epoch 5, batch 2100, loss[loss=0.2712, simple_loss=0.3488, pruned_loss=0.09679, over 21892.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3052, pruned_loss=0.07804, over 4286471.76 frames. ], batch size: 316, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 08:57:07,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=744474.0, ans=0.125 2023-06-21 08:57:07,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=744474.0, ans=0.0 2023-06-21 08:58:00,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=744534.0, ans=0.2 2023-06-21 08:59:35,521 INFO [train.py:996] (2/4) Epoch 5, batch 2150, loss[loss=0.2156, simple_loss=0.2812, pruned_loss=0.07497, over 21665.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.308, pruned_loss=0.07957, over 4282845.74 frames. ], batch size: 333, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 09:00:06,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.554e+02 3.162e+02 3.775e+02 6.322e+02, threshold=6.325e+02, percent-clipped=1.0 2023-06-21 09:00:10,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=744774.0, ans=0.1 2023-06-21 09:00:26,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=744834.0, ans=0.0 2023-06-21 09:00:54,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=744894.0, ans=0.1 2023-06-21 09:01:36,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. 
limit=15.0 2023-06-21 09:01:37,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=744954.0, ans=0.0 2023-06-21 09:02:01,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=745014.0, ans=0.0 2023-06-21 09:02:08,197 INFO [train.py:996] (2/4) Epoch 5, batch 2200, loss[loss=0.1871, simple_loss=0.2689, pruned_loss=0.05269, over 21214.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3087, pruned_loss=0.07962, over 4276946.06 frames. ], batch size: 176, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 09:04:39,504 INFO [train.py:996] (2/4) Epoch 5, batch 2250, loss[loss=0.2123, simple_loss=0.2828, pruned_loss=0.07092, over 21783.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3054, pruned_loss=0.07712, over 4277322.67 frames. ], batch size: 371, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 09:04:44,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=745374.0, ans=0.0 2023-06-21 09:04:48,406 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.467e+02 2.789e+02 3.250e+02 6.214e+02, threshold=5.578e+02, percent-clipped=0.0 2023-06-21 09:05:54,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=745494.0, ans=15.0 2023-06-21 09:06:18,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=745554.0, ans=0.1 2023-06-21 09:06:26,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=745614.0, ans=0.125 2023-06-21 09:06:49,372 INFO [train.py:996] (2/4) Epoch 5, batch 2300, loss[loss=0.2508, simple_loss=0.3472, pruned_loss=0.07721, over 20738.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2999, pruned_loss=0.07621, over 4267865.94 frames. ], batch size: 608, lr: 6.54e-03, grad_scale: 32.0 2023-06-21 09:07:39,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=745734.0, ans=0.125 2023-06-21 09:08:13,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=745854.0, ans=0.1 2023-06-21 09:08:36,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-21 09:09:04,698 INFO [train.py:996] (2/4) Epoch 5, batch 2350, loss[loss=0.2216, simple_loss=0.2835, pruned_loss=0.07991, over 21175.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2979, pruned_loss=0.07669, over 4264070.21 frames. ], batch size: 176, lr: 6.54e-03, grad_scale: 32.0 2023-06-21 09:09:18,950 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.648e+02 3.067e+02 3.920e+02 6.116e+02, threshold=6.134e+02, percent-clipped=2.0 2023-06-21 09:09:28,526 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.12 vs. 
limit=15.0 2023-06-21 09:09:49,254 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:11:02,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=746154.0, ans=0.1 2023-06-21 09:11:02,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=746154.0, ans=0.1 2023-06-21 09:11:13,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=746214.0, ans=0.125 2023-06-21 09:11:13,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=746214.0, ans=0.125 2023-06-21 09:11:19,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=746214.0, ans=0.125 2023-06-21 09:11:46,923 INFO [train.py:996] (2/4) Epoch 5, batch 2400, loss[loss=0.2542, simple_loss=0.3214, pruned_loss=0.09349, over 21468.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3007, pruned_loss=0.07899, over 4266233.01 frames. ], batch size: 211, lr: 6.54e-03, grad_scale: 32.0 2023-06-21 09:12:31,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=746334.0, ans=0.0 2023-06-21 09:12:36,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=746394.0, ans=0.125 2023-06-21 09:13:26,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=746454.0, ans=0.125 2023-06-21 09:13:42,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=746454.0, ans=0.035 2023-06-21 09:14:12,776 INFO [train.py:996] (2/4) Epoch 5, batch 2450, loss[loss=0.2151, simple_loss=0.3019, pruned_loss=0.06418, over 21815.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3086, pruned_loss=0.08189, over 4261237.35 frames. ], batch size: 118, lr: 6.54e-03, grad_scale: 32.0 2023-06-21 09:14:38,909 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.774e+02 3.114e+02 3.672e+02 6.323e+02, threshold=6.229e+02, percent-clipped=1.0 2023-06-21 09:16:02,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=746754.0, ans=0.1 2023-06-21 09:16:20,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=746814.0, ans=0.125 2023-06-21 09:16:24,619 INFO [train.py:996] (2/4) Epoch 5, batch 2500, loss[loss=0.2302, simple_loss=0.2963, pruned_loss=0.0821, over 21237.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3072, pruned_loss=0.08172, over 4265196.99 frames. ], batch size: 159, lr: 6.54e-03, grad_scale: 16.0 2023-06-21 09:16:30,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=746874.0, ans=0.5 2023-06-21 09:17:24,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=746994.0, ans=0.1 2023-06-21 09:17:31,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.73 vs. 
limit=15.0 2023-06-21 09:17:32,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=746994.0, ans=0.125 2023-06-21 09:18:42,277 INFO [train.py:996] (2/4) Epoch 5, batch 2550, loss[loss=0.2106, simple_loss=0.2753, pruned_loss=0.07294, over 21527.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3051, pruned_loss=0.08056, over 4263813.98 frames. ], batch size: 391, lr: 6.54e-03, grad_scale: 16.0 2023-06-21 09:18:45,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=747174.0, ans=0.1 2023-06-21 09:18:50,716 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.510e+02 2.866e+02 3.285e+02 4.415e+02, threshold=5.731e+02, percent-clipped=0.0 2023-06-21 09:19:03,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=747174.0, ans=0.125 2023-06-21 09:20:09,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=747354.0, ans=0.0 2023-06-21 09:20:12,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=747354.0, ans=0.5 2023-06-21 09:20:59,461 INFO [train.py:996] (2/4) Epoch 5, batch 2600, loss[loss=0.269, simple_loss=0.3316, pruned_loss=0.1032, over 21321.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3057, pruned_loss=0.0808, over 4268893.84 frames. ], batch size: 143, lr: 6.54e-03, grad_scale: 16.0 2023-06-21 09:21:13,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=747474.0, ans=0.0 2023-06-21 09:22:05,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=747594.0, ans=0.0 2023-06-21 09:22:13,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=747594.0, ans=0.125 2023-06-21 09:23:12,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=747714.0, ans=0.125 2023-06-21 09:23:26,448 INFO [train.py:996] (2/4) Epoch 5, batch 2650, loss[loss=0.2288, simple_loss=0.3151, pruned_loss=0.07122, over 21852.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3085, pruned_loss=0.08297, over 4277199.67 frames. ], batch size: 282, lr: 6.54e-03, grad_scale: 16.0 2023-06-21 09:23:35,112 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.828e+02 3.188e+02 4.094e+02 7.867e+02, threshold=6.375e+02, percent-clipped=3.0 2023-06-21 09:23:47,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=747834.0, ans=0.1 2023-06-21 09:25:49,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=748014.0, ans=0.0 2023-06-21 09:25:52,123 INFO [train.py:996] (2/4) Epoch 5, batch 2700, loss[loss=0.1934, simple_loss=0.2566, pruned_loss=0.06512, over 21461.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3064, pruned_loss=0.08174, over 4278582.88 frames. ], batch size: 195, lr: 6.53e-03, grad_scale: 16.0 2023-06-21 09:25:54,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.29 vs. 
limit=22.5 2023-06-21 09:26:40,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=748134.0, ans=0.04949747468305833 2023-06-21 09:26:48,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748134.0, ans=0.1 2023-06-21 09:26:49,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=748134.0, ans=0.0 2023-06-21 09:26:57,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=748194.0, ans=0.1 2023-06-21 09:27:22,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.18 vs. limit=10.0 2023-06-21 09:28:05,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748314.0, ans=0.1 2023-06-21 09:28:17,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=748374.0, ans=0.125 2023-06-21 09:28:18,191 INFO [train.py:996] (2/4) Epoch 5, batch 2750, loss[loss=0.2725, simple_loss=0.3249, pruned_loss=0.1101, over 21726.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3052, pruned_loss=0.0819, over 4279309.20 frames. ], batch size: 473, lr: 6.53e-03, grad_scale: 16.0 2023-06-21 09:28:33,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.495e+02 2.844e+02 3.275e+02 5.915e+02, threshold=5.688e+02, percent-clipped=0.0 2023-06-21 09:28:35,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=748374.0, ans=0.0 2023-06-21 09:29:17,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=748434.0, ans=0.07 2023-06-21 09:31:01,731 INFO [train.py:996] (2/4) Epoch 5, batch 2800, loss[loss=0.214, simple_loss=0.3317, pruned_loss=0.04817, over 19725.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3107, pruned_loss=0.08334, over 4281284.81 frames. ], batch size: 702, lr: 6.53e-03, grad_scale: 32.0 2023-06-21 09:31:05,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=748674.0, ans=0.0 2023-06-21 09:32:15,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=748794.0, ans=0.05 2023-06-21 09:32:20,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-21 09:33:40,818 INFO [train.py:996] (2/4) Epoch 5, batch 2850, loss[loss=0.2324, simple_loss=0.3432, pruned_loss=0.06077, over 19856.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.313, pruned_loss=0.08366, over 4280474.80 frames. 
], batch size: 704, lr: 6.53e-03, grad_scale: 32.0 2023-06-21 09:33:41,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748974.0, ans=0.1 2023-06-21 09:34:00,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.971e+02 3.664e+02 4.232e+02 8.442e+02, threshold=7.329e+02, percent-clipped=6.0 2023-06-21 09:34:45,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=749094.0, ans=0.5 2023-06-21 09:34:57,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=749154.0, ans=0.0 2023-06-21 09:36:12,164 INFO [train.py:996] (2/4) Epoch 5, batch 2900, loss[loss=0.1834, simple_loss=0.2331, pruned_loss=0.06682, over 20727.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3096, pruned_loss=0.08291, over 4287677.41 frames. ], batch size: 608, lr: 6.53e-03, grad_scale: 32.0 2023-06-21 09:36:27,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=749274.0, ans=0.2 2023-06-21 09:36:55,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=749334.0, ans=0.0 2023-06-21 09:37:09,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=749394.0, ans=0.0 2023-06-21 09:37:46,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-21 09:38:34,559 INFO [train.py:996] (2/4) Epoch 5, batch 2950, loss[loss=0.2122, simple_loss=0.2927, pruned_loss=0.06589, over 21710.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.31, pruned_loss=0.08324, over 4295157.63 frames. ], batch size: 112, lr: 6.53e-03, grad_scale: 32.0 2023-06-21 09:38:42,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-21 09:38:48,673 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.603e+02 2.918e+02 3.396e+02 5.731e+02, threshold=5.836e+02, percent-clipped=0.0 2023-06-21 09:39:00,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=749634.0, ans=0.0 2023-06-21 09:39:37,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=749694.0, ans=0.125 2023-06-21 09:39:37,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=749694.0, ans=0.2 2023-06-21 09:39:55,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=749694.0, ans=0.1 2023-06-21 09:40:24,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=749754.0, ans=0.125 2023-06-21 09:40:55,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=749814.0, ans=0.125 2023-06-21 09:41:10,940 INFO [train.py:996] (2/4) Epoch 5, batch 3000, loss[loss=0.289, simple_loss=0.3494, pruned_loss=0.1143, over 21338.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3142, pruned_loss=0.08464, over 4293902.64 frames. 
], batch size: 548, lr: 6.53e-03, grad_scale: 32.0 2023-06-21 09:41:10,940 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 09:42:09,554 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2543, simple_loss=0.346, pruned_loss=0.08133, over 1796401.00 frames. 2023-06-21 09:42:09,560 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 09:42:21,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=749874.0, ans=0.1 2023-06-21 09:43:11,773 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:44:23,314 INFO [train.py:996] (2/4) Epoch 5, batch 3050, loss[loss=0.231, simple_loss=0.3226, pruned_loss=0.06966, over 19863.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3138, pruned_loss=0.0829, over 4293049.62 frames. ], batch size: 703, lr: 6.52e-03, grad_scale: 32.0 2023-06-21 09:44:25,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=750174.0, ans=0.125 2023-06-21 09:44:43,389 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.527e+02 2.843e+02 3.371e+02 5.319e+02, threshold=5.686e+02, percent-clipped=0.0 2023-06-21 09:46:14,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=750354.0, ans=0.125 2023-06-21 09:46:16,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=750354.0, ans=0.2 2023-06-21 09:46:48,370 INFO [train.py:996] (2/4) Epoch 5, batch 3100, loss[loss=0.2399, simple_loss=0.3277, pruned_loss=0.0761, over 21714.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3138, pruned_loss=0.08202, over 4283543.18 frames. ], batch size: 351, lr: 6.52e-03, grad_scale: 32.0 2023-06-21 09:47:18,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=750534.0, ans=0.0 2023-06-21 09:47:41,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=750534.0, ans=0.0 2023-06-21 09:47:49,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=750594.0, ans=0.04949747468305833 2023-06-21 09:47:53,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=750594.0, ans=0.2 2023-06-21 09:48:25,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=750654.0, ans=0.125 2023-06-21 09:48:57,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.65 vs. limit=22.5 2023-06-21 09:49:08,172 INFO [train.py:996] (2/4) Epoch 5, batch 3150, loss[loss=0.2792, simple_loss=0.3493, pruned_loss=0.1045, over 21272.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3151, pruned_loss=0.08225, over 4277957.15 frames. 
], batch size: 143, lr: 6.52e-03, grad_scale: 32.0 2023-06-21 09:49:26,478 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.529e+02 2.952e+02 3.587e+02 6.103e+02, threshold=5.905e+02, percent-clipped=1.0 2023-06-21 09:50:04,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=750834.0, ans=0.0 2023-06-21 09:51:58,574 INFO [train.py:996] (2/4) Epoch 5, batch 3200, loss[loss=0.2066, simple_loss=0.2973, pruned_loss=0.05797, over 21811.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3177, pruned_loss=0.08274, over 4276075.25 frames. ], batch size: 282, lr: 6.52e-03, grad_scale: 32.0 2023-06-21 09:52:13,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=751074.0, ans=0.125 2023-06-21 09:52:26,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=751134.0, ans=0.1 2023-06-21 09:53:05,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=751194.0, ans=0.125 2023-06-21 09:53:09,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=751194.0, ans=0.0 2023-06-21 09:53:55,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=751314.0, ans=0.0 2023-06-21 09:53:57,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=751314.0, ans=0.2 2023-06-21 09:54:09,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=15.0 2023-06-21 09:54:09,984 INFO [train.py:996] (2/4) Epoch 5, batch 3250, loss[loss=0.2396, simple_loss=0.307, pruned_loss=0.08612, over 21812.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.319, pruned_loss=0.08436, over 4269709.29 frames. ], batch size: 118, lr: 6.52e-03, grad_scale: 16.0 2023-06-21 09:54:33,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-21 09:54:34,747 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.744e+02 3.235e+02 3.683e+02 5.247e+02, threshold=6.470e+02, percent-clipped=0.0 2023-06-21 09:55:41,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=751554.0, ans=0.125 2023-06-21 09:56:00,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=751554.0, ans=0.0 2023-06-21 09:56:54,120 INFO [train.py:996] (2/4) Epoch 5, batch 3300, loss[loss=0.2218, simple_loss=0.319, pruned_loss=0.06234, over 21759.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3139, pruned_loss=0.08333, over 4260789.22 frames. ], batch size: 351, lr: 6.52e-03, grad_scale: 16.0 2023-06-21 09:58:51,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=751914.0, ans=0.125 2023-06-21 09:58:55,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.04 vs. 
limit=10.0 2023-06-21 09:59:09,224 INFO [train.py:996] (2/4) Epoch 5, batch 3350, loss[loss=0.2583, simple_loss=0.3291, pruned_loss=0.09376, over 21791.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3158, pruned_loss=0.08397, over 4264261.35 frames. ], batch size: 414, lr: 6.52e-03, grad_scale: 16.0 2023-06-21 09:59:38,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.776e+02 3.180e+02 3.722e+02 8.013e+02, threshold=6.359e+02, percent-clipped=4.0 2023-06-21 10:00:40,024 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-21 10:01:27,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=752214.0, ans=0.04949747468305833 2023-06-21 10:01:43,802 INFO [train.py:996] (2/4) Epoch 5, batch 3400, loss[loss=0.2277, simple_loss=0.2957, pruned_loss=0.07981, over 21633.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3167, pruned_loss=0.08502, over 4274068.70 frames. ], batch size: 332, lr: 6.52e-03, grad_scale: 16.0 2023-06-21 10:04:00,533 INFO [train.py:996] (2/4) Epoch 5, batch 3450, loss[loss=0.3642, simple_loss=0.4287, pruned_loss=0.1498, over 21426.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3122, pruned_loss=0.0842, over 4279013.74 frames. ], batch size: 471, lr: 6.51e-03, grad_scale: 16.0 2023-06-21 10:04:11,207 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.675e+02 2.919e+02 3.509e+02 4.747e+02, threshold=5.839e+02, percent-clipped=0.0 2023-06-21 10:05:39,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=752754.0, ans=0.1 2023-06-21 10:06:18,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=752814.0, ans=0.125 2023-06-21 10:06:26,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=752874.0, ans=0.1 2023-06-21 10:06:27,100 INFO [train.py:996] (2/4) Epoch 5, batch 3500, loss[loss=0.2596, simple_loss=0.3315, pruned_loss=0.09388, over 21246.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3193, pruned_loss=0.08748, over 4280310.85 frames. ], batch size: 159, lr: 6.51e-03, grad_scale: 16.0 2023-06-21 10:07:15,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=12.0 2023-06-21 10:08:29,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=753114.0, ans=0.2 2023-06-21 10:08:50,563 INFO [train.py:996] (2/4) Epoch 5, batch 3550, loss[loss=0.2224, simple_loss=0.283, pruned_loss=0.08087, over 21374.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3213, pruned_loss=0.08852, over 4280288.90 frames. 
], batch size: 194, lr: 6.51e-03, grad_scale: 16.0 2023-06-21 10:09:06,196 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.617e+02 3.171e+02 3.907e+02 6.956e+02, threshold=6.342e+02, percent-clipped=4.0 2023-06-21 10:09:35,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=753234.0, ans=0.125 2023-06-21 10:10:07,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=753354.0, ans=0.125 2023-06-21 10:10:07,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=753354.0, ans=0.125 2023-06-21 10:10:45,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.25 vs. limit=15.0 2023-06-21 10:10:52,572 INFO [train.py:996] (2/4) Epoch 5, batch 3600, loss[loss=0.2403, simple_loss=0.3012, pruned_loss=0.08975, over 21660.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3155, pruned_loss=0.08867, over 4281725.00 frames. ], batch size: 298, lr: 6.51e-03, grad_scale: 32.0 2023-06-21 10:11:14,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-06-21 10:11:15,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=753534.0, ans=0.2 2023-06-21 10:11:16,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-21 10:11:40,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=753594.0, ans=0.125 2023-06-21 10:11:53,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=753594.0, ans=0.2 2023-06-21 10:12:38,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=753714.0, ans=0.0 2023-06-21 10:12:59,209 INFO [train.py:996] (2/4) Epoch 5, batch 3650, loss[loss=0.2195, simple_loss=0.2812, pruned_loss=0.07893, over 21913.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3169, pruned_loss=0.08872, over 4282207.09 frames. ], batch size: 107, lr: 6.51e-03, grad_scale: 32.0 2023-06-21 10:13:10,053 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.602e+02 2.931e+02 3.344e+02 6.459e+02, threshold=5.862e+02, percent-clipped=1.0 2023-06-21 10:15:21,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=754014.0, ans=0.0 2023-06-21 10:15:29,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=754014.0, ans=0.2 2023-06-21 10:15:32,322 INFO [train.py:996] (2/4) Epoch 5, batch 3700, loss[loss=0.2391, simple_loss=0.3124, pruned_loss=0.08294, over 21861.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3151, pruned_loss=0.08655, over 4286650.87 frames. 
], batch size: 332, lr: 6.51e-03, grad_scale: 32.0 2023-06-21 10:15:37,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=754074.0, ans=0.2 2023-06-21 10:16:14,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=754194.0, ans=0.025 2023-06-21 10:16:38,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=754194.0, ans=0.125 2023-06-21 10:17:15,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=754314.0, ans=0.1 2023-06-21 10:17:26,623 INFO [train.py:996] (2/4) Epoch 5, batch 3750, loss[loss=0.1965, simple_loss=0.2672, pruned_loss=0.06288, over 21872.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3123, pruned_loss=0.08531, over 4287088.17 frames. ], batch size: 124, lr: 6.51e-03, grad_scale: 32.0 2023-06-21 10:17:29,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=754374.0, ans=0.1 2023-06-21 10:17:37,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.410e+02 2.787e+02 3.137e+02 4.786e+02, threshold=5.574e+02, percent-clipped=0.0 2023-06-21 10:18:34,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=754494.0, ans=0.0 2023-06-21 10:20:02,554 INFO [train.py:996] (2/4) Epoch 5, batch 3800, loss[loss=0.2539, simple_loss=0.3172, pruned_loss=0.0953, over 21824.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.311, pruned_loss=0.08362, over 4283748.58 frames. ], batch size: 247, lr: 6.51e-03, grad_scale: 32.0 2023-06-21 10:21:23,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=754854.0, ans=0.125 2023-06-21 10:21:38,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.62 vs. limit=6.0 2023-06-21 10:21:55,056 INFO [train.py:996] (2/4) Epoch 5, batch 3850, loss[loss=0.2352, simple_loss=0.2985, pruned_loss=0.08595, over 21848.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3115, pruned_loss=0.08502, over 4280906.85 frames. ], batch size: 118, lr: 6.50e-03, grad_scale: 32.0 2023-06-21 10:21:57,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=754974.0, ans=0.125 2023-06-21 10:22:22,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.524e+02 3.055e+02 3.931e+02 8.028e+02, threshold=6.111e+02, percent-clipped=3.0 2023-06-21 10:22:28,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=755034.0, ans=0.0 2023-06-21 10:24:04,989 INFO [train.py:996] (2/4) Epoch 5, batch 3900, loss[loss=0.219, simple_loss=0.2814, pruned_loss=0.07824, over 21313.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3065, pruned_loss=0.08447, over 4285964.87 frames. 
], batch size: 159, lr: 6.50e-03, grad_scale: 32.0 2023-06-21 10:24:08,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=755274.0, ans=0.125 2023-06-21 10:24:31,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-21 10:25:35,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=755454.0, ans=0.2 2023-06-21 10:25:41,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=22.5 2023-06-21 10:25:54,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-21 10:26:19,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=755514.0, ans=0.0 2023-06-21 10:26:27,939 INFO [train.py:996] (2/4) Epoch 5, batch 3950, loss[loss=0.2562, simple_loss=0.3428, pruned_loss=0.08477, over 20702.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3073, pruned_loss=0.08329, over 4287004.92 frames. ], batch size: 607, lr: 6.50e-03, grad_scale: 32.0 2023-06-21 10:26:28,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=755574.0, ans=0.125 2023-06-21 10:26:46,966 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.463e+02 2.789e+02 3.515e+02 5.351e+02, threshold=5.577e+02, percent-clipped=0.0 2023-06-21 10:27:39,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=755694.0, ans=0.125 2023-06-21 10:28:16,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=755754.0, ans=0.125 2023-06-21 10:28:18,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=12.0 2023-06-21 10:28:48,516 INFO [train.py:996] (2/4) Epoch 5, batch 4000, loss[loss=0.2401, simple_loss=0.3106, pruned_loss=0.08474, over 20192.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3008, pruned_loss=0.08087, over 4271484.18 frames. ], batch size: 703, lr: 6.50e-03, grad_scale: 32.0 2023-06-21 10:28:48,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=755874.0, ans=0.2 2023-06-21 10:29:55,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=755994.0, ans=0.1 2023-06-21 10:30:54,383 INFO [train.py:996] (2/4) Epoch 5, batch 4050, loss[loss=0.1975, simple_loss=0.2899, pruned_loss=0.05251, over 21647.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2991, pruned_loss=0.0779, over 4276950.38 frames. 
], batch size: 263, lr: 6.50e-03, grad_scale: 32.0 2023-06-21 10:31:15,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756174.0, ans=0.1 2023-06-21 10:31:18,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756174.0, ans=0.1 2023-06-21 10:31:19,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756174.0, ans=0.1 2023-06-21 10:31:20,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.698e+02 2.419e+02 2.822e+02 3.375e+02 5.095e+02, threshold=5.643e+02, percent-clipped=0.0 2023-06-21 10:32:01,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756234.0, ans=0.1 2023-06-21 10:33:22,504 INFO [train.py:996] (2/4) Epoch 5, batch 4100, loss[loss=0.2404, simple_loss=0.3229, pruned_loss=0.07892, over 21790.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3002, pruned_loss=0.07797, over 4280534.38 frames. ], batch size: 391, lr: 6.50e-03, grad_scale: 16.0 2023-06-21 10:34:08,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756594.0, ans=0.1 2023-06-21 10:34:34,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756654.0, ans=0.1 2023-06-21 10:34:57,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=756714.0, ans=0.2 2023-06-21 10:35:00,547 INFO [train.py:996] (2/4) Epoch 5, batch 4150, loss[loss=0.1813, simple_loss=0.277, pruned_loss=0.0428, over 21576.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3015, pruned_loss=0.07558, over 4264429.19 frames. ], batch size: 230, lr: 6.50e-03, grad_scale: 16.0 2023-06-21 10:35:08,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=756774.0, ans=0.125 2023-06-21 10:35:12,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 2.501e+02 2.940e+02 3.469e+02 5.994e+02, threshold=5.880e+02, percent-clipped=1.0 2023-06-21 10:36:14,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=756954.0, ans=0.07 2023-06-21 10:36:17,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=756954.0, ans=0.125 2023-06-21 10:36:46,374 INFO [train.py:996] (2/4) Epoch 5, batch 4200, loss[loss=0.204, simple_loss=0.2761, pruned_loss=0.06595, over 21466.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3011, pruned_loss=0.07535, over 4263208.72 frames. ], batch size: 212, lr: 6.50e-03, grad_scale: 16.0 2023-06-21 10:37:19,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-21 10:37:19,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=757134.0, ans=0.2 2023-06-21 10:38:21,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. 
limit=15.0 2023-06-21 10:38:38,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=757314.0, ans=0.0 2023-06-21 10:38:47,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=757314.0, ans=0.125 2023-06-21 10:38:58,440 INFO [train.py:996] (2/4) Epoch 5, batch 4250, loss[loss=0.2599, simple_loss=0.3285, pruned_loss=0.09568, over 21283.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3071, pruned_loss=0.07871, over 4253263.33 frames. ], batch size: 159, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:39:19,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.640e+02 3.180e+02 4.167e+02 9.459e+02, threshold=6.360e+02, percent-clipped=16.0 2023-06-21 10:40:22,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=757554.0, ans=0.125 2023-06-21 10:40:39,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=757614.0, ans=0.0 2023-06-21 10:40:53,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=757614.0, ans=0.0 2023-06-21 10:40:59,202 INFO [train.py:996] (2/4) Epoch 5, batch 4300, loss[loss=0.2161, simple_loss=0.3102, pruned_loss=0.06098, over 21812.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3147, pruned_loss=0.08165, over 4260850.22 frames. ], batch size: 282, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:42:56,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-21 10:43:29,604 INFO [train.py:996] (2/4) Epoch 5, batch 4350, loss[loss=0.2487, simple_loss=0.3478, pruned_loss=0.07481, over 19911.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3116, pruned_loss=0.08019, over 4261102.33 frames. ], batch size: 702, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:43:42,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=757974.0, ans=0.015 2023-06-21 10:43:42,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=757974.0, ans=0.125 2023-06-21 10:43:46,270 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.644e+02 3.174e+02 3.856e+02 7.919e+02, threshold=6.347e+02, percent-clipped=3.0 2023-06-21 10:44:42,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=758154.0, ans=0.125 2023-06-21 10:44:54,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=758154.0, ans=0.2 2023-06-21 10:44:54,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-21 10:45:13,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=758214.0, ans=0.125 2023-06-21 10:45:25,085 INFO [train.py:996] (2/4) Epoch 5, batch 4400, loss[loss=0.2126, simple_loss=0.2879, pruned_loss=0.06871, over 21203.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3088, pruned_loss=0.08005, over 4257430.09 frames. 
], batch size: 143, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:46:39,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=758394.0, ans=0.125 2023-06-21 10:47:02,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=758454.0, ans=0.125 2023-06-21 10:47:48,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=758514.0, ans=0.2 2023-06-21 10:47:50,942 INFO [train.py:996] (2/4) Epoch 5, batch 4450, loss[loss=0.2681, simple_loss=0.3518, pruned_loss=0.09219, over 21865.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3173, pruned_loss=0.08229, over 4259065.13 frames. ], batch size: 371, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:48:08,173 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.525e+02 2.989e+02 3.664e+02 5.986e+02, threshold=5.979e+02, percent-clipped=0.0 2023-06-21 10:48:38,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=758634.0, ans=0.1 2023-06-21 10:48:38,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=758634.0, ans=0.125 2023-06-21 10:48:58,010 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-06-21 10:49:01,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=758694.0, ans=0.0 2023-06-21 10:49:02,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=758694.0, ans=0.125 2023-06-21 10:50:13,879 INFO [train.py:996] (2/4) Epoch 5, batch 4500, loss[loss=0.2596, simple_loss=0.3592, pruned_loss=0.08001, over 21255.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3177, pruned_loss=0.08352, over 4258094.20 frames. ], batch size: 548, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:50:44,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=758934.0, ans=0.05 2023-06-21 10:50:56,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=758934.0, ans=0.125 2023-06-21 10:51:08,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=758994.0, ans=0.125 2023-06-21 10:51:24,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=758994.0, ans=0.125 2023-06-21 10:51:26,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=758994.0, ans=0.125 2023-06-21 10:51:52,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=759054.0, ans=0.125 2023-06-21 10:52:36,247 INFO [train.py:996] (2/4) Epoch 5, batch 4550, loss[loss=0.2338, simple_loss=0.3005, pruned_loss=0.08357, over 21316.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3225, pruned_loss=0.08418, over 4266307.28 frames. 
], batch size: 551, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:52:59,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=759174.0, ans=0.0 2023-06-21 10:53:00,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.644e+02 2.944e+02 3.521e+02 6.236e+02, threshold=5.889e+02, percent-clipped=2.0 2023-06-21 10:54:09,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=759354.0, ans=0.125 2023-06-21 10:54:54,674 INFO [train.py:996] (2/4) Epoch 5, batch 4600, loss[loss=0.2421, simple_loss=0.3097, pruned_loss=0.08718, over 21290.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3241, pruned_loss=0.08571, over 4273202.48 frames. ], batch size: 143, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:55:24,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.25 vs. limit=5.0 2023-06-21 10:55:47,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=759594.0, ans=0.2 2023-06-21 10:56:06,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=759654.0, ans=0.07 2023-06-21 10:56:30,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=759654.0, ans=0.1 2023-06-21 10:57:01,253 INFO [train.py:996] (2/4) Epoch 5, batch 4650, loss[loss=0.1883, simple_loss=0.2579, pruned_loss=0.05934, over 21273.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.317, pruned_loss=0.08314, over 4281690.48 frames. ], batch size: 159, lr: 6.48e-03, grad_scale: 16.0 2023-06-21 10:57:33,515 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 2.446e+02 2.893e+02 3.577e+02 6.132e+02, threshold=5.786e+02, percent-clipped=2.0 2023-06-21 10:57:39,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=759834.0, ans=0.0 2023-06-21 10:57:42,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=759834.0, ans=0.0 2023-06-21 10:58:51,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=759954.0, ans=0.04949747468305833 2023-06-21 10:58:52,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=760014.0, ans=0.125 2023-06-21 10:59:05,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=760014.0, ans=0.125 2023-06-21 10:59:26,249 INFO [train.py:996] (2/4) Epoch 5, batch 4700, loss[loss=0.1912, simple_loss=0.2589, pruned_loss=0.0617, over 21525.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.308, pruned_loss=0.08141, over 4280340.94 frames. 
], batch size: 230, lr: 6.48e-03, grad_scale: 16.0 2023-06-21 10:59:43,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=760074.0, ans=0.125 2023-06-21 10:59:50,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=760134.0, ans=0.0 2023-06-21 10:59:56,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=760134.0, ans=0.125 2023-06-21 11:01:43,755 INFO [train.py:996] (2/4) Epoch 5, batch 4750, loss[loss=0.2511, simple_loss=0.3115, pruned_loss=0.09534, over 21687.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3038, pruned_loss=0.08135, over 4279596.90 frames. ], batch size: 389, lr: 6.48e-03, grad_scale: 16.0 2023-06-21 11:01:58,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=760374.0, ans=0.125 2023-06-21 11:02:04,341 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.535e+02 2.851e+02 3.318e+02 5.705e+02, threshold=5.702e+02, percent-clipped=0.0 2023-06-21 11:02:21,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=760494.0, ans=0.125 2023-06-21 11:02:36,633 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:03:03,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=760554.0, ans=0.035 2023-06-21 11:03:09,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=760554.0, ans=0.1 2023-06-21 11:03:37,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=760614.0, ans=0.2 2023-06-21 11:03:55,434 INFO [train.py:996] (2/4) Epoch 5, batch 4800, loss[loss=0.227, simple_loss=0.3203, pruned_loss=0.06685, over 21786.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3045, pruned_loss=0.08095, over 4287864.32 frames. ], batch size: 282, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:04:19,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=760734.0, ans=0.125 2023-06-21 11:04:33,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=760794.0, ans=0.1 2023-06-21 11:06:01,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=760914.0, ans=0.0 2023-06-21 11:06:05,296 INFO [train.py:996] (2/4) Epoch 5, batch 4850, loss[loss=0.2296, simple_loss=0.2981, pruned_loss=0.08057, over 21284.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3042, pruned_loss=0.08099, over 4289601.07 frames. 
], batch size: 159, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:06:26,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.507e+02 2.882e+02 3.548e+02 6.033e+02, threshold=5.763e+02, percent-clipped=2.0 2023-06-21 11:06:52,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=761094.0, ans=0.125 2023-06-21 11:07:07,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=761094.0, ans=0.1 2023-06-21 11:07:59,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=761214.0, ans=0.09899494936611666 2023-06-21 11:08:30,020 INFO [train.py:996] (2/4) Epoch 5, batch 4900, loss[loss=0.2937, simple_loss=0.3651, pruned_loss=0.1112, over 21481.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3053, pruned_loss=0.08088, over 4291259.88 frames. ], batch size: 471, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:08:36,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=761274.0, ans=0.0 2023-06-21 11:08:55,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=761334.0, ans=0.125 2023-06-21 11:08:58,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=761334.0, ans=0.04949747468305833 2023-06-21 11:10:19,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2023-06-21 11:10:22,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=761514.0, ans=0.0 2023-06-21 11:10:38,832 INFO [train.py:996] (2/4) Epoch 5, batch 4950, loss[loss=0.1961, simple_loss=0.2896, pruned_loss=0.05128, over 21724.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3088, pruned_loss=0.07938, over 4276948.05 frames. ], batch size: 351, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:10:42,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=761574.0, ans=0.125 2023-06-21 11:10:52,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.398e+02 2.770e+02 3.056e+02 4.888e+02, threshold=5.540e+02, percent-clipped=0.0 2023-06-21 11:12:19,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=761814.0, ans=0.125 2023-06-21 11:12:36,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=761814.0, ans=0.5 2023-06-21 11:12:40,749 INFO [train.py:996] (2/4) Epoch 5, batch 5000, loss[loss=0.2902, simple_loss=0.3388, pruned_loss=0.1208, over 21798.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3082, pruned_loss=0.07677, over 4278933.83 frames. 
], batch size: 508, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:12:49,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=761874.0, ans=0.125 2023-06-21 11:12:50,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=761874.0, ans=0.125 2023-06-21 11:13:55,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=762054.0, ans=0.0 2023-06-21 11:14:28,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=762114.0, ans=0.1 2023-06-21 11:14:46,857 INFO [train.py:996] (2/4) Epoch 5, batch 5050, loss[loss=0.2321, simple_loss=0.2972, pruned_loss=0.08351, over 21591.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3094, pruned_loss=0.07921, over 4287022.16 frames. ], batch size: 212, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:15:06,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.567e+02 3.027e+02 3.438e+02 5.567e+02, threshold=6.054e+02, percent-clipped=1.0 2023-06-21 11:15:09,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-06-21 11:15:50,218 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:16:58,423 INFO [train.py:996] (2/4) Epoch 5, batch 5100, loss[loss=0.2186, simple_loss=0.2884, pruned_loss=0.07436, over 21579.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3095, pruned_loss=0.08001, over 4287849.31 frames. ], batch size: 212, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:18:11,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=762654.0, ans=0.2 2023-06-21 11:19:20,185 INFO [train.py:996] (2/4) Epoch 5, batch 5150, loss[loss=0.2425, simple_loss=0.3035, pruned_loss=0.09077, over 21776.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3069, pruned_loss=0.08094, over 4290681.54 frames. ], batch size: 441, lr: 6.47e-03, grad_scale: 16.0 2023-06-21 11:19:32,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=762774.0, ans=0.125 2023-06-21 11:19:36,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.594e+02 2.911e+02 3.354e+02 4.463e+02, threshold=5.822e+02, percent-clipped=0.0 2023-06-21 11:19:50,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=30.09 vs. limit=15.0 2023-06-21 11:20:47,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=762954.0, ans=0.0 2023-06-21 11:20:49,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=762954.0, ans=0.0 2023-06-21 11:21:14,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=22.5 2023-06-21 11:21:39,098 INFO [train.py:996] (2/4) Epoch 5, batch 5200, loss[loss=0.2157, simple_loss=0.2842, pruned_loss=0.07355, over 21083.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3069, pruned_loss=0.08069, over 4287286.40 frames. 
], batch size: 607, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:22:22,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-21 11:22:44,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-21 11:23:23,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=763254.0, ans=0.5 2023-06-21 11:24:02,754 INFO [train.py:996] (2/4) Epoch 5, batch 5250, loss[loss=0.2373, simple_loss=0.3129, pruned_loss=0.08084, over 21649.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3099, pruned_loss=0.07913, over 4285739.27 frames. ], batch size: 263, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:24:08,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=12.0 2023-06-21 11:24:23,979 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 2.681e+02 2.958e+02 3.448e+02 5.597e+02, threshold=5.917e+02, percent-clipped=0.0 2023-06-21 11:24:30,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=763434.0, ans=0.0 2023-06-21 11:24:36,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=763434.0, ans=0.0 2023-06-21 11:25:07,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=763494.0, ans=0.125 2023-06-21 11:26:08,829 INFO [train.py:996] (2/4) Epoch 5, batch 5300, loss[loss=0.2253, simple_loss=0.2973, pruned_loss=0.0767, over 21522.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3092, pruned_loss=0.08034, over 4297592.40 frames. ], batch size: 548, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:26:18,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=763674.0, ans=0.2 2023-06-21 11:27:24,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=763794.0, ans=0.125 2023-06-21 11:27:54,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=763854.0, ans=0.1 2023-06-21 11:28:12,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=763914.0, ans=0.0 2023-06-21 11:28:24,468 INFO [train.py:996] (2/4) Epoch 5, batch 5350, loss[loss=0.2453, simple_loss=0.3158, pruned_loss=0.08737, over 21293.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3091, pruned_loss=0.08251, over 4302943.84 frames. ], batch size: 143, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:28:44,829 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.381e+02 2.626e+02 3.022e+02 5.115e+02, threshold=5.252e+02, percent-clipped=0.0 2023-06-21 11:28:53,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.17 vs. 
limit=5.0 2023-06-21 11:28:55,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=764034.0, ans=0.125 2023-06-21 11:29:06,859 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=8.0 2023-06-21 11:30:00,660 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:30:36,292 INFO [train.py:996] (2/4) Epoch 5, batch 5400, loss[loss=0.2271, simple_loss=0.29, pruned_loss=0.0821, over 21458.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3089, pruned_loss=0.0826, over 4302378.26 frames. ], batch size: 211, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:30:45,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-21 11:30:56,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=764334.0, ans=0.07 2023-06-21 11:33:01,011 INFO [train.py:996] (2/4) Epoch 5, batch 5450, loss[loss=0.2685, simple_loss=0.3713, pruned_loss=0.08284, over 21645.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3092, pruned_loss=0.08034, over 4305504.89 frames. ], batch size: 389, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:33:18,145 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.472e+02 2.911e+02 3.692e+02 6.272e+02, threshold=5.821e+02, percent-clipped=3.0 2023-06-21 11:33:40,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=764634.0, ans=0.125 2023-06-21 11:33:46,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=764694.0, ans=0.125 2023-06-21 11:34:39,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-21 11:35:01,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=764814.0, ans=0.1 2023-06-21 11:35:10,240 INFO [train.py:996] (2/4) Epoch 5, batch 5500, loss[loss=0.217, simple_loss=0.3117, pruned_loss=0.06115, over 21799.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3134, pruned_loss=0.07761, over 4305223.05 frames. ], batch size: 282, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:35:10,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=764874.0, ans=0.125 2023-06-21 11:35:14,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=764874.0, ans=0.04949747468305833 2023-06-21 11:35:41,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.13 vs. 
limit=15.0 2023-06-21 11:35:48,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=764934.0, ans=0.0 2023-06-21 11:35:48,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=764934.0, ans=0.07 2023-06-21 11:35:57,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=764934.0, ans=0.125 2023-06-21 11:36:19,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=764994.0, ans=0.0 2023-06-21 11:36:58,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=765054.0, ans=0.04949747468305833 2023-06-21 11:37:21,023 INFO [train.py:996] (2/4) Epoch 5, batch 5550, loss[loss=0.1759, simple_loss=0.2682, pruned_loss=0.0418, over 21664.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3113, pruned_loss=0.07426, over 4303643.92 frames. ], batch size: 247, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:38:08,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=765234.0, ans=0.125 2023-06-21 11:38:09,585 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.186e+02 2.465e+02 2.869e+02 4.676e+02, threshold=4.930e+02, percent-clipped=0.0 2023-06-21 11:38:16,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=765234.0, ans=0.0 2023-06-21 11:39:22,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=765354.0, ans=0.125 2023-06-21 11:39:30,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=765414.0, ans=0.125 2023-06-21 11:39:40,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=765414.0, ans=0.125 2023-06-21 11:39:43,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=765474.0, ans=0.0 2023-06-21 11:39:55,441 INFO [train.py:996] (2/4) Epoch 5, batch 5600, loss[loss=0.2261, simple_loss=0.2989, pruned_loss=0.07665, over 21479.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3097, pruned_loss=0.07213, over 4292516.62 frames. ], batch size: 548, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:40:06,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=765474.0, ans=0.125 2023-06-21 11:42:13,691 INFO [train.py:996] (2/4) Epoch 5, batch 5650, loss[loss=0.2343, simple_loss=0.3035, pruned_loss=0.08252, over 21932.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3151, pruned_loss=0.07517, over 4285104.27 frames. 
], batch size: 316, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:42:36,585 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.490e+02 2.930e+02 3.681e+02 6.971e+02, threshold=5.860e+02, percent-clipped=6.0 2023-06-21 11:43:02,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=765834.0, ans=0.125 2023-06-21 11:43:33,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=765954.0, ans=0.125 2023-06-21 11:44:28,801 INFO [train.py:996] (2/4) Epoch 5, batch 5700, loss[loss=0.209, simple_loss=0.2792, pruned_loss=0.06939, over 21219.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3122, pruned_loss=0.0759, over 4287842.41 frames. ], batch size: 608, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:44:58,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=766074.0, ans=0.04949747468305833 2023-06-21 11:45:20,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=766134.0, ans=0.1 2023-06-21 11:46:11,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-21 11:46:14,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=766254.0, ans=0.1 2023-06-21 11:46:27,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.10 vs. limit=15.0 2023-06-21 11:47:17,571 INFO [train.py:996] (2/4) Epoch 5, batch 5750, loss[loss=0.3388, simple_loss=0.4386, pruned_loss=0.1195, over 21185.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3096, pruned_loss=0.07319, over 4291468.84 frames. ], batch size: 548, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:47:37,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=766434.0, ans=0.1 2023-06-21 11:47:40,195 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.289e+02 2.668e+02 3.231e+02 5.394e+02, threshold=5.337e+02, percent-clipped=0.0 2023-06-21 11:47:46,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=766434.0, ans=0.2 2023-06-21 11:47:46,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=766434.0, ans=0.125 2023-06-21 11:48:55,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=766554.0, ans=15.0 2023-06-21 11:49:57,049 INFO [train.py:996] (2/4) Epoch 5, batch 5800, loss[loss=0.2567, simple_loss=0.3549, pruned_loss=0.07929, over 21642.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3072, pruned_loss=0.0715, over 4282243.00 frames. ], batch size: 389, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:50:03,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=766674.0, ans=0.0 2023-06-21 11:50:10,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.75 vs. 
limit=15.0 2023-06-21 11:50:20,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=766734.0, ans=0.0 2023-06-21 11:51:53,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=766854.0, ans=0.125 2023-06-21 11:51:54,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=766914.0, ans=0.035 2023-06-21 11:52:11,023 INFO [train.py:996] (2/4) Epoch 5, batch 5850, loss[loss=0.1759, simple_loss=0.2827, pruned_loss=0.03458, over 21768.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3041, pruned_loss=0.06723, over 4282207.84 frames. ], batch size: 282, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:52:16,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.51 vs. limit=22.5 2023-06-21 11:52:31,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 2.021e+02 2.544e+02 3.113e+02 4.412e+02, threshold=5.088e+02, percent-clipped=0.0 2023-06-21 11:53:02,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.21 vs. limit=22.5 2023-06-21 11:53:32,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=767094.0, ans=0.0 2023-06-21 11:53:32,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=767094.0, ans=0.125 2023-06-21 11:54:15,374 INFO [train.py:996] (2/4) Epoch 5, batch 5900, loss[loss=0.1168, simple_loss=0.1787, pruned_loss=0.02742, over 16622.00 frames. ], tot_loss[loss=0.212, simple_loss=0.298, pruned_loss=0.063, over 4277482.27 frames. ], batch size: 60, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:54:59,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=767334.0, ans=0.125 2023-06-21 11:56:31,148 INFO [train.py:996] (2/4) Epoch 5, batch 5950, loss[loss=0.2284, simple_loss=0.2735, pruned_loss=0.09166, over 20241.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2979, pruned_loss=0.06675, over 4280164.83 frames. ], batch size: 703, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:56:52,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=767634.0, ans=0.015 2023-06-21 11:56:54,941 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 2.360e+02 2.756e+02 3.348e+02 5.051e+02, threshold=5.512e+02, percent-clipped=0.0 2023-06-21 11:56:59,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=767634.0, ans=0.05 2023-06-21 11:57:47,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=767694.0, ans=0.125 2023-06-21 11:58:52,389 INFO [train.py:996] (2/4) Epoch 5, batch 6000, loss[loss=0.2058, simple_loss=0.2644, pruned_loss=0.07362, over 21629.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2948, pruned_loss=0.07033, over 4277468.79 frames. 
], batch size: 264, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:58:52,389 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 11:59:54,335 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.6611, 2.0686, 3.2963, 2.5359], device='cuda:2') 2023-06-21 11:59:55,605 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2623, simple_loss=0.3577, pruned_loss=0.08348, over 1796401.00 frames. 2023-06-21 11:59:55,605 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 12:00:11,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=767874.0, ans=0.125 2023-06-21 12:00:16,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-21 12:00:20,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=767934.0, ans=0.1 2023-06-21 12:00:23,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=767934.0, ans=0.025 2023-06-21 12:00:45,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=767994.0, ans=0.125 2023-06-21 12:01:12,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=768054.0, ans=0.1 2023-06-21 12:01:56,882 INFO [train.py:996] (2/4) Epoch 5, batch 6050, loss[loss=0.1751, simple_loss=0.2466, pruned_loss=0.05179, over 21478.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2895, pruned_loss=0.07151, over 4280257.09 frames. ], batch size: 212, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 12:02:09,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.20 vs. limit=15.0 2023-06-21 12:02:30,522 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.508e+02 2.739e+02 3.269e+02 4.730e+02, threshold=5.478e+02, percent-clipped=0.0 2023-06-21 12:02:43,297 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:02:49,533 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.50 vs. limit=15.0 2023-06-21 12:02:51,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-21 12:03:04,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-21 12:03:14,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. 
limit=15.0 2023-06-21 12:03:18,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=768354.0, ans=0.0 2023-06-21 12:03:38,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=768414.0, ans=0.0 2023-06-21 12:03:53,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=768414.0, ans=0.125 2023-06-21 12:04:00,824 INFO [train.py:996] (2/4) Epoch 5, batch 6100, loss[loss=0.2241, simple_loss=0.2959, pruned_loss=0.07611, over 21793.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2887, pruned_loss=0.07057, over 4264244.28 frames. ], batch size: 282, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 12:04:32,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=768534.0, ans=0.1 2023-06-21 12:05:06,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=768594.0, ans=0.0 2023-06-21 12:06:05,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=768714.0, ans=0.1 2023-06-21 12:06:07,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=768714.0, ans=0.0 2023-06-21 12:06:19,083 INFO [train.py:996] (2/4) Epoch 5, batch 6150, loss[loss=0.2255, simple_loss=0.2997, pruned_loss=0.07563, over 21730.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2932, pruned_loss=0.07347, over 4266504.62 frames. ], batch size: 282, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 12:06:23,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=768774.0, ans=0.1 2023-06-21 12:06:40,061 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.345e+02 2.949e+02 3.402e+02 5.805e+02, threshold=5.898e+02, percent-clipped=1.0 2023-06-21 12:06:43,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=768834.0, ans=0.0 2023-06-21 12:07:09,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=768894.0, ans=0.0 2023-06-21 12:07:27,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=768894.0, ans=0.125 2023-06-21 12:07:30,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=768894.0, ans=0.1 2023-06-21 12:07:55,175 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:08:18,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-21 12:08:22,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=769014.0, ans=0.125 2023-06-21 12:08:31,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=769014.0, ans=0.0 2023-06-21 12:08:37,415 INFO [train.py:996] (2/4) Epoch 5, batch 6200, loss[loss=0.2443, simple_loss=0.3131, pruned_loss=0.08772, over 21505.00 frames. 
], tot_loss[loss=0.222, simple_loss=0.2962, pruned_loss=0.07394, over 4268152.38 frames. ], batch size: 441, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:08:48,644 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-21 12:09:06,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-21 12:09:54,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-21 12:10:18,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=769254.0, ans=0.125 2023-06-21 12:10:33,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=769314.0, ans=15.0 2023-06-21 12:10:54,300 INFO [train.py:996] (2/4) Epoch 5, batch 6250, loss[loss=0.2224, simple_loss=0.3153, pruned_loss=0.06477, over 21392.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3018, pruned_loss=0.07382, over 4276801.88 frames. ], batch size: 194, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:11:21,594 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 2.368e+02 2.704e+02 3.313e+02 4.790e+02, threshold=5.409e+02, percent-clipped=0.0 2023-06-21 12:11:22,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=769434.0, ans=0.1 2023-06-21 12:11:23,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=769434.0, ans=0.125 2023-06-21 12:12:40,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=12.0 2023-06-21 12:13:07,525 INFO [train.py:996] (2/4) Epoch 5, batch 6300, loss[loss=0.2403, simple_loss=0.3068, pruned_loss=0.08693, over 21902.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3046, pruned_loss=0.07249, over 4282061.12 frames. ], batch size: 107, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:13:38,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0 2023-06-21 12:13:41,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-21 12:14:10,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-21 12:14:24,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=769854.0, ans=10.0 2023-06-21 12:15:21,328 INFO [train.py:996] (2/4) Epoch 5, batch 6350, loss[loss=0.2507, simple_loss=0.3049, pruned_loss=0.09827, over 21581.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3086, pruned_loss=0.07667, over 4284099.64 frames. ], batch size: 548, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:15:26,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.46 vs. 
limit=15.0 2023-06-21 12:15:50,448 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.600e+02 2.925e+02 3.648e+02 4.818e+02, threshold=5.851e+02, percent-clipped=0.0 2023-06-21 12:16:39,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=770094.0, ans=0.125 2023-06-21 12:17:48,483 INFO [train.py:996] (2/4) Epoch 5, batch 6400, loss[loss=0.2593, simple_loss=0.3337, pruned_loss=0.09246, over 21352.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3164, pruned_loss=0.08113, over 4280791.67 frames. ], batch size: 131, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:17:52,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=770274.0, ans=0.025 2023-06-21 12:18:02,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0 2023-06-21 12:19:03,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=770394.0, ans=0.0 2023-06-21 12:19:12,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=770454.0, ans=0.0 2023-06-21 12:20:01,645 INFO [train.py:996] (2/4) Epoch 5, batch 6450, loss[loss=0.2056, simple_loss=0.292, pruned_loss=0.05957, over 21602.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3193, pruned_loss=0.08139, over 4281354.89 frames. ], batch size: 230, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:20:19,530 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.431e+02 2.805e+02 3.198e+02 5.945e+02, threshold=5.611e+02, percent-clipped=1.0 2023-06-21 12:20:44,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-21 12:22:05,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=770814.0, ans=0.125 2023-06-21 12:22:15,288 INFO [train.py:996] (2/4) Epoch 5, batch 6500, loss[loss=0.2215, simple_loss=0.3108, pruned_loss=0.06609, over 21577.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3127, pruned_loss=0.08044, over 4277150.58 frames. ], batch size: 389, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:22:20,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=770874.0, ans=0.09899494936611666 2023-06-21 12:22:38,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-21 12:23:41,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=771054.0, ans=0.125 2023-06-21 12:23:44,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-21 12:24:41,656 INFO [train.py:996] (2/4) Epoch 5, batch 6550, loss[loss=0.2208, simple_loss=0.2881, pruned_loss=0.07672, over 21249.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3108, pruned_loss=0.07966, over 4279657.21 frames. 
], batch size: 176, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:24:58,472 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.666e+02 3.007e+02 3.674e+02 6.110e+02, threshold=6.015e+02, percent-clipped=2.0 2023-06-21 12:25:30,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=771294.0, ans=0.1 2023-06-21 12:25:52,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=771354.0, ans=0.125 2023-06-21 12:26:00,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=771354.0, ans=0.0 2023-06-21 12:26:03,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=771354.0, ans=0.5 2023-06-21 12:26:12,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=771354.0, ans=0.125 2023-06-21 12:26:19,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=771414.0, ans=0.0 2023-06-21 12:26:52,304 INFO [train.py:996] (2/4) Epoch 5, batch 6600, loss[loss=0.2706, simple_loss=0.3697, pruned_loss=0.08577, over 19905.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3053, pruned_loss=0.07908, over 4277261.41 frames. ], batch size: 703, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:26:52,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=771474.0, ans=0.0 2023-06-21 12:27:29,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=771534.0, ans=0.04949747468305833 2023-06-21 12:28:30,685 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:28:31,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-21 12:28:53,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-21 12:28:56,944 INFO [train.py:996] (2/4) Epoch 5, batch 6650, loss[loss=0.1914, simple_loss=0.2445, pruned_loss=0.06912, over 21176.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2973, pruned_loss=0.07622, over 4267534.87 frames. ], batch size: 548, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:29:40,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.317e+02 2.595e+02 2.999e+02 4.464e+02, threshold=5.189e+02, percent-clipped=0.0 2023-06-21 12:29:44,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=771834.0, ans=0.125 2023-06-21 12:31:08,784 INFO [train.py:996] (2/4) Epoch 5, batch 6700, loss[loss=0.1922, simple_loss=0.262, pruned_loss=0.06122, over 21499.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2916, pruned_loss=0.07496, over 4268469.70 frames. 
], batch size: 195, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:33:10,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=772314.0, ans=0.0 2023-06-21 12:33:12,785 INFO [train.py:996] (2/4) Epoch 5, batch 6750, loss[loss=0.2118, simple_loss=0.2774, pruned_loss=0.07315, over 21650.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2905, pruned_loss=0.07554, over 4262920.99 frames. ], batch size: 247, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:33:58,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.401e+02 2.813e+02 3.332e+02 5.748e+02, threshold=5.626e+02, percent-clipped=2.0 2023-06-21 12:34:04,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=772434.0, ans=0.2 2023-06-21 12:35:25,716 INFO [train.py:996] (2/4) Epoch 5, batch 6800, loss[loss=0.2338, simple_loss=0.2947, pruned_loss=0.08647, over 21410.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2932, pruned_loss=0.07815, over 4274156.68 frames. ], batch size: 548, lr: 6.43e-03, grad_scale: 32.0 2023-06-21 12:35:46,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.69 vs. limit=22.5 2023-06-21 12:36:41,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=772854.0, ans=0.2 2023-06-21 12:36:43,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=772854.0, ans=0.125 2023-06-21 12:36:59,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=772854.0, ans=0.125 2023-06-21 12:37:16,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=772914.0, ans=0.125 2023-06-21 12:37:19,276 INFO [train.py:996] (2/4) Epoch 5, batch 6850, loss[loss=0.2269, simple_loss=0.2869, pruned_loss=0.08345, over 21279.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2902, pruned_loss=0.07882, over 4274603.22 frames. ], batch size: 144, lr: 6.43e-03, grad_scale: 32.0 2023-06-21 12:37:42,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=772974.0, ans=0.125 2023-06-21 12:38:02,159 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.533e+02 2.911e+02 3.331e+02 6.135e+02, threshold=5.822e+02, percent-clipped=1.0 2023-06-21 12:38:12,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=773034.0, ans=0.1 2023-06-21 12:38:25,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=773094.0, ans=0.125 2023-06-21 12:38:31,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=773094.0, ans=0.1 2023-06-21 12:39:46,838 INFO [train.py:996] (2/4) Epoch 5, batch 6900, loss[loss=0.3075, simple_loss=0.4244, pruned_loss=0.09527, over 19810.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2939, pruned_loss=0.0788, over 4277243.67 frames. ], batch size: 702, lr: 6.43e-03, grad_scale: 32.0 2023-06-21 12:40:01,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.00 vs. 
limit=10.0 2023-06-21 12:40:14,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-21 12:40:23,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=773334.0, ans=0.0 2023-06-21 12:41:11,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=773394.0, ans=0.1 2023-06-21 12:41:48,037 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:41:49,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=773514.0, ans=0.125 2023-06-21 12:42:07,215 INFO [train.py:996] (2/4) Epoch 5, batch 6950, loss[loss=0.2583, simple_loss=0.3283, pruned_loss=0.09411, over 21867.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2945, pruned_loss=0.07561, over 4285042.89 frames. ], batch size: 371, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:42:45,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.439e+02 2.793e+02 3.180e+02 5.285e+02, threshold=5.586e+02, percent-clipped=0.0 2023-06-21 12:44:15,867 INFO [train.py:996] (2/4) Epoch 5, batch 7000, loss[loss=0.2455, simple_loss=0.2949, pruned_loss=0.09804, over 21518.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2977, pruned_loss=0.07812, over 4283908.97 frames. ], batch size: 441, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:45:01,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=773934.0, ans=0.125 2023-06-21 12:46:18,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.98 vs. limit=15.0 2023-06-21 12:46:29,994 INFO [train.py:996] (2/4) Epoch 5, batch 7050, loss[loss=0.2106, simple_loss=0.286, pruned_loss=0.06762, over 21482.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2956, pruned_loss=0.07645, over 4280370.56 frames. ], batch size: 389, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:46:47,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=774174.0, ans=0.025 2023-06-21 12:47:07,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.342e+02 2.952e+02 3.730e+02 6.285e+02, threshold=5.903e+02, percent-clipped=1.0 2023-06-21 12:47:42,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=774294.0, ans=0.0 2023-06-21 12:48:19,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=774354.0, ans=0.125 2023-06-21 12:48:31,639 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:48:43,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.45 vs. 
limit=22.5 2023-06-21 12:48:47,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=774414.0, ans=0.125 2023-06-21 12:48:50,490 INFO [train.py:996] (2/4) Epoch 5, batch 7100, loss[loss=0.2849, simple_loss=0.35, pruned_loss=0.11, over 21291.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2995, pruned_loss=0.07751, over 4280412.34 frames. ], batch size: 143, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:48:54,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=774474.0, ans=0.125 2023-06-21 12:50:19,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=774654.0, ans=0.125 2023-06-21 12:50:55,944 INFO [train.py:996] (2/4) Epoch 5, batch 7150, loss[loss=0.2228, simple_loss=0.3036, pruned_loss=0.07101, over 21748.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2964, pruned_loss=0.07534, over 4268457.43 frames. ], batch size: 298, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:51:34,350 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.394e+02 2.698e+02 3.391e+02 6.183e+02, threshold=5.396e+02, percent-clipped=2.0 2023-06-21 12:51:52,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=774834.0, ans=0.035 2023-06-21 12:52:13,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=774894.0, ans=0.2 2023-06-21 12:52:16,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-21 12:52:17,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=774954.0, ans=0.125 2023-06-21 12:53:11,025 INFO [train.py:996] (2/4) Epoch 5, batch 7200, loss[loss=0.2173, simple_loss=0.3203, pruned_loss=0.05715, over 20742.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3004, pruned_loss=0.07822, over 4268933.74 frames. ], batch size: 607, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 12:53:35,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-21 12:54:21,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=775194.0, ans=0.0 2023-06-21 12:54:45,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=775254.0, ans=0.1 2023-06-21 12:54:46,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=775254.0, ans=0.2 2023-06-21 12:54:57,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-21 12:55:19,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=775374.0, ans=0.125 2023-06-21 12:55:20,756 INFO [train.py:996] (2/4) Epoch 5, batch 7250, loss[loss=0.2242, simple_loss=0.2783, pruned_loss=0.08504, over 21890.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2961, pruned_loss=0.07839, over 4271896.13 frames. 
], batch size: 373, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 12:55:57,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=775434.0, ans=0.1 2023-06-21 12:56:02,053 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.505e+02 2.857e+02 3.370e+02 5.311e+02, threshold=5.714e+02, percent-clipped=0.0 2023-06-21 12:56:43,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-21 12:57:36,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=12.0 2023-06-21 12:57:38,533 INFO [train.py:996] (2/4) Epoch 5, batch 7300, loss[loss=0.194, simple_loss=0.265, pruned_loss=0.06154, over 21900.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2905, pruned_loss=0.0766, over 4259991.51 frames. ], batch size: 125, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 12:57:38,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=775674.0, ans=0.2 2023-06-21 12:57:38,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=775674.0, ans=0.0 2023-06-21 12:59:18,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=775914.0, ans=0.2 2023-06-21 12:59:43,013 INFO [train.py:996] (2/4) Epoch 5, batch 7350, loss[loss=0.2752, simple_loss=0.3453, pruned_loss=0.1025, over 21485.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2885, pruned_loss=0.07784, over 4262647.78 frames. ], batch size: 131, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 13:00:25,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.511e+02 2.899e+02 3.601e+02 8.387e+02, threshold=5.798e+02, percent-clipped=4.0 2023-06-21 13:00:57,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=776094.0, ans=0.125 2023-06-21 13:00:58,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=776154.0, ans=0.125 2023-06-21 13:01:56,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=776214.0, ans=0.125 2023-06-21 13:01:59,271 INFO [train.py:996] (2/4) Epoch 5, batch 7400, loss[loss=0.2824, simple_loss=0.3453, pruned_loss=0.1098, over 21764.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2944, pruned_loss=0.08045, over 4261640.44 frames. ], batch size: 441, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:02:39,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=776334.0, ans=0.07 2023-06-21 13:02:39,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-21 13:03:42,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=776454.0, ans=0.2 2023-06-21 13:04:15,307 INFO [train.py:996] (2/4) Epoch 5, batch 7450, loss[loss=0.2435, simple_loss=0.3358, pruned_loss=0.07561, over 21557.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2937, pruned_loss=0.07921, over 4263153.24 frames. 
], batch size: 441, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:04:26,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-21 13:04:40,925 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.565e+02 3.143e+02 4.189e+02 6.806e+02, threshold=6.287e+02, percent-clipped=2.0 2023-06-21 13:06:30,092 INFO [train.py:996] (2/4) Epoch 5, batch 7500, loss[loss=0.228, simple_loss=0.3157, pruned_loss=0.07015, over 21235.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2967, pruned_loss=0.07992, over 4269528.56 frames. ], batch size: 159, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:06:30,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=776874.0, ans=0.125 2023-06-21 13:06:44,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=776934.0, ans=0.125 2023-06-21 13:08:47,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-21 13:08:48,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=777174.0, ans=0.125 2023-06-21 13:08:49,589 INFO [train.py:996] (2/4) Epoch 5, batch 7550, loss[loss=0.1975, simple_loss=0.2901, pruned_loss=0.05243, over 21651.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3053, pruned_loss=0.07925, over 4260105.26 frames. ], batch size: 263, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:09:33,355 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.666e+02 3.013e+02 3.483e+02 5.379e+02, threshold=6.026e+02, percent-clipped=0.0 2023-06-21 13:09:43,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-21 13:10:08,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=777294.0, ans=0.125 2023-06-21 13:10:33,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777354.0, ans=0.1 2023-06-21 13:10:38,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-21 13:10:58,793 INFO [train.py:996] (2/4) Epoch 5, batch 7600, loss[loss=0.2244, simple_loss=0.2972, pruned_loss=0.07577, over 21895.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3045, pruned_loss=0.07757, over 4270214.43 frames. ], batch size: 332, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:11:15,994 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:11:40,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=777534.0, ans=0.125 2023-06-21 13:12:08,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=777594.0, ans=0.2 2023-06-21 13:12:08,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. 
limit=15.0 2023-06-21 13:12:37,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=777654.0, ans=0.125 2023-06-21 13:12:46,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=777654.0, ans=0.125 2023-06-21 13:13:16,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=777714.0, ans=0.125 2023-06-21 13:13:19,300 INFO [train.py:996] (2/4) Epoch 5, batch 7650, loss[loss=0.2478, simple_loss=0.3202, pruned_loss=0.08772, over 21917.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3037, pruned_loss=0.07989, over 4280160.77 frames. ], batch size: 107, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:13:19,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=777774.0, ans=0.125 2023-06-21 13:13:46,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=22.5 2023-06-21 13:13:56,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=777834.0, ans=0.2 2023-06-21 13:14:08,481 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.535e+02 2.921e+02 3.505e+02 6.410e+02, threshold=5.842e+02, percent-clipped=1.0 2023-06-21 13:14:10,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=777834.0, ans=0.125 2023-06-21 13:14:36,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=777894.0, ans=0.0 2023-06-21 13:15:17,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=777954.0, ans=0.0 2023-06-21 13:15:45,153 INFO [train.py:996] (2/4) Epoch 5, batch 7700, loss[loss=0.274, simple_loss=0.3427, pruned_loss=0.1026, over 21431.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3072, pruned_loss=0.08296, over 4285364.72 frames. ], batch size: 131, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:17:26,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=778314.0, ans=0.125 2023-06-21 13:17:32,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=778314.0, ans=0.0 2023-06-21 13:17:55,140 INFO [train.py:996] (2/4) Epoch 5, batch 7750, loss[loss=0.2373, simple_loss=0.3276, pruned_loss=0.0735, over 21289.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3146, pruned_loss=0.08338, over 4285024.00 frames. ], batch size: 176, lr: 6.41e-03, grad_scale: 16.0 2023-06-21 13:18:39,931 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. 
limit=15.0 2023-06-21 13:18:43,588 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.478e+02 2.745e+02 3.092e+02 4.488e+02, threshold=5.489e+02, percent-clipped=0.0 2023-06-21 13:18:45,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=778434.0, ans=0.0 2023-06-21 13:18:46,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-21 13:18:57,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=778494.0, ans=0.1 2023-06-21 13:19:17,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.21 vs. limit=22.5 2023-06-21 13:19:39,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=778554.0, ans=10.0 2023-06-21 13:20:24,294 INFO [train.py:996] (2/4) Epoch 5, batch 7800, loss[loss=0.1941, simple_loss=0.2667, pruned_loss=0.06076, over 21408.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3149, pruned_loss=0.0829, over 4277685.92 frames. ], batch size: 194, lr: 6.40e-03, grad_scale: 16.0 2023-06-21 13:21:56,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-21 13:22:25,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=778914.0, ans=0.2 2023-06-21 13:22:29,698 INFO [train.py:996] (2/4) Epoch 5, batch 7850, loss[loss=0.2509, simple_loss=0.2921, pruned_loss=0.1049, over 21345.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3089, pruned_loss=0.08232, over 4266927.18 frames. ], batch size: 473, lr: 6.40e-03, grad_scale: 16.0 2023-06-21 13:22:31,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=778974.0, ans=0.1 2023-06-21 13:22:55,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=779034.0, ans=0.125 2023-06-21 13:23:06,459 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.498e+02 2.821e+02 3.538e+02 6.560e+02, threshold=5.643e+02, percent-clipped=3.0 2023-06-21 13:23:08,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=779034.0, ans=0.2 2023-06-21 13:23:08,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=779034.0, ans=0.07 2023-06-21 13:24:18,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.16 vs. limit=15.0 2023-06-21 13:24:41,376 INFO [train.py:996] (2/4) Epoch 5, batch 7900, loss[loss=0.2049, simple_loss=0.2696, pruned_loss=0.0701, over 21140.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3043, pruned_loss=0.08119, over 4258065.60 frames. 
], batch size: 143, lr: 6.40e-03, grad_scale: 16.0 2023-06-21 13:26:07,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=779454.0, ans=0.125 2023-06-21 13:26:44,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=779514.0, ans=0.125 2023-06-21 13:27:07,065 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.30 vs. limit=22.5 2023-06-21 13:27:14,828 INFO [train.py:996] (2/4) Epoch 5, batch 7950, loss[loss=0.2348, simple_loss=0.3386, pruned_loss=0.06545, over 21337.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3077, pruned_loss=0.08023, over 4257384.56 frames. ], batch size: 548, lr: 6.40e-03, grad_scale: 16.0 2023-06-21 13:27:37,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-21 13:27:42,670 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.594e+02 2.803e+02 3.768e+02 5.185e+02, threshold=5.606e+02, percent-clipped=0.0 2023-06-21 13:27:43,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=779634.0, ans=0.125 2023-06-21 13:27:43,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=779634.0, ans=0.125 2023-06-21 13:27:44,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=779634.0, ans=0.125 2023-06-21 13:28:10,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-21 13:29:18,079 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:29:26,312 INFO [train.py:996] (2/4) Epoch 5, batch 8000, loss[loss=0.2189, simple_loss=0.2863, pruned_loss=0.07578, over 16810.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3091, pruned_loss=0.0822, over 4259453.70 frames. ], batch size: 61, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:30:44,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-21 13:32:03,762 INFO [train.py:996] (2/4) Epoch 5, batch 8050, loss[loss=0.2389, simple_loss=0.3166, pruned_loss=0.08062, over 21863.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3124, pruned_loss=0.08261, over 4259710.81 frames. 
], batch size: 317, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:32:24,839 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.993e+02 3.544e+02 4.620e+02 9.797e+02, threshold=7.088e+02, percent-clipped=13.0 2023-06-21 13:32:47,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=780234.0, ans=0.125 2023-06-21 13:33:45,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=780354.0, ans=0.2 2023-06-21 13:34:06,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=780414.0, ans=0.1 2023-06-21 13:34:12,516 INFO [train.py:996] (2/4) Epoch 5, batch 8100, loss[loss=0.2491, simple_loss=0.3161, pruned_loss=0.09099, over 21721.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.312, pruned_loss=0.08348, over 4261120.86 frames. ], batch size: 389, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:34:33,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=15.0 2023-06-21 13:35:47,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=780594.0, ans=0.125 2023-06-21 13:35:48,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=780594.0, ans=0.0 2023-06-21 13:35:54,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=780654.0, ans=0.1 2023-06-21 13:36:19,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=780654.0, ans=0.09899494936611666 2023-06-21 13:36:19,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=780654.0, ans=0.2 2023-06-21 13:36:43,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=780714.0, ans=0.0 2023-06-21 13:36:53,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-21 13:36:57,680 INFO [train.py:996] (2/4) Epoch 5, batch 8150, loss[loss=0.2153, simple_loss=0.2938, pruned_loss=0.06838, over 21460.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3204, pruned_loss=0.08456, over 4265316.22 frames. ], batch size: 212, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:37:14,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. 
limit=15.0 2023-06-21 13:37:21,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=780774.0, ans=0.125 2023-06-21 13:37:42,124 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.573e+02 2.921e+02 3.508e+02 5.879e+02, threshold=5.842e+02, percent-clipped=0.0 2023-06-21 13:37:57,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=780894.0, ans=0.1 2023-06-21 13:38:19,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=780954.0, ans=0.125 2023-06-21 13:38:38,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=780954.0, ans=0.125 2023-06-21 13:38:57,005 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=22.5 2023-06-21 13:38:59,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=781014.0, ans=0.125 2023-06-21 13:39:08,747 INFO [train.py:996] (2/4) Epoch 5, batch 8200, loss[loss=0.2715, simple_loss=0.3044, pruned_loss=0.1193, over 21491.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3134, pruned_loss=0.08236, over 4251778.38 frames. ], batch size: 511, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:40:23,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-06-21 13:41:33,490 INFO [train.py:996] (2/4) Epoch 5, batch 8250, loss[loss=0.2095, simple_loss=0.3014, pruned_loss=0.05878, over 21780.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3135, pruned_loss=0.08265, over 4263725.97 frames. ], batch size: 282, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:41:35,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=781374.0, ans=0.0 2023-06-21 13:41:42,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-21 13:41:52,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=781434.0, ans=0.125 2023-06-21 13:42:06,530 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.500e+02 2.988e+02 3.537e+02 7.334e+02, threshold=5.975e+02, percent-clipped=1.0 2023-06-21 13:42:38,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=781494.0, ans=0.1 2023-06-21 13:42:54,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=781554.0, ans=0.125 2023-06-21 13:43:42,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=781614.0, ans=0.125 2023-06-21 13:43:45,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=781674.0, ans=0.05 2023-06-21 13:43:46,534 INFO [train.py:996] (2/4) Epoch 5, batch 8300, loss[loss=0.2274, simple_loss=0.3122, pruned_loss=0.07131, over 21722.00 frames. 
], tot_loss[loss=0.2352, simple_loss=0.3112, pruned_loss=0.07957, over 4271603.44 frames. ], batch size: 298, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:43:50,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0 2023-06-21 13:44:20,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-21 13:45:04,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-21 13:45:05,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=781854.0, ans=0.125 2023-06-21 13:46:00,685 INFO [train.py:996] (2/4) Epoch 5, batch 8350, loss[loss=0.2268, simple_loss=0.3154, pruned_loss=0.06908, over 20764.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3089, pruned_loss=0.07749, over 4273548.33 frames. ], batch size: 607, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:46:45,201 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.563e+02 2.962e+02 3.725e+02 6.327e+02, threshold=5.925e+02, percent-clipped=2.0 2023-06-21 13:46:53,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=782094.0, ans=0.125 2023-06-21 13:47:04,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=782094.0, ans=0.125 2023-06-21 13:47:27,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-21 13:47:45,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-21 13:47:56,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=782214.0, ans=0.125 2023-06-21 13:48:19,766 INFO [train.py:996] (2/4) Epoch 5, batch 8400, loss[loss=0.1791, simple_loss=0.258, pruned_loss=0.05009, over 21245.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3069, pruned_loss=0.07551, over 4275349.75 frames. ], batch size: 176, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:48:36,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=782274.0, ans=0.1 2023-06-21 13:49:20,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=782394.0, ans=0.125 2023-06-21 13:49:21,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=782394.0, ans=0.125 2023-06-21 13:49:23,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. 
limit=15.0 2023-06-21 13:49:38,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=782454.0, ans=0.125 2023-06-21 13:49:41,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=782454.0, ans=0.125 2023-06-21 13:50:33,090 INFO [train.py:996] (2/4) Epoch 5, batch 8450, loss[loss=0.2242, simple_loss=0.2942, pruned_loss=0.07715, over 21818.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3046, pruned_loss=0.07461, over 4284415.69 frames. ], batch size: 333, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:51:01,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-21 13:51:01,846 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.311e+02 2.681e+02 3.207e+02 6.839e+02, threshold=5.362e+02, percent-clipped=1.0 2023-06-21 13:51:33,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=782694.0, ans=0.125 2023-06-21 13:52:31,598 INFO [train.py:996] (2/4) Epoch 5, batch 8500, loss[loss=0.2041, simple_loss=0.2769, pruned_loss=0.06565, over 21977.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3023, pruned_loss=0.07591, over 4266451.27 frames. ], batch size: 103, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:52:40,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-21 13:52:57,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=782934.0, ans=0.125 2023-06-21 13:54:11,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=783054.0, ans=0.0 2023-06-21 13:54:41,556 INFO [train.py:996] (2/4) Epoch 5, batch 8550, loss[loss=0.2257, simple_loss=0.3102, pruned_loss=0.07064, over 21418.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.305, pruned_loss=0.07801, over 4269097.88 frames. ], batch size: 194, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:55:30,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=783234.0, ans=0.125 2023-06-21 13:55:31,905 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.687e+02 3.119e+02 3.831e+02 5.921e+02, threshold=6.237e+02, percent-clipped=6.0 2023-06-21 13:55:33,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.23 vs. limit=6.0 2023-06-21 13:56:51,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=783414.0, ans=0.05 2023-06-21 13:57:03,916 INFO [train.py:996] (2/4) Epoch 5, batch 8600, loss[loss=0.2737, simple_loss=0.344, pruned_loss=0.1017, over 21755.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3128, pruned_loss=0.08002, over 4272381.65 frames. 
], batch size: 332, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:57:47,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=783534.0, ans=0.0 2023-06-21 13:58:12,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=783594.0, ans=0.125 2023-06-21 13:59:26,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=783714.0, ans=0.125 2023-06-21 13:59:28,622 INFO [train.py:996] (2/4) Epoch 5, batch 8650, loss[loss=0.177, simple_loss=0.2808, pruned_loss=0.03664, over 21772.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.319, pruned_loss=0.08068, over 4271605.37 frames. ], batch size: 351, lr: 6.38e-03, grad_scale: 16.0 2023-06-21 13:59:28,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=783774.0, ans=0.0 2023-06-21 13:59:56,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-21 14:00:03,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.535e+02 2.923e+02 3.243e+02 4.542e+02, threshold=5.846e+02, percent-clipped=0.0 2023-06-21 14:00:05,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=783834.0, ans=0.2 2023-06-21 14:01:27,569 INFO [train.py:996] (2/4) Epoch 5, batch 8700, loss[loss=0.2125, simple_loss=0.2742, pruned_loss=0.07538, over 21675.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3123, pruned_loss=0.07764, over 4264605.29 frames. ], batch size: 333, lr: 6.38e-03, grad_scale: 16.0 2023-06-21 14:01:29,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=784074.0, ans=0.125 2023-06-21 14:01:45,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=784074.0, ans=10.0 2023-06-21 14:03:27,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=784314.0, ans=0.125 2023-06-21 14:03:36,319 INFO [train.py:996] (2/4) Epoch 5, batch 8750, loss[loss=0.2322, simple_loss=0.3019, pruned_loss=0.08123, over 21923.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3075, pruned_loss=0.07804, over 4269530.20 frames. ], batch size: 316, lr: 6.38e-03, grad_scale: 16.0 2023-06-21 14:03:43,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=784374.0, ans=0.125 2023-06-21 14:03:58,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=784434.0, ans=0.0 2023-06-21 14:04:08,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.564e+02 3.026e+02 3.702e+02 5.969e+02, threshold=6.051e+02, percent-clipped=2.0 2023-06-21 14:05:37,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=784614.0, ans=0.025 2023-06-21 14:06:02,645 INFO [train.py:996] (2/4) Epoch 5, batch 8800, loss[loss=0.2614, simple_loss=0.3201, pruned_loss=0.1014, over 19996.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3154, pruned_loss=0.08125, over 4270358.16 frames. 
], batch size: 702, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:06:07,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=784674.0, ans=0.0 2023-06-21 14:07:13,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=784794.0, ans=0.125 2023-06-21 14:07:46,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=784854.0, ans=0.125 2023-06-21 14:08:15,089 INFO [train.py:996] (2/4) Epoch 5, batch 8850, loss[loss=0.2369, simple_loss=0.3333, pruned_loss=0.07019, over 21582.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3227, pruned_loss=0.08275, over 4274032.66 frames. ], batch size: 389, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:08:56,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.643e+02 2.984e+02 3.583e+02 6.158e+02, threshold=5.968e+02, percent-clipped=1.0 2023-06-21 14:09:13,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=785034.0, ans=0.125 2023-06-21 14:09:19,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=785094.0, ans=0.1 2023-06-21 14:09:19,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0 2023-06-21 14:10:11,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=785214.0, ans=0.0 2023-06-21 14:10:42,213 INFO [train.py:996] (2/4) Epoch 5, batch 8900, loss[loss=0.2131, simple_loss=0.29, pruned_loss=0.06806, over 21586.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3195, pruned_loss=0.08226, over 4275071.10 frames. ], batch size: 263, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:10:56,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=785274.0, ans=0.0 2023-06-21 14:11:21,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=785334.0, ans=0.1 2023-06-21 14:12:43,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.37 vs. limit=22.5 2023-06-21 14:12:58,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=785574.0, ans=0.1 2023-06-21 14:13:06,497 INFO [train.py:996] (2/4) Epoch 5, batch 8950, loss[loss=0.2423, simple_loss=0.3195, pruned_loss=0.08261, over 21552.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3166, pruned_loss=0.08102, over 4271473.12 frames. 
], batch size: 389, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:13:29,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=785574.0, ans=0.0 2023-06-21 14:13:33,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=785574.0, ans=0.1 2023-06-21 14:13:50,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=785634.0, ans=0.125 2023-06-21 14:13:51,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=785634.0, ans=0.125 2023-06-21 14:13:52,402 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.821e+02 3.330e+02 4.267e+02 7.491e+02, threshold=6.660e+02, percent-clipped=6.0 2023-06-21 14:14:03,484 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:15:10,349 INFO [train.py:996] (2/4) Epoch 5, batch 9000, loss[loss=0.2117, simple_loss=0.2856, pruned_loss=0.06891, over 21564.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3114, pruned_loss=0.08074, over 4272459.70 frames. ], batch size: 263, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:15:10,350 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 14:16:12,925 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2688, simple_loss=0.3596, pruned_loss=0.08904, over 1796401.00 frames. 2023-06-21 14:16:12,930 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 14:16:21,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=785874.0, ans=0.125 2023-06-21 14:16:54,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=785934.0, ans=0.125 2023-06-21 14:17:15,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=785994.0, ans=0.0 2023-06-21 14:17:21,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=786054.0, ans=0.125 2023-06-21 14:18:19,076 INFO [train.py:996] (2/4) Epoch 5, batch 9050, loss[loss=0.2132, simple_loss=0.3023, pruned_loss=0.06202, over 20807.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3078, pruned_loss=0.07822, over 4275390.09 frames. 
], batch size: 607, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:18:49,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=786234.0, ans=0.125 2023-06-21 14:19:10,717 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.633e+02 2.951e+02 3.520e+02 5.679e+02, threshold=5.902e+02, percent-clipped=0.0 2023-06-21 14:19:14,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=786234.0, ans=0.0 2023-06-21 14:19:28,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=786294.0, ans=0.125 2023-06-21 14:20:40,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=786414.0, ans=0.125 2023-06-21 14:20:50,350 INFO [train.py:996] (2/4) Epoch 5, batch 9100, loss[loss=0.2334, simple_loss=0.3253, pruned_loss=0.07071, over 21792.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3132, pruned_loss=0.08056, over 4275673.49 frames. ], batch size: 316, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:21:04,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=786534.0, ans=0.125 2023-06-21 14:21:41,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=786594.0, ans=0.2 2023-06-21 14:22:11,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=786654.0, ans=0.125 2023-06-21 14:23:17,967 INFO [train.py:996] (2/4) Epoch 5, batch 9150, loss[loss=0.2171, simple_loss=0.3047, pruned_loss=0.06481, over 21714.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.314, pruned_loss=0.07785, over 4272328.39 frames. ], batch size: 247, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:23:21,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=786774.0, ans=0.1 2023-06-21 14:23:29,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=786774.0, ans=0.0 2023-06-21 14:23:53,855 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.515e+02 3.021e+02 3.547e+02 6.572e+02, threshold=6.043e+02, percent-clipped=1.0 2023-06-21 14:25:26,980 INFO [train.py:996] (2/4) Epoch 5, batch 9200, loss[loss=0.2732, simple_loss=0.3785, pruned_loss=0.08397, over 20804.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3162, pruned_loss=0.07732, over 4271180.81 frames. ], batch size: 608, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:26:57,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-21 14:27:21,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=787254.0, ans=0.125 2023-06-21 14:27:45,227 INFO [train.py:996] (2/4) Epoch 5, batch 9250, loss[loss=0.2285, simple_loss=0.2878, pruned_loss=0.08466, over 21827.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3175, pruned_loss=0.08019, over 4276853.55 frames. 
], batch size: 372, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:28:01,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.74 vs. limit=22.5 2023-06-21 14:28:16,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2023-06-21 14:28:28,357 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 2.551e+02 3.044e+02 3.631e+02 5.502e+02, threshold=6.089e+02, percent-clipped=0.0 2023-06-21 14:30:00,551 INFO [train.py:996] (2/4) Epoch 5, batch 9300, loss[loss=0.265, simple_loss=0.3654, pruned_loss=0.08232, over 21250.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3132, pruned_loss=0.07985, over 4269791.29 frames. ], batch size: 549, lr: 6.37e-03, grad_scale: 16.0 2023-06-21 14:32:02,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=787854.0, ans=0.2 2023-06-21 14:32:22,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=787914.0, ans=0.125 2023-06-21 14:32:27,975 INFO [train.py:996] (2/4) Epoch 5, batch 9350, loss[loss=0.2587, simple_loss=0.3409, pruned_loss=0.08829, over 21437.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3164, pruned_loss=0.08062, over 4266779.89 frames. ], batch size: 131, lr: 6.37e-03, grad_scale: 16.0 2023-06-21 14:32:29,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-21 14:33:03,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=788034.0, ans=0.125 2023-06-21 14:33:20,697 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.848e+02 3.302e+02 4.167e+02 7.769e+02, threshold=6.603e+02, percent-clipped=5.0 2023-06-21 14:33:58,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=22.5 2023-06-21 14:34:04,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=788154.0, ans=0.0 2023-06-21 14:34:45,152 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:34:49,348 INFO [train.py:996] (2/4) Epoch 5, batch 9400, loss[loss=0.2045, simple_loss=0.274, pruned_loss=0.0675, over 21602.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3183, pruned_loss=0.08149, over 4262701.37 frames. 
], batch size: 247, lr: 6.37e-03, grad_scale: 16.0 2023-06-21 14:34:49,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=788274.0, ans=0.1 2023-06-21 14:35:08,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=788274.0, ans=0.2 2023-06-21 14:35:22,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=788334.0, ans=0.125 2023-06-21 14:36:16,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=788454.0, ans=0.2 2023-06-21 14:36:18,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=788454.0, ans=0.125 2023-06-21 14:36:50,907 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-21 14:37:20,231 INFO [train.py:996] (2/4) Epoch 5, batch 9450, loss[loss=0.2449, simple_loss=0.341, pruned_loss=0.07443, over 20721.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3119, pruned_loss=0.08032, over 4267508.48 frames. ], batch size: 607, lr: 6.36e-03, grad_scale: 16.0 2023-06-21 14:37:35,595 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:37:45,132 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.567e+02 2.945e+02 3.778e+02 6.288e+02, threshold=5.890e+02, percent-clipped=0.0 2023-06-21 14:38:34,907 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-21 14:39:04,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=788814.0, ans=0.125 2023-06-21 14:39:23,917 INFO [train.py:996] (2/4) Epoch 5, batch 9500, loss[loss=0.1934, simple_loss=0.2737, pruned_loss=0.05656, over 21711.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3044, pruned_loss=0.07869, over 4259359.54 frames. ], batch size: 298, lr: 6.36e-03, grad_scale: 16.0 2023-06-21 14:39:24,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-21 14:39:49,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=788934.0, ans=0.125 2023-06-21 14:39:50,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=788934.0, ans=0.125 2023-06-21 14:40:00,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=788934.0, ans=0.125 2023-06-21 14:40:22,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.23 vs. 
limit=15.0 2023-06-21 14:40:59,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=789054.0, ans=0.0 2023-06-21 14:41:16,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=789114.0, ans=0.1 2023-06-21 14:41:24,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=789114.0, ans=0.0 2023-06-21 14:41:45,129 INFO [train.py:996] (2/4) Epoch 5, batch 9550, loss[loss=0.2719, simple_loss=0.3457, pruned_loss=0.09903, over 21763.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3088, pruned_loss=0.08129, over 4267532.71 frames. ], batch size: 124, lr: 6.36e-03, grad_scale: 16.0 2023-06-21 14:42:09,711 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 2.687e+02 3.290e+02 3.942e+02 9.010e+02, threshold=6.580e+02, percent-clipped=4.0 2023-06-21 14:44:01,407 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:44:03,881 INFO [train.py:996] (2/4) Epoch 5, batch 9600, loss[loss=0.1973, simple_loss=0.2751, pruned_loss=0.05969, over 21432.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3105, pruned_loss=0.08191, over 4278280.07 frames. ], batch size: 194, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:44:14,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=789474.0, ans=0.035 2023-06-21 14:46:21,980 INFO [train.py:996] (2/4) Epoch 5, batch 9650, loss[loss=0.2764, simple_loss=0.3563, pruned_loss=0.0983, over 21416.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3121, pruned_loss=0.08248, over 4274872.28 frames. ], batch size: 131, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:46:56,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-21 14:46:57,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-06-21 14:47:15,181 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.578e+02 2.899e+02 3.353e+02 5.343e+02, threshold=5.797e+02, percent-clipped=0.0 2023-06-21 14:47:17,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=789834.0, ans=0.125 2023-06-21 14:47:31,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=789894.0, ans=0.125 2023-06-21 14:48:34,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=790014.0, ans=0.0 2023-06-21 14:48:44,245 INFO [train.py:996] (2/4) Epoch 5, batch 9700, loss[loss=0.2418, simple_loss=0.3303, pruned_loss=0.07669, over 20784.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3148, pruned_loss=0.08295, over 4275743.53 frames. 
], batch size: 609, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:49:09,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=790074.0, ans=0.125 2023-06-21 14:49:30,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=790134.0, ans=0.2 2023-06-21 14:49:30,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=790134.0, ans=0.2 2023-06-21 14:50:34,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=790314.0, ans=0.0 2023-06-21 14:50:37,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=790314.0, ans=0.035 2023-06-21 14:50:41,659 INFO [train.py:996] (2/4) Epoch 5, batch 9750, loss[loss=0.1847, simple_loss=0.2519, pruned_loss=0.05873, over 21540.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3083, pruned_loss=0.08094, over 4273303.26 frames. ], batch size: 263, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:50:54,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=790374.0, ans=0.125 2023-06-21 14:51:25,939 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.430e+02 2.788e+02 3.260e+02 5.197e+02, threshold=5.575e+02, percent-clipped=0.0 2023-06-21 14:52:19,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=790614.0, ans=0.04949747468305833 2023-06-21 14:52:45,329 INFO [train.py:996] (2/4) Epoch 5, batch 9800, loss[loss=0.232, simple_loss=0.3216, pruned_loss=0.07118, over 21829.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3084, pruned_loss=0.08178, over 4274558.57 frames. ], batch size: 124, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:53:07,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=790674.0, ans=0.125 2023-06-21 14:53:27,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=790734.0, ans=0.0 2023-06-21 14:53:27,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=790734.0, ans=0.125 2023-06-21 14:53:28,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=790734.0, ans=0.125 2023-06-21 14:53:40,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=790734.0, ans=0.2 2023-06-21 14:54:10,321 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-06-21 14:54:13,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-21 14:54:29,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. 
limit=10.0 2023-06-21 14:54:38,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=790914.0, ans=0.0 2023-06-21 14:54:42,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=790914.0, ans=0.0 2023-06-21 14:54:51,736 INFO [train.py:996] (2/4) Epoch 5, batch 9850, loss[loss=0.2046, simple_loss=0.2629, pruned_loss=0.07311, over 21651.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3063, pruned_loss=0.08158, over 4254139.00 frames. ], batch size: 247, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 14:54:57,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=790974.0, ans=0.0 2023-06-21 14:55:42,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.409e+02 2.659e+02 3.112e+02 4.458e+02, threshold=5.319e+02, percent-clipped=0.0 2023-06-21 14:56:06,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=791094.0, ans=0.125 2023-06-21 14:56:44,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=791214.0, ans=0.5 2023-06-21 14:57:02,488 INFO [train.py:996] (2/4) Epoch 5, batch 9900, loss[loss=0.2642, simple_loss=0.332, pruned_loss=0.09826, over 21661.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3029, pruned_loss=0.08085, over 4255254.03 frames. ], batch size: 441, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 14:57:47,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=791334.0, ans=0.125 2023-06-21 14:58:17,887 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-21 14:58:42,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=791454.0, ans=0.125 2023-06-21 14:59:03,122 INFO [train.py:996] (2/4) Epoch 5, batch 9950, loss[loss=0.2171, simple_loss=0.2743, pruned_loss=0.08001, over 21319.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3052, pruned_loss=0.08311, over 4256105.90 frames. ], batch size: 211, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 14:59:09,787 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-21 15:00:12,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.669e+02 3.087e+02 3.517e+02 5.049e+02, threshold=6.174e+02, percent-clipped=0.0 2023-06-21 15:01:24,108 INFO [train.py:996] (2/4) Epoch 5, batch 10000, loss[loss=0.194, simple_loss=0.2569, pruned_loss=0.06557, over 21082.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3007, pruned_loss=0.08175, over 4257085.47 frames. ], batch size: 143, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:02:43,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=791994.0, ans=0.2 2023-06-21 15:03:02,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. 
limit=15.0 2023-06-21 15:03:03,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=792054.0, ans=0.0 2023-06-21 15:03:34,531 INFO [train.py:996] (2/4) Epoch 5, batch 10050, loss[loss=0.1898, simple_loss=0.2632, pruned_loss=0.05823, over 21437.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.302, pruned_loss=0.08154, over 4257473.57 frames. ], batch size: 194, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:04:01,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=792174.0, ans=0.125 2023-06-21 15:04:41,625 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.500e+02 2.849e+02 3.392e+02 5.365e+02, threshold=5.698e+02, percent-clipped=0.0 2023-06-21 15:05:03,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=792294.0, ans=0.125 2023-06-21 15:05:08,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=792354.0, ans=0.125 2023-06-21 15:05:52,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-21 15:05:53,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=792414.0, ans=0.0 2023-06-21 15:06:11,530 INFO [train.py:996] (2/4) Epoch 5, batch 10100, loss[loss=0.2388, simple_loss=0.3333, pruned_loss=0.0721, over 21243.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2983, pruned_loss=0.07894, over 4266673.87 frames. ], batch size: 548, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:06:16,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=792474.0, ans=0.02 2023-06-21 15:06:48,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=792534.0, ans=0.2 2023-06-21 15:07:23,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=792594.0, ans=0.125 2023-06-21 15:07:40,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=792654.0, ans=0.1 2023-06-21 15:08:30,819 INFO [train.py:996] (2/4) Epoch 5, batch 10150, loss[loss=0.2382, simple_loss=0.3115, pruned_loss=0.08244, over 21567.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3051, pruned_loss=0.08181, over 4264292.52 frames. ], batch size: 414, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:08:34,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=792774.0, ans=0.125 2023-06-21 15:08:36,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=792774.0, ans=0.125 2023-06-21 15:09:16,249 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 2.586e+02 2.981e+02 3.713e+02 5.514e+02, threshold=5.962e+02, percent-clipped=0.0 2023-06-21 15:10:42,456 INFO [train.py:996] (2/4) Epoch 5, batch 10200, loss[loss=0.1794, simple_loss=0.2481, pruned_loss=0.05538, over 16329.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3052, pruned_loss=0.08064, over 4263584.31 frames. 
], batch size: 63, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:11:24,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=793134.0, ans=0.05 2023-06-21 15:11:38,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=22.5 2023-06-21 15:12:29,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=793314.0, ans=0.0 2023-06-21 15:12:58,743 INFO [train.py:996] (2/4) Epoch 5, batch 10250, loss[loss=0.1858, simple_loss=0.2733, pruned_loss=0.04912, over 21336.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3006, pruned_loss=0.07547, over 4255820.31 frames. ], batch size: 194, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:13:09,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=793374.0, ans=0.0 2023-06-21 15:13:09,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=793374.0, ans=0.125 2023-06-21 15:13:43,069 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 2.089e+02 2.610e+02 3.127e+02 4.884e+02, threshold=5.220e+02, percent-clipped=0.0 2023-06-21 15:14:06,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=793494.0, ans=0.125 2023-06-21 15:14:36,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-21 15:15:11,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=793614.0, ans=0.125 2023-06-21 15:15:13,520 INFO [train.py:996] (2/4) Epoch 5, batch 10300, loss[loss=0.2368, simple_loss=0.3078, pruned_loss=0.08292, over 21265.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3022, pruned_loss=0.07668, over 4256501.98 frames. ], batch size: 176, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:15:21,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=793674.0, ans=0.1 2023-06-21 15:16:06,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=793734.0, ans=0.125 2023-06-21 15:17:41,405 INFO [train.py:996] (2/4) Epoch 5, batch 10350, loss[loss=0.2037, simple_loss=0.2748, pruned_loss=0.0663, over 21680.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3047, pruned_loss=0.07697, over 4258102.36 frames. ], batch size: 298, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:18:08,721 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.871e+02 3.499e+02 4.355e+02 9.193e+02, threshold=6.998e+02, percent-clipped=17.0 2023-06-21 15:18:28,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=794094.0, ans=0.0 2023-06-21 15:19:40,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=794214.0, ans=0.125 2023-06-21 15:19:49,808 INFO [train.py:996] (2/4) Epoch 5, batch 10400, loss[loss=0.2113, simple_loss=0.287, pruned_loss=0.06776, over 21914.00 frames. 
], tot_loss[loss=0.2241, simple_loss=0.2974, pruned_loss=0.07539, over 4261360.99 frames. ], batch size: 373, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:19:58,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-21 15:20:43,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=794394.0, ans=0.0 2023-06-21 15:21:46,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=794454.0, ans=0.0 2023-06-21 15:22:05,754 INFO [train.py:996] (2/4) Epoch 5, batch 10450, loss[loss=0.2647, simple_loss=0.3377, pruned_loss=0.0959, over 21829.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3015, pruned_loss=0.07849, over 4257643.12 frames. ], batch size: 124, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:22:09,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=794574.0, ans=0.0 2023-06-21 15:22:25,460 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:22:46,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=794634.0, ans=0.1 2023-06-21 15:23:13,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.562e+02 2.841e+02 3.622e+02 6.027e+02, threshold=5.681e+02, percent-clipped=0.0 2023-06-21 15:23:18,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-21 15:23:37,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=794694.0, ans=0.0 2023-06-21 15:24:27,790 INFO [train.py:996] (2/4) Epoch 5, batch 10500, loss[loss=0.2023, simple_loss=0.2672, pruned_loss=0.06874, over 21248.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3003, pruned_loss=0.07674, over 4261194.09 frames. ], batch size: 159, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:25:18,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=794934.0, ans=0.1 2023-06-21 15:25:21,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=794994.0, ans=0.125 2023-06-21 15:25:30,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=794994.0, ans=0.125 2023-06-21 15:26:19,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-21 15:26:36,374 INFO [train.py:996] (2/4) Epoch 5, batch 10550, loss[loss=0.1901, simple_loss=0.2533, pruned_loss=0.06343, over 21416.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2944, pruned_loss=0.07552, over 4251659.08 frames. 
], batch size: 212, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:27:03,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=795234.0, ans=0.1 2023-06-21 15:27:28,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=795234.0, ans=0.015 2023-06-21 15:27:31,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.401e+02 2.781e+02 3.246e+02 4.477e+02, threshold=5.561e+02, percent-clipped=0.0 2023-06-21 15:28:19,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=15.0 2023-06-21 15:28:22,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=795414.0, ans=0.2 2023-06-21 15:28:22,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=795414.0, ans=0.1 2023-06-21 15:28:39,019 INFO [train.py:996] (2/4) Epoch 5, batch 10600, loss[loss=0.1944, simple_loss=0.2832, pruned_loss=0.05278, over 21719.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2901, pruned_loss=0.07382, over 4253990.88 frames. ], batch size: 247, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:30:05,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=795654.0, ans=0.2 2023-06-21 15:30:24,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=795654.0, ans=0.0 2023-06-21 15:30:46,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=795714.0, ans=0.0 2023-06-21 15:31:11,930 INFO [train.py:996] (2/4) Epoch 5, batch 10650, loss[loss=0.1975, simple_loss=0.285, pruned_loss=0.05494, over 21576.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2937, pruned_loss=0.07281, over 4253249.25 frames. ], batch size: 389, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:32:11,570 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.303e+02 2.833e+02 3.261e+02 4.754e+02, threshold=5.666e+02, percent-clipped=0.0 2023-06-21 15:32:22,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.03 vs. limit=12.0 2023-06-21 15:33:00,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-21 15:33:24,934 INFO [train.py:996] (2/4) Epoch 5, batch 10700, loss[loss=0.2531, simple_loss=0.3242, pruned_loss=0.09104, over 21716.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2932, pruned_loss=0.07343, over 4252154.66 frames. 
], batch size: 298, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:33:53,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=796134.0, ans=0.125 2023-06-21 15:34:04,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=796134.0, ans=0.025 2023-06-21 15:34:46,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=796194.0, ans=0.95 2023-06-21 15:34:53,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.63 vs. limit=15.0 2023-06-21 15:35:54,018 INFO [train.py:996] (2/4) Epoch 5, batch 10750, loss[loss=0.2941, simple_loss=0.3866, pruned_loss=0.1008, over 21556.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3046, pruned_loss=0.0774, over 4256158.43 frames. ], batch size: 471, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:36:04,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=796374.0, ans=0.1 2023-06-21 15:36:30,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=796434.0, ans=0.125 2023-06-21 15:36:32,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=796434.0, ans=0.035 2023-06-21 15:36:33,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=796434.0, ans=0.1 2023-06-21 15:36:38,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=796434.0, ans=0.1 2023-06-21 15:36:39,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.624e+02 2.949e+02 3.817e+02 5.681e+02, threshold=5.899e+02, percent-clipped=1.0 2023-06-21 15:36:51,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=796494.0, ans=0.125 2023-06-21 15:37:40,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=796554.0, ans=0.05 2023-06-21 15:37:49,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=796614.0, ans=0.2 2023-06-21 15:37:49,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=12.0 2023-06-21 15:38:28,030 INFO [train.py:996] (2/4) Epoch 5, batch 10800, loss[loss=0.2612, simple_loss=0.3332, pruned_loss=0.09454, over 21568.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3099, pruned_loss=0.07797, over 4259849.37 frames. 
], batch size: 230, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:38:31,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=796674.0, ans=0.2 2023-06-21 15:39:36,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=796794.0, ans=0.0 2023-06-21 15:40:03,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=796794.0, ans=0.125 2023-06-21 15:40:17,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.09 vs. limit=6.0 2023-06-21 15:40:48,670 INFO [train.py:996] (2/4) Epoch 5, batch 10850, loss[loss=0.2173, simple_loss=0.2895, pruned_loss=0.07256, over 21999.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3092, pruned_loss=0.07841, over 4265611.74 frames. ], batch size: 103, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:40:49,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=796974.0, ans=0.0 2023-06-21 15:41:27,124 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.616e+02 2.809e+02 3.256e+02 4.598e+02, threshold=5.618e+02, percent-clipped=0.0 2023-06-21 15:41:51,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=797094.0, ans=0.04949747468305833 2023-06-21 15:42:33,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=797154.0, ans=0.1 2023-06-21 15:42:35,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=797154.0, ans=0.04949747468305833 2023-06-21 15:42:50,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=797214.0, ans=0.04949747468305833 2023-06-21 15:43:00,583 INFO [train.py:996] (2/4) Epoch 5, batch 10900, loss[loss=0.2074, simple_loss=0.292, pruned_loss=0.06141, over 21376.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3027, pruned_loss=0.07586, over 4267182.58 frames. ], batch size: 211, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:43:57,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=797394.0, ans=0.125 2023-06-21 15:44:30,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=797454.0, ans=0.0 2023-06-21 15:44:45,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=797514.0, ans=0.1 2023-06-21 15:45:02,684 INFO [train.py:996] (2/4) Epoch 5, batch 10950, loss[loss=0.2095, simple_loss=0.2714, pruned_loss=0.07377, over 21414.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2988, pruned_loss=0.07444, over 4266194.75 frames. 
], batch size: 194, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:45:19,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=797574.0, ans=0.125 2023-06-21 15:45:27,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=797634.0, ans=0.025 2023-06-21 15:45:44,738 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.495e+02 2.960e+02 3.280e+02 5.814e+02, threshold=5.920e+02, percent-clipped=2.0 2023-06-21 15:47:12,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=797874.0, ans=0.125 2023-06-21 15:47:13,637 INFO [train.py:996] (2/4) Epoch 5, batch 11000, loss[loss=0.2875, simple_loss=0.3297, pruned_loss=0.1226, over 21728.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.297, pruned_loss=0.07581, over 4267967.57 frames. ], batch size: 508, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:47:56,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=797934.0, ans=0.0 2023-06-21 15:47:56,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=797934.0, ans=0.125 2023-06-21 15:47:59,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-21 15:48:36,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=797994.0, ans=0.125 2023-06-21 15:49:06,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=22.5 2023-06-21 15:49:26,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=22.5 2023-06-21 15:49:31,655 INFO [train.py:996] (2/4) Epoch 5, batch 11050, loss[loss=0.2377, simple_loss=0.2979, pruned_loss=0.08874, over 14964.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2954, pruned_loss=0.07733, over 4269200.79 frames. ], batch size: 61, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:50:25,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.567e+02 2.755e+02 3.188e+02 5.366e+02, threshold=5.510e+02, percent-clipped=0.0 2023-06-21 15:50:33,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=798294.0, ans=0.125 2023-06-21 15:50:43,574 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:51:34,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=798414.0, ans=0.2 2023-06-21 15:51:34,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=798414.0, ans=0.125 2023-06-21 15:51:44,981 INFO [train.py:996] (2/4) Epoch 5, batch 11100, loss[loss=0.204, simple_loss=0.28, pruned_loss=0.06398, over 21755.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2959, pruned_loss=0.07829, over 4269142.37 frames. 
], batch size: 371, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:52:23,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=798534.0, ans=0.125 2023-06-21 15:52:49,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.94 vs. limit=22.5 2023-06-21 15:53:44,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-21 15:53:59,004 INFO [train.py:996] (2/4) Epoch 5, batch 11150, loss[loss=0.2245, simple_loss=0.2919, pruned_loss=0.07858, over 21174.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2949, pruned_loss=0.07778, over 4255410.50 frames. ], batch size: 548, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 15:54:52,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.453e+02 2.788e+02 3.224e+02 5.463e+02, threshold=5.576e+02, percent-clipped=0.0 2023-06-21 15:55:02,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=798894.0, ans=0.2 2023-06-21 15:56:13,271 INFO [train.py:996] (2/4) Epoch 5, batch 11200, loss[loss=0.2235, simple_loss=0.2899, pruned_loss=0.07858, over 21538.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.293, pruned_loss=0.07775, over 4262120.40 frames. ], batch size: 414, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 15:58:02,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=799254.0, ans=0.05 2023-06-21 15:58:17,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-21 15:58:25,662 INFO [train.py:996] (2/4) Epoch 5, batch 11250, loss[loss=0.2047, simple_loss=0.2818, pruned_loss=0.06382, over 21787.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2925, pruned_loss=0.07762, over 4247695.14 frames. ], batch size: 317, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 15:59:15,276 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.425e+02 2.726e+02 3.130e+02 5.032e+02, threshold=5.452e+02, percent-clipped=0.0 2023-06-21 15:59:56,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=799554.0, ans=0.125 2023-06-21 16:00:34,385 INFO [train.py:996] (2/4) Epoch 5, batch 11300, loss[loss=0.203, simple_loss=0.2813, pruned_loss=0.06237, over 21797.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2932, pruned_loss=0.07711, over 4256171.52 frames. 
], batch size: 332, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 16:01:21,457 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:01:22,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=799734.0, ans=0.0 2023-06-21 16:01:39,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=799794.0, ans=0.1 2023-06-21 16:02:34,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=799914.0, ans=0.125 2023-06-21 16:02:50,115 INFO [train.py:996] (2/4) Epoch 5, batch 11350, loss[loss=0.2328, simple_loss=0.3129, pruned_loss=0.07641, over 21690.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2948, pruned_loss=0.07645, over 4256892.35 frames. ], batch size: 298, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:03:35,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=800034.0, ans=0.2 2023-06-21 16:03:52,039 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.490e+02 2.826e+02 3.230e+02 4.921e+02, threshold=5.651e+02, percent-clipped=0.0 2023-06-21 16:05:18,315 INFO [train.py:996] (2/4) Epoch 5, batch 11400, loss[loss=0.2318, simple_loss=0.3228, pruned_loss=0.07036, over 21758.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3009, pruned_loss=0.07909, over 4257141.13 frames. ], batch size: 332, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:05:18,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=800274.0, ans=0.125 2023-06-21 16:06:29,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=800394.0, ans=0.0 2023-06-21 16:07:37,438 INFO [train.py:996] (2/4) Epoch 5, batch 11450, loss[loss=0.231, simple_loss=0.31, pruned_loss=0.07603, over 21724.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2998, pruned_loss=0.0771, over 4256880.02 frames. ], batch size: 332, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:08:16,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800634.0, ans=0.1 2023-06-21 16:08:44,906 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.457e+02 2.800e+02 3.196e+02 5.475e+02, threshold=5.600e+02, percent-clipped=0.0 2023-06-21 16:08:59,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-21 16:09:29,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-21 16:09:46,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=800814.0, ans=0.0 2023-06-21 16:09:46,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=800814.0, ans=0.125 2023-06-21 16:09:50,523 INFO [train.py:996] (2/4) Epoch 5, batch 11500, loss[loss=0.22, simple_loss=0.3146, pruned_loss=0.06273, over 21672.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3027, pruned_loss=0.07804, over 4257930.40 frames. 
], batch size: 389, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:09:55,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=800874.0, ans=0.125 2023-06-21 16:09:57,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=800874.0, ans=0.035 2023-06-21 16:11:41,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=801054.0, ans=10.0 2023-06-21 16:12:26,614 INFO [train.py:996] (2/4) Epoch 5, batch 11550, loss[loss=0.3933, simple_loss=0.4776, pruned_loss=0.1545, over 21452.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3099, pruned_loss=0.07843, over 4262354.42 frames. ], batch size: 508, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:13:41,602 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.641e+02 3.057e+02 3.432e+02 5.620e+02, threshold=6.114e+02, percent-clipped=1.0 2023-06-21 16:13:45,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=801294.0, ans=0.125 2023-06-21 16:14:12,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=801294.0, ans=0.1 2023-06-21 16:14:14,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=801294.0, ans=0.0 2023-06-21 16:14:47,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.05 vs. limit=10.0 2023-06-21 16:14:58,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-21 16:15:01,401 INFO [train.py:996] (2/4) Epoch 5, batch 11600, loss[loss=0.2813, simple_loss=0.378, pruned_loss=0.09233, over 21765.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3241, pruned_loss=0.08118, over 4261230.01 frames. ], batch size: 332, lr: 6.31e-03, grad_scale: 32.0 2023-06-21 16:15:43,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=801534.0, ans=15.0 2023-06-21 16:15:44,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=801534.0, ans=0.125 2023-06-21 16:16:04,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=22.5 2023-06-21 16:17:14,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=801714.0, ans=0.125 2023-06-21 16:17:18,047 INFO [train.py:996] (2/4) Epoch 5, batch 11650, loss[loss=0.2341, simple_loss=0.3031, pruned_loss=0.08257, over 21818.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3311, pruned_loss=0.08248, over 4267295.96 frames. 
], batch size: 107, lr: 6.31e-03, grad_scale: 32.0 2023-06-21 16:18:16,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=801834.0, ans=0.125 2023-06-21 16:18:27,542 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.640e+02 3.062e+02 3.776e+02 6.699e+02, threshold=6.124e+02, percent-clipped=2.0 2023-06-21 16:18:48,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=801954.0, ans=0.5 2023-06-21 16:18:50,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=12.0 2023-06-21 16:19:28,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=802014.0, ans=0.125 2023-06-21 16:19:48,622 INFO [train.py:996] (2/4) Epoch 5, batch 11700, loss[loss=0.1977, simple_loss=0.263, pruned_loss=0.06618, over 21590.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3212, pruned_loss=0.08184, over 4272796.82 frames. ], batch size: 298, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:20:10,436 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:20:26,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=802134.0, ans=0.2 2023-06-21 16:21:59,038 INFO [train.py:996] (2/4) Epoch 5, batch 11750, loss[loss=0.2471, simple_loss=0.3137, pruned_loss=0.09026, over 21868.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3125, pruned_loss=0.08122, over 4277786.89 frames. ], batch size: 317, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:22:47,696 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.493e+02 2.846e+02 3.170e+02 4.478e+02, threshold=5.693e+02, percent-clipped=0.0 2023-06-21 16:23:56,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=802614.0, ans=0.125 2023-06-21 16:24:20,699 INFO [train.py:996] (2/4) Epoch 5, batch 11800, loss[loss=0.2408, simple_loss=0.3292, pruned_loss=0.07624, over 21629.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3142, pruned_loss=0.08266, over 4277837.60 frames. ], batch size: 389, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:24:43,516 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:25:21,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=802794.0, ans=0.125 2023-06-21 16:26:37,621 INFO [train.py:996] (2/4) Epoch 5, batch 11850, loss[loss=0.2436, simple_loss=0.3302, pruned_loss=0.07846, over 21766.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.316, pruned_loss=0.08169, over 4274513.01 frames. 
], batch size: 414, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:26:47,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=802974.0, ans=15.0 2023-06-21 16:27:35,194 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.385e+02 2.718e+02 3.148e+02 5.334e+02, threshold=5.436e+02, percent-clipped=0.0 2023-06-21 16:28:21,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-21 16:29:02,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=803214.0, ans=10.0 2023-06-21 16:29:11,174 INFO [train.py:996] (2/4) Epoch 5, batch 11900, loss[loss=0.2254, simple_loss=0.2878, pruned_loss=0.08154, over 21175.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3158, pruned_loss=0.07928, over 4273118.14 frames. ], batch size: 159, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:29:33,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=803334.0, ans=0.0 2023-06-21 16:31:11,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=803514.0, ans=0.125 2023-06-21 16:31:16,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=803514.0, ans=0.125 2023-06-21 16:31:19,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=803514.0, ans=0.5 2023-06-21 16:31:26,064 INFO [train.py:996] (2/4) Epoch 5, batch 11950, loss[loss=0.169, simple_loss=0.2542, pruned_loss=0.04186, over 21555.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3163, pruned_loss=0.0769, over 4276136.45 frames. ], batch size: 230, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:31:38,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=803574.0, ans=0.125 2023-06-21 16:31:47,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=803574.0, ans=0.2 2023-06-21 16:32:18,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=803634.0, ans=0.0 2023-06-21 16:32:21,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.301e+02 2.718e+02 3.108e+02 3.993e+02, threshold=5.436e+02, percent-clipped=0.0 2023-06-21 16:33:28,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=803814.0, ans=0.2 2023-06-21 16:33:28,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=12.0 2023-06-21 16:33:39,362 INFO [train.py:996] (2/4) Epoch 5, batch 12000, loss[loss=0.2056, simple_loss=0.2668, pruned_loss=0.07224, over 21196.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3114, pruned_loss=0.07488, over 4273052.85 frames. ], batch size: 176, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:33:39,362 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 16:34:35,000 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2672, simple_loss=0.3583, pruned_loss=0.08803, over 1796401.00 frames. 
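Note on the reporting format above: each training entry gives a per-batch loss plus a running "tot_loss[... over N frames]" whose frame count grows across batches, and the validation entry just logged is an average over a fixed 1796401-frame dev set. A minimal sketch of how such frame-weighted running averages could be produced is shown below; this is an illustrative stand-in only, not the actual train.py/icefall code, and the RunningLoss helper is a hypothetical name.

    # Hypothetical helper, for illustration of the "tot_loss[... over N frames]" style only.
    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class RunningLoss:
        """Accumulates frame-weighted sums of loss components."""
        sums: Dict[str, float] = field(default_factory=dict)
        frames: float = 0.0

        def update(self, batch_losses: Dict[str, float], num_frames: float) -> None:
            # Each per-batch loss is assumed to already be an average over that
            # batch's frames, so it is re-weighted by the batch frame count.
            for name, value in batch_losses.items():
                self.sums[name] = self.sums.get(name, 0.0) + value * num_frames
            self.frames += num_frames

        def averages(self) -> Dict[str, float]:
            # Frame-weighted averages over everything accumulated so far.
            return {name: s / self.frames for name, s in self.sums.items()}

    if __name__ == "__main__":
        tracker = RunningLoss()
        # Two made-up batches with (loss, simple_loss, pruned_loss) averages and frame counts.
        tracker.update({"loss": 0.21, "simple_loss": 0.29, "pruned_loss": 0.068}, num_frames=21914)
        tracker.update({"loss": 0.24, "simple_loss": 0.31, "pruned_loss": 0.081}, num_frames=19870)
        print(tracker.averages())

Under this reading, the tot_loss figures in the surrounding entries are simply the averages() output after each new batch is folded in, which is why the frame count in brackets keeps increasing within an epoch.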
2023-06-21 16:34:35,001 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 16:36:20,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.15 vs. limit=10.0 2023-06-21 16:36:25,242 INFO [train.py:996] (2/4) Epoch 5, batch 12050, loss[loss=0.2404, simple_loss=0.3008, pruned_loss=0.08998, over 21412.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3076, pruned_loss=0.07721, over 4279920.16 frames. ], batch size: 177, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:37:02,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=804234.0, ans=0.1 2023-06-21 16:37:26,744 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.690e+02 3.066e+02 3.586e+02 5.948e+02, threshold=6.132e+02, percent-clipped=2.0 2023-06-21 16:37:58,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=804354.0, ans=0.125 2023-06-21 16:38:41,154 INFO [train.py:996] (2/4) Epoch 5, batch 12100, loss[loss=0.2677, simple_loss=0.3425, pruned_loss=0.09646, over 21162.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3131, pruned_loss=0.08152, over 4281391.48 frames. ], batch size: 143, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:39:10,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=804534.0, ans=0.2 2023-06-21 16:39:29,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=804534.0, ans=0.125 2023-06-21 16:39:29,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.15 vs. limit=10.0 2023-06-21 16:40:39,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=804654.0, ans=0.1 2023-06-21 16:41:30,848 INFO [train.py:996] (2/4) Epoch 5, batch 12150, loss[loss=0.2039, simple_loss=0.2838, pruned_loss=0.062, over 21230.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.317, pruned_loss=0.08132, over 4280160.78 frames. 
], batch size: 176, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:42:08,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=804834.0, ans=0.125 2023-06-21 16:42:18,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=804894.0, ans=0.125 2023-06-21 16:42:19,168 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.609e+02 3.178e+02 3.769e+02 6.443e+02, threshold=6.356e+02, percent-clipped=2.0 2023-06-21 16:42:48,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=804954.0, ans=0.125 2023-06-21 16:43:38,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=805014.0, ans=0.09899494936611666 2023-06-21 16:43:40,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=805074.0, ans=0.125 2023-06-21 16:43:41,192 INFO [train.py:996] (2/4) Epoch 5, batch 12200, loss[loss=0.2448, simple_loss=0.2917, pruned_loss=0.09897, over 21227.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3138, pruned_loss=0.07959, over 4281238.91 frames. ], batch size: 471, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:44:15,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-21 16:44:23,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=805194.0, ans=0.2 2023-06-21 16:45:12,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-21 16:45:23,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=805314.0, ans=0.1 2023-06-21 16:45:25,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=805314.0, ans=0.0 2023-06-21 16:45:35,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=805314.0, ans=0.125 2023-06-21 16:45:53,348 INFO [train.py:996] (2/4) Epoch 5, batch 12250, loss[loss=0.2129, simple_loss=0.2971, pruned_loss=0.06431, over 21669.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3051, pruned_loss=0.07654, over 4289174.04 frames. 
], batch size: 391, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:45:56,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=805374.0, ans=0.04949747468305833 2023-06-21 16:46:05,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=805374.0, ans=0.1 2023-06-21 16:46:28,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=805434.0, ans=0.2 2023-06-21 16:46:32,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=805434.0, ans=0.1 2023-06-21 16:46:35,364 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 2.451e+02 2.848e+02 3.373e+02 5.263e+02, threshold=5.696e+02, percent-clipped=0.0 2023-06-21 16:46:38,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=805494.0, ans=0.125 2023-06-21 16:46:58,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=805494.0, ans=0.0 2023-06-21 16:47:51,386 INFO [train.py:996] (2/4) Epoch 5, batch 12300, loss[loss=0.1551, simple_loss=0.2362, pruned_loss=0.03702, over 21773.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2971, pruned_loss=0.07187, over 4286661.54 frames. ], batch size: 124, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:47:59,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-21 16:48:00,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=805674.0, ans=0.0 2023-06-21 16:48:17,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=805674.0, ans=0.125 2023-06-21 16:50:30,867 INFO [train.py:996] (2/4) Epoch 5, batch 12350, loss[loss=0.2345, simple_loss=0.3111, pruned_loss=0.0789, over 21888.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3008, pruned_loss=0.07249, over 4286301.98 frames. ], batch size: 118, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:50:52,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=806034.0, ans=0.125 2023-06-21 16:51:10,993 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 2.374e+02 2.755e+02 3.213e+02 5.680e+02, threshold=5.510e+02, percent-clipped=0.0 2023-06-21 16:51:54,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=806154.0, ans=0.125 2023-06-21 16:52:11,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=806214.0, ans=0.0 2023-06-21 16:52:12,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=806214.0, ans=10.0 2023-06-21 16:52:16,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.93 vs. limit=22.5 2023-06-21 16:52:32,123 INFO [train.py:996] (2/4) Epoch 5, batch 12400, loss[loss=0.2586, simple_loss=0.3242, pruned_loss=0.09656, over 21847.00 frames. 
], tot_loss[loss=0.2273, simple_loss=0.3038, pruned_loss=0.07537, over 4284056.34 frames. ], batch size: 371, lr: 6.29e-03, grad_scale: 32.0 2023-06-21 16:54:45,665 INFO [train.py:996] (2/4) Epoch 5, batch 12450, loss[loss=0.2798, simple_loss=0.3559, pruned_loss=0.1019, over 21379.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3074, pruned_loss=0.07798, over 4284318.37 frames. ], batch size: 131, lr: 6.29e-03, grad_scale: 32.0 2023-06-21 16:55:57,447 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.571e+02 2.934e+02 3.546e+02 5.506e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-21 16:57:17,120 INFO [train.py:996] (2/4) Epoch 5, batch 12500, loss[loss=0.2742, simple_loss=0.362, pruned_loss=0.09322, over 21483.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.318, pruned_loss=0.08088, over 4281464.28 frames. ], batch size: 194, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 16:57:17,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=806874.0, ans=0.0 2023-06-21 16:57:37,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-21 16:59:36,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=807114.0, ans=0.0 2023-06-21 16:59:45,086 INFO [train.py:996] (2/4) Epoch 5, batch 12550, loss[loss=0.2435, simple_loss=0.3337, pruned_loss=0.07663, over 21730.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3227, pruned_loss=0.08362, over 4283754.15 frames. ], batch size: 332, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 16:59:46,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=807174.0, ans=0.125 2023-06-21 17:00:06,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=807234.0, ans=0.04949747468305833 2023-06-21 17:00:08,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=807234.0, ans=0.125 2023-06-21 17:00:08,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=807234.0, ans=0.0 2023-06-21 17:00:39,259 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.629e+02 2.998e+02 3.510e+02 7.002e+02, threshold=5.996e+02, percent-clipped=1.0 2023-06-21 17:01:34,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=807414.0, ans=0.125 2023-06-21 17:01:36,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=807414.0, ans=0.125 2023-06-21 17:01:56,445 INFO [train.py:996] (2/4) Epoch 5, batch 12600, loss[loss=0.2569, simple_loss=0.3666, pruned_loss=0.07362, over 20787.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3215, pruned_loss=0.08123, over 4282536.06 frames. ], batch size: 608, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:02:19,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=807474.0, ans=0.0 2023-06-21 17:04:03,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. 
limit=22.5 2023-06-21 17:04:10,528 INFO [train.py:996] (2/4) Epoch 5, batch 12650, loss[loss=0.2417, simple_loss=0.3042, pruned_loss=0.08958, over 21312.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3129, pruned_loss=0.07763, over 4275129.75 frames. ], batch size: 176, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:04:12,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=807774.0, ans=0.1 2023-06-21 17:05:09,546 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.417e+02 2.707e+02 3.120e+02 6.136e+02, threshold=5.414e+02, percent-clipped=1.0 2023-06-21 17:06:18,780 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-21 17:06:32,811 INFO [train.py:996] (2/4) Epoch 5, batch 12700, loss[loss=0.2123, simple_loss=0.2841, pruned_loss=0.07021, over 21068.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3116, pruned_loss=0.0797, over 4278149.51 frames. ], batch size: 608, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:06:33,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=808074.0, ans=0.125 2023-06-21 17:07:01,842 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-21 17:08:11,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-21 17:08:29,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-21 17:08:45,689 INFO [train.py:996] (2/4) Epoch 5, batch 12750, loss[loss=0.249, simple_loss=0.3201, pruned_loss=0.08894, over 21872.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3141, pruned_loss=0.08087, over 4278781.75 frames. ], batch size: 107, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:08:58,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=808374.0, ans=0.05 2023-06-21 17:09:01,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=808374.0, ans=0.125 2023-06-21 17:09:02,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=808374.0, ans=0.1 2023-06-21 17:09:38,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=808434.0, ans=0.2 2023-06-21 17:09:43,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.507e+02 2.930e+02 3.517e+02 6.177e+02, threshold=5.859e+02, percent-clipped=3.0 2023-06-21 17:09:48,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=808494.0, ans=0.015 2023-06-21 17:10:53,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=808614.0, ans=0.035 2023-06-21 17:10:55,726 INFO [train.py:996] (2/4) Epoch 5, batch 12800, loss[loss=0.2246, simple_loss=0.291, pruned_loss=0.07912, over 21558.00 frames. 
], tot_loss[loss=0.2376, simple_loss=0.313, pruned_loss=0.08105, over 4282884.10 frames. ], batch size: 548, lr: 6.29e-03, grad_scale: 32.0 2023-06-21 17:11:39,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=808734.0, ans=10.0 2023-06-21 17:12:27,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=808794.0, ans=0.125 2023-06-21 17:12:30,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=808794.0, ans=0.0 2023-06-21 17:12:40,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=808854.0, ans=0.1 2023-06-21 17:13:22,402 INFO [train.py:996] (2/4) Epoch 5, batch 12850, loss[loss=0.2111, simple_loss=0.3091, pruned_loss=0.05653, over 21729.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3161, pruned_loss=0.08231, over 4277967.38 frames. ], batch size: 351, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:14:02,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809034.0, ans=0.1 2023-06-21 17:14:34,768 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 2.342e+02 2.616e+02 2.870e+02 3.698e+02, threshold=5.233e+02, percent-clipped=0.0 2023-06-21 17:15:18,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=809214.0, ans=0.0 2023-06-21 17:15:48,056 INFO [train.py:996] (2/4) Epoch 5, batch 12900, loss[loss=0.1953, simple_loss=0.2752, pruned_loss=0.05776, over 21567.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3131, pruned_loss=0.07883, over 4281575.66 frames. ], batch size: 212, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:17:03,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=809394.0, ans=0.125 2023-06-21 17:17:06,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=809394.0, ans=0.2 2023-06-21 17:17:23,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=809454.0, ans=0.09899494936611666 2023-06-21 17:17:33,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=809454.0, ans=0.125 2023-06-21 17:18:10,348 INFO [train.py:996] (2/4) Epoch 5, batch 12950, loss[loss=0.2292, simple_loss=0.316, pruned_loss=0.07125, over 21719.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.312, pruned_loss=0.07688, over 4278553.03 frames. ], batch size: 298, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:18:14,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=809574.0, ans=0.0 2023-06-21 17:18:15,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. 
limit=15.0 2023-06-21 17:19:10,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.360e+02 2.684e+02 3.163e+02 5.049e+02, threshold=5.368e+02, percent-clipped=0.0 2023-06-21 17:19:32,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=809754.0, ans=0.125 2023-06-21 17:20:24,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=809874.0, ans=0.1 2023-06-21 17:20:25,663 INFO [train.py:996] (2/4) Epoch 5, batch 13000, loss[loss=0.1802, simple_loss=0.2504, pruned_loss=0.05502, over 21143.00 frames. ], tot_loss[loss=0.233, simple_loss=0.312, pruned_loss=0.07697, over 4278373.49 frames. ], batch size: 143, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:21:04,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=809934.0, ans=0.025 2023-06-21 17:22:56,632 INFO [train.py:996] (2/4) Epoch 5, batch 13050, loss[loss=0.2375, simple_loss=0.3085, pruned_loss=0.08328, over 21757.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3102, pruned_loss=0.07526, over 4261299.72 frames. ], batch size: 389, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:23:39,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-21 17:23:39,773 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.464e+02 2.848e+02 3.249e+02 5.080e+02, threshold=5.696e+02, percent-clipped=0.0 2023-06-21 17:24:11,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=810294.0, ans=0.0 2023-06-21 17:24:45,926 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:24:48,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=810414.0, ans=0.1 2023-06-21 17:25:03,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=810414.0, ans=0.125 2023-06-21 17:25:16,441 INFO [train.py:996] (2/4) Epoch 5, batch 13100, loss[loss=0.2158, simple_loss=0.3007, pruned_loss=0.06542, over 21729.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3112, pruned_loss=0.07556, over 4270162.80 frames. ], batch size: 247, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:25:18,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-21 17:25:24,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=810474.0, ans=0.0 2023-06-21 17:25:59,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=810534.0, ans=0.125 2023-06-21 17:26:29,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. 
limit=22.5 2023-06-21 17:26:36,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=810594.0, ans=0.2 2023-06-21 17:27:12,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=810654.0, ans=0.0 2023-06-21 17:27:26,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=12.0 2023-06-21 17:27:33,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=810714.0, ans=0.125 2023-06-21 17:27:37,121 INFO [train.py:996] (2/4) Epoch 5, batch 13150, loss[loss=0.2109, simple_loss=0.2861, pruned_loss=0.0679, over 21833.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.311, pruned_loss=0.07778, over 4275163.68 frames. ], batch size: 317, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:27:50,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=810774.0, ans=0.125 2023-06-21 17:27:58,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=810834.0, ans=0.1 2023-06-21 17:28:39,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.725e+02 3.119e+02 3.667e+02 5.511e+02, threshold=6.238e+02, percent-clipped=0.0 2023-06-21 17:28:43,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=810894.0, ans=0.0 2023-06-21 17:29:41,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=811014.0, ans=0.0 2023-06-21 17:29:44,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=811014.0, ans=0.025 2023-06-21 17:29:46,998 INFO [train.py:996] (2/4) Epoch 5, batch 13200, loss[loss=0.2371, simple_loss=0.3115, pruned_loss=0.08138, over 21948.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3116, pruned_loss=0.07918, over 4277157.97 frames. ], batch size: 372, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:30:22,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=811134.0, ans=0.0 2023-06-21 17:30:36,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=811134.0, ans=0.125 2023-06-21 17:31:29,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=811254.0, ans=0.0 2023-06-21 17:32:01,968 INFO [train.py:996] (2/4) Epoch 5, batch 13250, loss[loss=0.2501, simple_loss=0.3115, pruned_loss=0.09432, over 21508.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3106, pruned_loss=0.08017, over 4271343.39 frames. 
], batch size: 548, lr: 6.27e-03, grad_scale: 32.0 2023-06-21 17:32:34,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=811374.0, ans=0.0 2023-06-21 17:32:41,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=811434.0, ans=0.2 2023-06-21 17:33:16,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.617e+02 2.907e+02 3.504e+02 5.770e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-21 17:33:45,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=811494.0, ans=0.035 2023-06-21 17:33:54,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=811554.0, ans=0.2 2023-06-21 17:34:32,712 INFO [train.py:996] (2/4) Epoch 5, batch 13300, loss[loss=0.2348, simple_loss=0.3103, pruned_loss=0.07964, over 21828.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3144, pruned_loss=0.08007, over 4269507.68 frames. ], batch size: 298, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:35:06,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-21 17:35:40,793 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:35:56,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=811794.0, ans=0.125 2023-06-21 17:36:02,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-21 17:36:08,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=811854.0, ans=0.125 2023-06-21 17:36:49,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=811914.0, ans=0.125 2023-06-21 17:36:50,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=811914.0, ans=0.1 2023-06-21 17:36:54,454 INFO [train.py:996] (2/4) Epoch 5, batch 13350, loss[loss=0.2473, simple_loss=0.3276, pruned_loss=0.08351, over 21630.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3186, pruned_loss=0.0833, over 4273205.50 frames. ], batch size: 230, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:37:46,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=812034.0, ans=0.2 2023-06-21 17:38:06,697 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.647e+02 2.976e+02 3.366e+02 5.108e+02, threshold=5.953e+02, percent-clipped=0.0 2023-06-21 17:38:38,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=812154.0, ans=0.1 2023-06-21 17:39:07,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. 
limit=15.0 2023-06-21 17:39:08,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=812214.0, ans=0.0 2023-06-21 17:39:13,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=22.5 2023-06-21 17:39:15,666 INFO [train.py:996] (2/4) Epoch 5, batch 13400, loss[loss=0.3036, simple_loss=0.3547, pruned_loss=0.1262, over 21507.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3189, pruned_loss=0.08549, over 4277444.16 frames. ], batch size: 507, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:39:33,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=812274.0, ans=0.05 2023-06-21 17:39:33,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0 2023-06-21 17:39:38,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.77 vs. limit=15.0 2023-06-21 17:40:33,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=812394.0, ans=0.07 2023-06-21 17:40:57,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=812454.0, ans=0.125 2023-06-21 17:41:42,771 INFO [train.py:996] (2/4) Epoch 5, batch 13450, loss[loss=0.2107, simple_loss=0.278, pruned_loss=0.07167, over 21469.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3205, pruned_loss=0.088, over 4273964.11 frames. ], batch size: 194, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:41:44,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=812574.0, ans=0.125 2023-06-21 17:41:46,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-21 17:42:31,140 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.609e+02 2.992e+02 3.362e+02 4.963e+02, threshold=5.984e+02, percent-clipped=0.0 2023-06-21 17:42:31,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=812694.0, ans=0.125 2023-06-21 17:42:59,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=812754.0, ans=0.0 2023-06-21 17:43:47,969 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-21 17:43:58,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=812874.0, ans=0.2 2023-06-21 17:43:59,479 INFO [train.py:996] (2/4) Epoch 5, batch 13500, loss[loss=0.2178, simple_loss=0.2814, pruned_loss=0.07711, over 21365.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3117, pruned_loss=0.08514, over 4268529.38 frames. 
], batch size: 211, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:44:28,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=812934.0, ans=0.125 2023-06-21 17:45:18,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.46 vs. limit=15.0 2023-06-21 17:45:26,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=813054.0, ans=0.0 2023-06-21 17:46:05,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=813114.0, ans=0.2 2023-06-21 17:46:37,685 INFO [train.py:996] (2/4) Epoch 5, batch 13550, loss[loss=0.2168, simple_loss=0.3141, pruned_loss=0.05972, over 21400.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3138, pruned_loss=0.08356, over 4267303.66 frames. ], batch size: 194, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:46:48,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=813174.0, ans=0.1 2023-06-21 17:46:50,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=813174.0, ans=0.2 2023-06-21 17:47:28,927 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.549e+02 2.990e+02 3.504e+02 5.055e+02, threshold=5.980e+02, percent-clipped=0.0 2023-06-21 17:47:29,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=813294.0, ans=0.125 2023-06-21 17:47:31,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=813294.0, ans=15.0 2023-06-21 17:47:39,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=813354.0, ans=0.125 2023-06-21 17:47:40,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-21 17:48:45,932 INFO [train.py:996] (2/4) Epoch 5, batch 13600, loss[loss=0.2093, simple_loss=0.2846, pruned_loss=0.06699, over 21491.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.316, pruned_loss=0.08327, over 4272632.75 frames. ], batch size: 194, lr: 6.27e-03, grad_scale: 32.0 2023-06-21 17:48:46,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=813474.0, ans=0.0 2023-06-21 17:49:43,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=813594.0, ans=0.1 2023-06-21 17:50:14,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=813654.0, ans=0.125 2023-06-21 17:50:19,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-21 17:51:06,766 INFO [train.py:996] (2/4) Epoch 5, batch 13650, loss[loss=0.2099, simple_loss=0.273, pruned_loss=0.07343, over 21737.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3122, pruned_loss=0.08022, over 4266463.08 frames. 
], batch size: 112, lr: 6.27e-03, grad_scale: 32.0 2023-06-21 17:51:13,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-21 17:52:01,854 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.314e+02 2.698e+02 3.279e+02 5.824e+02, threshold=5.397e+02, percent-clipped=0.0 2023-06-21 17:52:23,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=813954.0, ans=0.2 2023-06-21 17:53:30,167 INFO [train.py:996] (2/4) Epoch 5, batch 13700, loss[loss=0.2501, simple_loss=0.3248, pruned_loss=0.08767, over 21791.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3077, pruned_loss=0.07969, over 4264439.52 frames. ], batch size: 332, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 17:54:19,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.60 vs. limit=22.5 2023-06-21 17:54:26,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-21 17:54:51,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=814254.0, ans=0.0 2023-06-21 17:55:08,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=814314.0, ans=0.125 2023-06-21 17:55:20,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-21 17:55:26,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=814314.0, ans=0.125 2023-06-21 17:55:39,232 INFO [train.py:996] (2/4) Epoch 5, batch 13750, loss[loss=0.2289, simple_loss=0.2847, pruned_loss=0.08655, over 20269.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3065, pruned_loss=0.07958, over 4266118.46 frames. ], batch size: 703, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 17:56:40,357 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.570e+02 2.907e+02 3.246e+02 5.241e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-21 17:56:41,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.97 vs. limit=15.0 2023-06-21 17:56:53,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-21 17:57:09,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=814554.0, ans=0.125 2023-06-21 17:57:54,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=814614.0, ans=0.0 2023-06-21 17:58:04,562 INFO [train.py:996] (2/4) Epoch 5, batch 13800, loss[loss=0.2332, simple_loss=0.3196, pruned_loss=0.07342, over 20767.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3095, pruned_loss=0.07875, over 4261763.64 frames. 
], batch size: 608, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 17:59:10,449 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:59:14,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0 2023-06-21 17:59:28,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=814854.0, ans=0.2 2023-06-21 17:59:38,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.04 vs. limit=6.0 2023-06-21 17:59:52,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=814914.0, ans=0.2 2023-06-21 18:00:36,378 INFO [train.py:996] (2/4) Epoch 5, batch 13850, loss[loss=0.3077, simple_loss=0.381, pruned_loss=0.1172, over 21322.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.316, pruned_loss=0.08048, over 4264495.55 frames. ], batch size: 548, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:00:39,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=814974.0, ans=0.125 2023-06-21 18:01:14,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=815034.0, ans=0.125 2023-06-21 18:01:21,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.680e+02 3.025e+02 3.461e+02 6.759e+02, threshold=6.050e+02, percent-clipped=1.0 2023-06-21 18:02:03,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=815154.0, ans=0.125 2023-06-21 18:02:28,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=815214.0, ans=0.125 2023-06-21 18:02:45,800 INFO [train.py:996] (2/4) Epoch 5, batch 13900, loss[loss=0.3122, simple_loss=0.3523, pruned_loss=0.136, over 21620.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3209, pruned_loss=0.08415, over 4272137.73 frames. ], batch size: 507, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:04:10,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=815454.0, ans=0.125 2023-06-21 18:04:59,430 INFO [train.py:996] (2/4) Epoch 5, batch 13950, loss[loss=0.3015, simple_loss=0.3501, pruned_loss=0.1265, over 21588.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3199, pruned_loss=0.08483, over 4277538.51 frames. ], batch size: 471, lr: 6.26e-03, grad_scale: 16.0 2023-06-21 18:05:40,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=815634.0, ans=0.0 2023-06-21 18:05:48,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=815634.0, ans=10.0 2023-06-21 18:05:56,658 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.602e+02 2.918e+02 3.271e+02 5.546e+02, threshold=5.836e+02, percent-clipped=0.0 2023-06-21 18:06:22,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.04 vs. 
limit=15.0 2023-06-21 18:06:52,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=815814.0, ans=0.5 2023-06-21 18:07:06,576 INFO [train.py:996] (2/4) Epoch 5, batch 14000, loss[loss=0.1999, simple_loss=0.3075, pruned_loss=0.04613, over 19769.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3161, pruned_loss=0.08221, over 4274831.04 frames. ], batch size: 702, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:07:17,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=815874.0, ans=0.125 2023-06-21 18:08:13,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=815994.0, ans=0.1 2023-06-21 18:09:05,307 INFO [train.py:996] (2/4) Epoch 5, batch 14050, loss[loss=0.1847, simple_loss=0.2522, pruned_loss=0.05861, over 21481.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3098, pruned_loss=0.07794, over 4279112.94 frames. ], batch size: 230, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:10:08,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=816294.0, ans=0.2 2023-06-21 18:10:10,812 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 2.220e+02 2.635e+02 3.062e+02 5.472e+02, threshold=5.269e+02, percent-clipped=0.0 2023-06-21 18:10:23,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=816354.0, ans=0.025 2023-06-21 18:10:28,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-21 18:11:16,446 INFO [train.py:996] (2/4) Epoch 5, batch 14100, loss[loss=0.2412, simple_loss=0.3081, pruned_loss=0.08713, over 21691.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3036, pruned_loss=0.07784, over 4272758.87 frames. ], batch size: 298, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:11:20,283 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:12:36,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=816654.0, ans=0.1 2023-06-21 18:12:51,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=816714.0, ans=0.0 2023-06-21 18:13:10,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=816714.0, ans=0.125 2023-06-21 18:13:18,686 INFO [train.py:996] (2/4) Epoch 5, batch 14150, loss[loss=0.2325, simple_loss=0.3185, pruned_loss=0.07329, over 21878.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3081, pruned_loss=0.07873, over 4279167.74 frames. ], batch size: 98, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:14:02,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.04 vs. 
limit=10.0 2023-06-21 18:14:05,090 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.286e+02 2.751e+02 3.276e+02 5.188e+02, threshold=5.503e+02, percent-clipped=0.0 2023-06-21 18:14:32,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=816954.0, ans=0.125 2023-06-21 18:14:51,764 INFO [train.py:996] (2/4) Epoch 5, batch 14200, loss[loss=0.2404, simple_loss=0.3001, pruned_loss=0.0904, over 21637.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3081, pruned_loss=0.07791, over 4267110.43 frames. ], batch size: 414, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:16:00,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=817194.0, ans=0.1 2023-06-21 18:16:39,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=817314.0, ans=0.125 2023-06-21 18:16:47,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=817314.0, ans=0.0 2023-06-21 18:16:58,891 INFO [train.py:996] (2/4) Epoch 5, batch 14250, loss[loss=0.2012, simple_loss=0.2524, pruned_loss=0.07501, over 20276.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3017, pruned_loss=0.07779, over 4261385.84 frames. ], batch size: 703, lr: 6.25e-03, grad_scale: 16.0 2023-06-21 18:17:21,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=817374.0, ans=0.1 2023-06-21 18:17:24,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=817434.0, ans=0.04949747468305833 2023-06-21 18:17:45,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-06-21 18:17:57,863 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 2.265e+02 2.747e+02 3.168e+02 5.793e+02, threshold=5.495e+02, percent-clipped=1.0 2023-06-21 18:18:15,270 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2023-06-21 18:18:26,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=817554.0, ans=0.2 2023-06-21 18:18:30,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=817614.0, ans=0.125 2023-06-21 18:18:53,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=817614.0, ans=0.125 2023-06-21 18:19:15,224 INFO [train.py:996] (2/4) Epoch 5, batch 14300, loss[loss=0.3487, simple_loss=0.4349, pruned_loss=0.1313, over 21700.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2991, pruned_loss=0.07618, over 4241915.35 frames. 
], batch size: 414, lr: 6.25e-03, grad_scale: 16.0 2023-06-21 18:20:25,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=817794.0, ans=0.0 2023-06-21 18:20:39,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=817854.0, ans=0.0 2023-06-21 18:21:34,198 INFO [train.py:996] (2/4) Epoch 5, batch 14350, loss[loss=0.213, simple_loss=0.2904, pruned_loss=0.06777, over 21882.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3075, pruned_loss=0.07808, over 4252928.03 frames. ], batch size: 316, lr: 6.25e-03, grad_scale: 16.0 2023-06-21 18:21:44,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=817974.0, ans=0.125 2023-06-21 18:21:54,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=817974.0, ans=0.125 2023-06-21 18:22:48,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.439e+02 2.916e+02 4.284e+02 1.022e+03, threshold=5.832e+02, percent-clipped=15.0 2023-06-21 18:22:52,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=818094.0, ans=0.125 2023-06-21 18:23:46,451 INFO [train.py:996] (2/4) Epoch 5, batch 14400, loss[loss=0.2228, simple_loss=0.2897, pruned_loss=0.07797, over 21827.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3053, pruned_loss=0.07813, over 4266427.53 frames. ], batch size: 351, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:23:48,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=818274.0, ans=0.0 2023-06-21 18:24:17,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=818334.0, ans=0.1 2023-06-21 18:24:28,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=818334.0, ans=0.05 2023-06-21 18:24:29,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=818334.0, ans=0.0 2023-06-21 18:24:52,907 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0 2023-06-21 18:25:47,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=818514.0, ans=0.125 2023-06-21 18:25:55,574 INFO [train.py:996] (2/4) Epoch 5, batch 14450, loss[loss=0.2133, simple_loss=0.288, pruned_loss=0.06931, over 21490.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3005, pruned_loss=0.07843, over 4263000.98 frames. 
], batch size: 131, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:26:02,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=818574.0, ans=0.0 2023-06-21 18:26:02,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=818574.0, ans=0.1 2023-06-21 18:26:41,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=818634.0, ans=0.125 2023-06-21 18:27:01,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.447e+02 2.694e+02 3.271e+02 4.968e+02, threshold=5.388e+02, percent-clipped=0.0 2023-06-21 18:27:20,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=818754.0, ans=0.125 2023-06-21 18:27:26,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=818814.0, ans=0.125 2023-06-21 18:27:53,510 INFO [train.py:996] (2/4) Epoch 5, batch 14500, loss[loss=0.2049, simple_loss=0.2697, pruned_loss=0.07005, over 21789.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2967, pruned_loss=0.078, over 4269234.10 frames. ], batch size: 316, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:29:16,730 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:29:21,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=819054.0, ans=0.0 2023-06-21 18:30:23,196 INFO [train.py:996] (2/4) Epoch 5, batch 14550, loss[loss=0.2167, simple_loss=0.301, pruned_loss=0.06622, over 19996.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3018, pruned_loss=0.07945, over 4269956.52 frames. ], batch size: 703, lr: 6.24e-03, grad_scale: 32.0 2023-06-21 18:31:05,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=819234.0, ans=0.125 2023-06-21 18:31:32,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=819294.0, ans=0.2 2023-06-21 18:31:33,099 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.710e+02 3.093e+02 3.463e+02 5.528e+02, threshold=6.187e+02, percent-clipped=2.0 2023-06-21 18:31:36,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=819294.0, ans=0.025 2023-06-21 18:32:38,579 INFO [train.py:996] (2/4) Epoch 5, batch 14600, loss[loss=0.2343, simple_loss=0.3281, pruned_loss=0.07028, over 21666.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3109, pruned_loss=0.08213, over 4274809.17 frames. 
], batch size: 263, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:32:53,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=819474.0, ans=0.125 2023-06-21 18:32:56,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=819474.0, ans=0.125 2023-06-21 18:32:57,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=819474.0, ans=0.125 2023-06-21 18:33:39,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=819594.0, ans=0.0 2023-06-21 18:33:43,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=819594.0, ans=0.0 2023-06-21 18:33:55,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.21 vs. limit=15.0 2023-06-21 18:34:13,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=819654.0, ans=0.125 2023-06-21 18:34:14,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-21 18:34:27,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=819714.0, ans=0.2 2023-06-21 18:34:57,398 INFO [train.py:996] (2/4) Epoch 5, batch 14650, loss[loss=0.1983, simple_loss=0.287, pruned_loss=0.05475, over 21687.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3133, pruned_loss=0.08127, over 4268598.99 frames. ], batch size: 263, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:35:04,606 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=15.0 2023-06-21 18:35:21,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=819834.0, ans=0.125 2023-06-21 18:35:59,438 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 2.281e+02 2.602e+02 3.168e+02 7.024e+02, threshold=5.204e+02, percent-clipped=1.0 2023-06-21 18:36:45,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=819954.0, ans=0.0 2023-06-21 18:36:52,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=820014.0, ans=0.125 2023-06-21 18:36:52,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=820014.0, ans=0.2 2023-06-21 18:37:01,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-21 18:37:05,115 INFO [train.py:996] (2/4) Epoch 5, batch 14700, loss[loss=0.195, simple_loss=0.2787, pruned_loss=0.05569, over 21346.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3051, pruned_loss=0.07562, over 4255830.56 frames. 
], batch size: 176, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:38:03,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-21 18:38:05,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=820194.0, ans=0.0 2023-06-21 18:39:07,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-21 18:39:16,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=820314.0, ans=0.0 2023-06-21 18:39:18,484 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:39:44,740 INFO [train.py:996] (2/4) Epoch 5, batch 14750, loss[loss=0.2609, simple_loss=0.3332, pruned_loss=0.09432, over 21491.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3093, pruned_loss=0.07772, over 4261913.07 frames. ], batch size: 131, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:40:36,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=820494.0, ans=0.05 2023-06-21 18:40:37,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=820494.0, ans=0.2 2023-06-21 18:40:41,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 2.601e+02 3.057e+02 3.762e+02 6.456e+02, threshold=6.114e+02, percent-clipped=6.0 2023-06-21 18:40:44,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=820494.0, ans=15.0 2023-06-21 18:40:56,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=820554.0, ans=0.125 2023-06-21 18:41:58,025 INFO [train.py:996] (2/4) Epoch 5, batch 14800, loss[loss=0.2191, simple_loss=0.3331, pruned_loss=0.0525, over 19958.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3207, pruned_loss=0.08364, over 4264499.40 frames. ], batch size: 702, lr: 6.24e-03, grad_scale: 32.0 2023-06-21 18:43:20,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=820854.0, ans=0.1 2023-06-21 18:43:53,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-21 18:44:08,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=820914.0, ans=0.09899494936611666 2023-06-21 18:44:18,453 INFO [train.py:996] (2/4) Epoch 5, batch 14850, loss[loss=0.196, simple_loss=0.2615, pruned_loss=0.06522, over 21823.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3149, pruned_loss=0.08325, over 4263815.43 frames. 
], batch size: 352, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:44:25,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=820974.0, ans=0.0 2023-06-21 18:45:45,997 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.574e+02 2.949e+02 3.565e+02 8.325e+02, threshold=5.898e+02, percent-clipped=1.0 2023-06-21 18:46:46,438 INFO [train.py:996] (2/4) Epoch 5, batch 14900, loss[loss=0.2723, simple_loss=0.3325, pruned_loss=0.106, over 21243.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3175, pruned_loss=0.08538, over 4264374.38 frames. ], batch size: 143, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:47:36,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=821334.0, ans=0.1 2023-06-21 18:47:41,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=821334.0, ans=0.125 2023-06-21 18:48:32,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=821454.0, ans=0.035 2023-06-21 18:48:32,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0 2023-06-21 18:48:40,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=821514.0, ans=0.0 2023-06-21 18:49:04,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=821574.0, ans=0.2 2023-06-21 18:49:05,467 INFO [train.py:996] (2/4) Epoch 5, batch 14950, loss[loss=0.2508, simple_loss=0.2996, pruned_loss=0.1009, over 20131.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3183, pruned_loss=0.08491, over 4262657.92 frames. ], batch size: 703, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:50:16,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=821694.0, ans=0.0 2023-06-21 18:50:23,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.55 vs. limit=10.0 2023-06-21 18:50:24,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.514e+02 2.835e+02 3.523e+02 6.432e+02, threshold=5.669e+02, percent-clipped=1.0 2023-06-21 18:50:29,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=821754.0, ans=0.125 2023-06-21 18:50:31,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=821754.0, ans=0.125 2023-06-21 18:51:26,643 INFO [train.py:996] (2/4) Epoch 5, batch 15000, loss[loss=0.2852, simple_loss=0.348, pruned_loss=0.1112, over 21765.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3201, pruned_loss=0.08638, over 4263960.26 frames. ], batch size: 441, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:51:26,644 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 18:52:15,534 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2599, simple_loss=0.3537, pruned_loss=0.08302, over 1796401.00 frames. 
2023-06-21 18:52:15,536 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 18:52:17,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=821874.0, ans=0.125 2023-06-21 18:53:18,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=821994.0, ans=0.2 2023-06-21 18:54:16,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=822114.0, ans=0.0 2023-06-21 18:54:29,552 INFO [train.py:996] (2/4) Epoch 5, batch 15050, loss[loss=0.2274, simple_loss=0.3179, pruned_loss=0.06843, over 21862.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3212, pruned_loss=0.08733, over 4266457.06 frames. ], batch size: 316, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:54:29,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=822174.0, ans=0.125 2023-06-21 18:55:41,366 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.696e+02 3.166e+02 3.893e+02 6.757e+02, threshold=6.331e+02, percent-clipped=4.0 2023-06-21 18:55:51,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-21 18:56:12,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=822354.0, ans=0.0 2023-06-21 18:56:42,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=822414.0, ans=0.1 2023-06-21 18:56:56,370 INFO [train.py:996] (2/4) Epoch 5, batch 15100, loss[loss=0.3465, simple_loss=0.3942, pruned_loss=0.1494, over 21314.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3263, pruned_loss=0.08818, over 4271682.52 frames. ], batch size: 507, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:57:44,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=822534.0, ans=0.125 2023-06-21 18:57:44,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-21 18:57:54,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=822594.0, ans=0.0 2023-06-21 18:57:54,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=822594.0, ans=0.125 2023-06-21 18:57:58,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=822594.0, ans=0.125 2023-06-21 18:58:00,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=822594.0, ans=0.0 2023-06-21 18:59:22,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-21 18:59:23,188 INFO [train.py:996] (2/4) Epoch 5, batch 15150, loss[loss=0.2111, simple_loss=0.262, pruned_loss=0.08008, over 21186.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3233, pruned_loss=0.08812, over 4266513.05 frames. 
], batch size: 548, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:59:33,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-21 18:59:37,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=822834.0, ans=0.0 2023-06-21 18:59:52,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=822834.0, ans=0.0 2023-06-21 19:00:22,613 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.670e+02 3.218e+02 3.613e+02 4.681e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 19:00:31,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-21 19:00:44,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=822954.0, ans=0.125 2023-06-21 19:01:35,531 INFO [train.py:996] (2/4) Epoch 5, batch 15200, loss[loss=0.203, simple_loss=0.2871, pruned_loss=0.05945, over 21267.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3138, pruned_loss=0.08403, over 4263160.18 frames. ], batch size: 551, lr: 6.23e-03, grad_scale: 32.0 2023-06-21 19:02:46,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-21 19:03:37,123 INFO [train.py:996] (2/4) Epoch 5, batch 15250, loss[loss=0.2019, simple_loss=0.2706, pruned_loss=0.06662, over 21676.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3068, pruned_loss=0.08202, over 4264732.98 frames. ], batch size: 282, lr: 6.23e-03, grad_scale: 32.0 2023-06-21 19:04:32,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=823494.0, ans=0.125 2023-06-21 19:04:33,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=823494.0, ans=0.125 2023-06-21 19:04:36,150 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.376e+02 2.792e+02 3.272e+02 5.081e+02, threshold=5.584e+02, percent-clipped=0.0 2023-06-21 19:04:57,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=823554.0, ans=0.125 2023-06-21 19:05:26,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=823614.0, ans=0.0 2023-06-21 19:05:50,662 INFO [train.py:996] (2/4) Epoch 5, batch 15300, loss[loss=0.272, simple_loss=0.3381, pruned_loss=0.103, over 21754.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3112, pruned_loss=0.08556, over 4263012.67 frames. 
], batch size: 332, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 19:06:15,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=823674.0, ans=0.09899494936611666 2023-06-21 19:06:17,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=823674.0, ans=15.0 2023-06-21 19:06:21,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=823734.0, ans=0.125 2023-06-21 19:06:26,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=823734.0, ans=0.05 2023-06-21 19:06:31,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=12.0 2023-06-21 19:07:06,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=823794.0, ans=0.0 2023-06-21 19:07:06,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=823794.0, ans=0.2 2023-06-21 19:08:11,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=823974.0, ans=0.125 2023-06-21 19:08:12,194 INFO [train.py:996] (2/4) Epoch 5, batch 15350, loss[loss=0.2605, simple_loss=0.3434, pruned_loss=0.08874, over 21819.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3178, pruned_loss=0.08722, over 4263001.08 frames. ], batch size: 118, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 19:09:02,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-21 19:09:18,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.632e+02 3.057e+02 3.588e+02 5.490e+02, threshold=6.113e+02, percent-clipped=0.0 2023-06-21 19:09:21,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=824154.0, ans=0.125 2023-06-21 19:09:58,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=824214.0, ans=0.1 2023-06-21 19:10:24,826 INFO [train.py:996] (2/4) Epoch 5, batch 15400, loss[loss=0.2132, simple_loss=0.2999, pruned_loss=0.06327, over 16318.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3198, pruned_loss=0.08487, over 4259525.07 frames. ], batch size: 63, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 19:11:07,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=824394.0, ans=0.125 2023-06-21 19:11:08,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=824394.0, ans=0.07 2023-06-21 19:11:18,970 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-06-21 19:11:53,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=824514.0, ans=0.125 2023-06-21 19:12:33,777 INFO [train.py:996] (2/4) Epoch 5, batch 15450, loss[loss=0.28, simple_loss=0.3565, pruned_loss=0.1018, over 20005.00 frames. 
], tot_loss[loss=0.2423, simple_loss=0.3171, pruned_loss=0.08377, over 4253810.80 frames. ], batch size: 703, lr: 6.22e-03, grad_scale: 16.0 2023-06-21 19:13:12,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=824634.0, ans=0.1 2023-06-21 19:13:23,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=824694.0, ans=0.125 2023-06-21 19:13:23,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=824694.0, ans=0.125 2023-06-21 19:13:30,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.430e+02 2.746e+02 3.207e+02 5.836e+02, threshold=5.491e+02, percent-clipped=0.0 2023-06-21 19:14:02,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=824754.0, ans=0.2 2023-06-21 19:14:37,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=824814.0, ans=0.0 2023-06-21 19:14:48,606 INFO [train.py:996] (2/4) Epoch 5, batch 15500, loss[loss=0.2026, simple_loss=0.2504, pruned_loss=0.07737, over 20838.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3183, pruned_loss=0.0839, over 4248259.22 frames. ], batch size: 611, lr: 6.22e-03, grad_scale: 16.0 2023-06-21 19:15:43,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=824994.0, ans=0.125 2023-06-21 19:15:45,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-21 19:16:42,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=825054.0, ans=0.1 2023-06-21 19:17:11,614 INFO [train.py:996] (2/4) Epoch 5, batch 15550, loss[loss=0.1802, simple_loss=0.2394, pruned_loss=0.06044, over 17191.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3144, pruned_loss=0.08152, over 4250384.69 frames. ], batch size: 68, lr: 6.22e-03, grad_scale: 16.0 2023-06-21 19:17:13,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=825174.0, ans=0.0 2023-06-21 19:17:20,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=825174.0, ans=0.0 2023-06-21 19:17:34,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=825234.0, ans=0.1 2023-06-21 19:18:22,524 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.386e+02 2.767e+02 3.218e+02 7.331e+02, threshold=5.534e+02, percent-clipped=1.0 2023-06-21 19:18:56,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=825414.0, ans=0.125 2023-06-21 19:18:58,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-21 19:19:00,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. 
limit=15.0 2023-06-21 19:19:21,996 INFO [train.py:996] (2/4) Epoch 5, batch 15600, loss[loss=0.2291, simple_loss=0.3045, pruned_loss=0.07681, over 21763.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3062, pruned_loss=0.07974, over 4251109.58 frames. ], batch size: 371, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:19:52,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=825474.0, ans=0.95 2023-06-21 19:20:09,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=825534.0, ans=0.0 2023-06-21 19:20:18,402 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:20:35,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=825594.0, ans=0.125 2023-06-21 19:20:48,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=825654.0, ans=0.125 2023-06-21 19:21:05,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=825654.0, ans=0.125 2023-06-21 19:21:09,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=825714.0, ans=0.125 2023-06-21 19:21:27,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=825714.0, ans=0.1 2023-06-21 19:21:31,869 INFO [train.py:996] (2/4) Epoch 5, batch 15650, loss[loss=0.2644, simple_loss=0.3212, pruned_loss=0.1038, over 21316.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3049, pruned_loss=0.07879, over 4255900.04 frames. ], batch size: 471, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:21:39,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=825774.0, ans=0.0 2023-06-21 19:21:57,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=825834.0, ans=0.1 2023-06-21 19:22:54,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.460e+02 2.777e+02 3.351e+02 5.058e+02, threshold=5.554e+02, percent-clipped=0.0 2023-06-21 19:23:48,493 INFO [train.py:996] (2/4) Epoch 5, batch 15700, loss[loss=0.2157, simple_loss=0.3037, pruned_loss=0.06387, over 21686.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3003, pruned_loss=0.07738, over 4257813.98 frames. 
], batch size: 332, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:24:21,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=826134.0, ans=0.125 2023-06-21 19:24:23,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=826134.0, ans=0.2 2023-06-21 19:25:29,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=826254.0, ans=0.015 2023-06-21 19:25:40,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=826254.0, ans=0.2 2023-06-21 19:25:56,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=826314.0, ans=0.125 2023-06-21 19:26:06,626 INFO [train.py:996] (2/4) Epoch 5, batch 15750, loss[loss=0.2206, simple_loss=0.2807, pruned_loss=0.08023, over 15244.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2974, pruned_loss=0.07758, over 4248426.77 frames. ], batch size: 61, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:26:14,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=22.5 2023-06-21 19:26:17,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=826374.0, ans=0.02 2023-06-21 19:27:19,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.398e+02 2.705e+02 3.127e+02 4.328e+02, threshold=5.411e+02, percent-clipped=0.0 2023-06-21 19:27:48,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=826554.0, ans=0.2 2023-06-21 19:28:07,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=826614.0, ans=0.05 2023-06-21 19:28:12,456 INFO [train.py:996] (2/4) Epoch 5, batch 15800, loss[loss=0.26, simple_loss=0.302, pruned_loss=0.109, over 21325.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2933, pruned_loss=0.07728, over 4258753.44 frames. ], batch size: 473, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:29:00,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=826734.0, ans=0.0 2023-06-21 19:29:01,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=826734.0, ans=0.2 2023-06-21 19:29:32,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=826794.0, ans=0.1 2023-06-21 19:29:55,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=826914.0, ans=0.0 2023-06-21 19:30:25,355 INFO [train.py:996] (2/4) Epoch 5, batch 15850, loss[loss=0.2597, simple_loss=0.3191, pruned_loss=0.1002, over 21474.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2978, pruned_loss=0.08023, over 4261482.55 frames. 
], batch size: 389, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:30:36,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=826974.0, ans=0.0 2023-06-21 19:31:35,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.608e+02 3.041e+02 3.663e+02 6.488e+02, threshold=6.081e+02, percent-clipped=4.0 2023-06-21 19:32:08,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=827214.0, ans=0.125 2023-06-21 19:32:24,221 INFO [train.py:996] (2/4) Epoch 5, batch 15900, loss[loss=0.2772, simple_loss=0.3392, pruned_loss=0.1076, over 21596.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2963, pruned_loss=0.07986, over 4241947.69 frames. ], batch size: 441, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:32:38,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=827334.0, ans=0.2 2023-06-21 19:32:47,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=827334.0, ans=0.035 2023-06-21 19:32:50,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=827334.0, ans=0.0 2023-06-21 19:33:18,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=827394.0, ans=0.125 2023-06-21 19:33:52,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=827454.0, ans=0.1 2023-06-21 19:33:55,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=827454.0, ans=0.0 2023-06-21 19:34:21,039 INFO [train.py:996] (2/4) Epoch 5, batch 15950, loss[loss=0.2098, simple_loss=0.2822, pruned_loss=0.06871, over 20693.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2953, pruned_loss=0.07722, over 4250276.38 frames. ], batch size: 608, lr: 6.21e-03, grad_scale: 16.0 2023-06-21 19:35:15,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=827694.0, ans=0.0 2023-06-21 19:35:37,326 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.379e+02 2.684e+02 3.180e+02 4.998e+02, threshold=5.368e+02, percent-clipped=0.0 2023-06-21 19:36:22,752 INFO [train.py:996] (2/4) Epoch 5, batch 16000, loss[loss=0.2276, simple_loss=0.3183, pruned_loss=0.06849, over 21650.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2968, pruned_loss=0.07502, over 4262713.93 frames. ], batch size: 263, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:36:30,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=827874.0, ans=0.1 2023-06-21 19:36:36,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=827874.0, ans=0.0 2023-06-21 19:36:45,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=827934.0, ans=0.125 2023-06-21 19:36:53,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. 
limit=6.0 2023-06-21 19:38:40,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2023-06-21 19:38:41,473 INFO [train.py:996] (2/4) Epoch 5, batch 16050, loss[loss=0.1777, simple_loss=0.2366, pruned_loss=0.05939, over 16942.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2994, pruned_loss=0.07363, over 4262827.35 frames. ], batch size: 63, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:39:45,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=828294.0, ans=0.125 2023-06-21 19:39:54,519 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.402e+02 2.679e+02 3.498e+02 5.563e+02, threshold=5.357e+02, percent-clipped=1.0 2023-06-21 19:40:44,448 INFO [train.py:996] (2/4) Epoch 5, batch 16100, loss[loss=0.2153, simple_loss=0.2845, pruned_loss=0.07302, over 21800.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3036, pruned_loss=0.075, over 4267452.90 frames. ], batch size: 247, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:41:05,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=828474.0, ans=0.125 2023-06-21 19:41:25,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=828534.0, ans=0.125 2023-06-21 19:41:36,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.20 vs. limit=15.0 2023-06-21 19:43:08,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.26 vs. limit=15.0 2023-06-21 19:43:08,403 INFO [train.py:996] (2/4) Epoch 5, batch 16150, loss[loss=0.2295, simple_loss=0.2978, pruned_loss=0.08066, over 21951.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.303, pruned_loss=0.07764, over 4280433.64 frames. ], batch size: 316, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:43:13,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=828774.0, ans=0.0 2023-06-21 19:43:20,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=828774.0, ans=0.0 2023-06-21 19:44:04,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-21 19:44:15,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=828894.0, ans=0.125 2023-06-21 19:44:19,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.586e+02 2.871e+02 3.411e+02 6.404e+02, threshold=5.741e+02, percent-clipped=1.0 2023-06-21 19:44:37,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-21 19:44:43,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=828954.0, ans=0.0 2023-06-21 19:44:47,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.97 vs. 
limit=10.0 2023-06-21 19:45:12,460 INFO [train.py:996] (2/4) Epoch 5, batch 16200, loss[loss=0.2559, simple_loss=0.3337, pruned_loss=0.08908, over 21347.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3079, pruned_loss=0.07976, over 4283508.30 frames. ], batch size: 548, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:45:18,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=829074.0, ans=0.0 2023-06-21 19:46:04,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=829134.0, ans=0.125 2023-06-21 19:46:38,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=829194.0, ans=0.0 2023-06-21 19:47:35,965 INFO [train.py:996] (2/4) Epoch 5, batch 16250, loss[loss=0.2361, simple_loss=0.2991, pruned_loss=0.08652, over 21356.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3068, pruned_loss=0.0791, over 4277812.02 frames. ], batch size: 471, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:48:31,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=829494.0, ans=0.125 2023-06-21 19:48:52,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=829494.0, ans=0.125 2023-06-21 19:48:54,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.259e+02 2.680e+02 3.153e+02 6.826e+02, threshold=5.361e+02, percent-clipped=1.0 2023-06-21 19:49:03,759 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:49:43,581 INFO [train.py:996] (2/4) Epoch 5, batch 16300, loss[loss=0.1999, simple_loss=0.3034, pruned_loss=0.0482, over 21176.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3025, pruned_loss=0.0749, over 4276230.55 frames. ], batch size: 548, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:50:08,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=829674.0, ans=0.0 2023-06-21 19:50:22,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=829734.0, ans=0.0 2023-06-21 19:51:32,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=829914.0, ans=0.0 2023-06-21 19:52:05,558 INFO [train.py:996] (2/4) Epoch 5, batch 16350, loss[loss=0.2553, simple_loss=0.3215, pruned_loss=0.09459, over 21347.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3041, pruned_loss=0.07686, over 4281037.39 frames. ], batch size: 176, lr: 6.20e-03, grad_scale: 32.0 2023-06-21 19:53:09,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=830034.0, ans=0.125 2023-06-21 19:53:25,388 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.400e+02 2.710e+02 3.275e+02 5.510e+02, threshold=5.421e+02, percent-clipped=1.0 2023-06-21 19:54:08,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=830214.0, ans=0.125 2023-06-21 19:54:20,641 INFO [train.py:996] (2/4) Epoch 5, batch 16400, loss[loss=0.2289, simple_loss=0.3052, pruned_loss=0.07624, over 21879.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3094, pruned_loss=0.07937, over 4282769.28 frames. 
], batch size: 118, lr: 6.20e-03, grad_scale: 32.0 2023-06-21 19:55:02,562 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:55:21,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=830334.0, ans=0.0 2023-06-21 19:55:38,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=830394.0, ans=0.0 2023-06-21 19:55:43,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.13 vs. limit=6.0 2023-06-21 19:56:39,866 INFO [train.py:996] (2/4) Epoch 5, batch 16450, loss[loss=0.2184, simple_loss=0.2939, pruned_loss=0.07151, over 21664.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3083, pruned_loss=0.07994, over 4292297.31 frames. ], batch size: 263, lr: 6.20e-03, grad_scale: 32.0 2023-06-21 19:57:57,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.625e+02 2.903e+02 3.534e+02 6.213e+02, threshold=5.806e+02, percent-clipped=3.0 2023-06-21 19:58:47,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=830814.0, ans=0.2 2023-06-21 19:58:58,126 INFO [train.py:996] (2/4) Epoch 5, batch 16500, loss[loss=0.2275, simple_loss=0.2966, pruned_loss=0.07916, over 21852.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3077, pruned_loss=0.08058, over 4289681.63 frames. ], batch size: 316, lr: 6.20e-03, grad_scale: 32.0 2023-06-21 20:00:38,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=831054.0, ans=0.125 2023-06-21 20:00:51,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=831114.0, ans=0.07 2023-06-21 20:00:55,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=831114.0, ans=0.125 2023-06-21 20:01:20,114 INFO [train.py:996] (2/4) Epoch 5, batch 16550, loss[loss=0.2086, simple_loss=0.2852, pruned_loss=0.06598, over 21512.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3052, pruned_loss=0.07759, over 4285891.10 frames. ], batch size: 194, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 20:01:26,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=831174.0, ans=0.125 2023-06-21 20:02:09,912 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. 
limit=10.0 2023-06-21 20:02:10,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=831234.0, ans=0.125 2023-06-21 20:02:39,889 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.770e+02 3.273e+02 4.136e+02 6.995e+02, threshold=6.546e+02, percent-clipped=5.0 2023-06-21 20:02:59,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=831354.0, ans=0.125 2023-06-21 20:03:47,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=831474.0, ans=0.0 2023-06-21 20:03:48,555 INFO [train.py:996] (2/4) Epoch 5, batch 16600, loss[loss=0.2627, simple_loss=0.342, pruned_loss=0.09164, over 21294.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.313, pruned_loss=0.08068, over 4281403.03 frames. ], batch size: 548, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 20:04:06,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=831474.0, ans=0.125 2023-06-21 20:04:34,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-21 20:05:51,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=831714.0, ans=0.125 2023-06-21 20:06:01,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-21 20:06:09,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.54 vs. limit=15.0 2023-06-21 20:06:10,106 INFO [train.py:996] (2/4) Epoch 5, batch 16650, loss[loss=0.2794, simple_loss=0.3455, pruned_loss=0.1067, over 21234.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3212, pruned_loss=0.0831, over 4276625.59 frames. ], batch size: 143, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 20:06:32,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=831834.0, ans=0.125 2023-06-21 20:06:57,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=831894.0, ans=0.125 2023-06-21 20:07:21,253 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.847e+02 3.289e+02 3.856e+02 6.866e+02, threshold=6.578e+02, percent-clipped=1.0 2023-06-21 20:07:58,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-21 20:08:19,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=832014.0, ans=0.0 2023-06-21 20:08:20,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=832014.0, ans=0.0 2023-06-21 20:08:28,176 INFO [train.py:996] (2/4) Epoch 5, batch 16700, loss[loss=0.1723, simple_loss=0.2296, pruned_loss=0.05748, over 21059.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3211, pruned_loss=0.08383, over 4269240.72 frames. 
], batch size: 143, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 20:08:52,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=832134.0, ans=0.125 2023-06-21 20:09:23,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=832194.0, ans=0.0 2023-06-21 20:10:49,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=832314.0, ans=0.0 2023-06-21 20:10:56,350 INFO [train.py:996] (2/4) Epoch 5, batch 16750, loss[loss=0.2531, simple_loss=0.3208, pruned_loss=0.09265, over 21328.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3247, pruned_loss=0.08561, over 4268823.07 frames. ], batch size: 176, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 20:11:28,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=832374.0, ans=0.0 2023-06-21 20:11:28,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=832374.0, ans=0.2 2023-06-21 20:12:05,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=832434.0, ans=0.125 2023-06-21 20:12:20,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=832494.0, ans=0.125 2023-06-21 20:12:26,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=832494.0, ans=0.125 2023-06-21 20:12:28,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=12.0 2023-06-21 20:12:34,204 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.676e+02 3.031e+02 3.509e+02 7.132e+02, threshold=6.063e+02, percent-clipped=1.0 2023-06-21 20:13:37,286 INFO [train.py:996] (2/4) Epoch 5, batch 16800, loss[loss=0.2285, simple_loss=0.304, pruned_loss=0.07649, over 21841.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3274, pruned_loss=0.086, over 4270891.62 frames. ], batch size: 298, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 20:15:19,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=832914.0, ans=0.125 2023-06-21 20:15:58,069 INFO [train.py:996] (2/4) Epoch 5, batch 16850, loss[loss=0.2305, simple_loss=0.2961, pruned_loss=0.08247, over 21874.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3228, pruned_loss=0.08613, over 4280557.65 frames. 
], batch size: 351, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 20:16:42,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=833034.0, ans=0.0 2023-06-21 20:17:05,599 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.685e+02 3.021e+02 3.703e+02 6.356e+02, threshold=6.041e+02, percent-clipped=1.0 2023-06-21 20:17:11,546 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:17:12,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=833154.0, ans=0.2 2023-06-21 20:17:13,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-21 20:17:34,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=22.5 2023-06-21 20:17:46,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=833214.0, ans=0.0 2023-06-21 20:17:56,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=833214.0, ans=0.125 2023-06-21 20:18:07,190 INFO [train.py:996] (2/4) Epoch 5, batch 16900, loss[loss=0.2609, simple_loss=0.3744, pruned_loss=0.07369, over 20738.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3178, pruned_loss=0.08501, over 4281984.73 frames. ], batch size: 607, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 20:18:42,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=833274.0, ans=0.0 2023-06-21 20:19:26,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=833454.0, ans=0.0 2023-06-21 20:19:28,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=833454.0, ans=0.0 2023-06-21 20:20:22,845 INFO [train.py:996] (2/4) Epoch 5, batch 16950, loss[loss=0.2396, simple_loss=0.2993, pruned_loss=0.08997, over 21393.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3124, pruned_loss=0.08361, over 4283659.09 frames. ], batch size: 159, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 20:21:28,768 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:21:33,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=833694.0, ans=0.0 2023-06-21 20:21:36,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-06-21 20:21:46,292 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.433e+02 2.745e+02 3.481e+02 5.788e+02, threshold=5.489e+02, percent-clipped=0.0 2023-06-21 20:22:52,011 INFO [train.py:996] (2/4) Epoch 5, batch 17000, loss[loss=0.2273, simple_loss=0.2956, pruned_loss=0.07954, over 21860.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3099, pruned_loss=0.08351, over 4288328.85 frames. 
], batch size: 298, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 20:23:14,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=833874.0, ans=0.125 2023-06-21 20:23:37,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=833934.0, ans=0.125 2023-06-21 20:23:43,720 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:23:50,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=833994.0, ans=0.125 2023-06-21 20:24:03,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-21 20:24:32,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=834054.0, ans=0.125 2023-06-21 20:25:07,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=834114.0, ans=0.125 2023-06-21 20:25:19,785 INFO [train.py:996] (2/4) Epoch 5, batch 17050, loss[loss=0.2664, simple_loss=0.3476, pruned_loss=0.09256, over 21390.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3179, pruned_loss=0.08599, over 4296259.52 frames. ], batch size: 548, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 20:25:41,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=22.5 2023-06-21 20:26:22,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=834354.0, ans=0.125 2023-06-21 20:26:25,074 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.807e+02 3.360e+02 4.026e+02 6.543e+02, threshold=6.720e+02, percent-clipped=2.0 2023-06-21 20:26:28,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=834354.0, ans=0.125 2023-06-21 20:26:29,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=834354.0, ans=0.2 2023-06-21 20:26:39,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=834354.0, ans=0.04949747468305833 2023-06-21 20:26:47,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-21 20:27:33,437 INFO [train.py:996] (2/4) Epoch 5, batch 17100, loss[loss=0.2399, simple_loss=0.3085, pruned_loss=0.0856, over 21773.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3163, pruned_loss=0.0862, over 4298290.06 frames. 
], batch size: 389, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 20:28:17,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=834594.0, ans=0.1 2023-06-21 20:29:11,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=834714.0, ans=0.125 2023-06-21 20:29:14,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=834714.0, ans=0.2 2023-06-21 20:29:47,409 INFO [train.py:996] (2/4) Epoch 5, batch 17150, loss[loss=0.2066, simple_loss=0.2741, pruned_loss=0.0696, over 21326.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3113, pruned_loss=0.08402, over 4288432.44 frames. ], batch size: 143, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 20:30:38,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=834894.0, ans=0.0 2023-06-21 20:31:03,890 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.351e+02 2.648e+02 3.061e+02 4.435e+02, threshold=5.296e+02, percent-clipped=0.0 2023-06-21 20:32:11,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-21 20:32:12,205 INFO [train.py:996] (2/4) Epoch 5, batch 17200, loss[loss=0.2499, simple_loss=0.3526, pruned_loss=0.07363, over 19741.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3124, pruned_loss=0.08466, over 4293043.79 frames. ], batch size: 703, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 20:32:14,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=835074.0, ans=0.0 2023-06-21 20:32:18,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=835074.0, ans=0.0 2023-06-21 20:32:20,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=835074.0, ans=0.1 2023-06-21 20:34:00,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=835254.0, ans=0.1 2023-06-21 20:34:22,683 INFO [train.py:996] (2/4) Epoch 5, batch 17250, loss[loss=0.3616, simple_loss=0.4007, pruned_loss=0.1613, over 21316.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3159, pruned_loss=0.08595, over 4287341.71 frames. ], batch size: 507, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 20:34:42,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=835374.0, ans=0.0 2023-06-21 20:34:43,150 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-21 20:35:16,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=835434.0, ans=0.125 2023-06-21 20:35:39,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.17 vs. 
limit=15.0 2023-06-21 20:35:51,571 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.701e+02 3.017e+02 3.644e+02 7.802e+02, threshold=6.033e+02, percent-clipped=3.0 2023-06-21 20:36:42,222 INFO [train.py:996] (2/4) Epoch 5, batch 17300, loss[loss=0.2687, simple_loss=0.3462, pruned_loss=0.09556, over 21522.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3233, pruned_loss=0.08948, over 4284331.79 frames. ], batch size: 112, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 20:37:40,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=835734.0, ans=0.2 2023-06-21 20:38:19,897 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:38:42,757 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-06-21 20:39:02,700 INFO [train.py:996] (2/4) Epoch 5, batch 17350, loss[loss=0.2344, simple_loss=0.3366, pruned_loss=0.06609, over 21282.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3229, pruned_loss=0.08847, over 4285371.97 frames. ], batch size: 548, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 20:40:45,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=836154.0, ans=0.125 2023-06-21 20:40:49,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.583e+02 2.886e+02 3.231e+02 4.631e+02, threshold=5.772e+02, percent-clipped=0.0 2023-06-21 20:41:32,767 INFO [train.py:996] (2/4) Epoch 5, batch 17400, loss[loss=0.2125, simple_loss=0.2887, pruned_loss=0.06816, over 21710.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3183, pruned_loss=0.08499, over 4275675.39 frames. ], batch size: 247, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 20:44:15,071 INFO [train.py:996] (2/4) Epoch 5, batch 17450, loss[loss=0.1923, simple_loss=0.2769, pruned_loss=0.05384, over 21369.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3155, pruned_loss=0.08246, over 4272650.33 frames. ], batch size: 211, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 20:45:09,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=836694.0, ans=0.1 2023-06-21 20:45:13,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=836694.0, ans=0.07 2023-06-21 20:45:30,571 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.434e+02 2.809e+02 3.515e+02 5.944e+02, threshold=5.617e+02, percent-clipped=1.0 2023-06-21 20:46:17,027 INFO [train.py:996] (2/4) Epoch 5, batch 17500, loss[loss=0.2601, simple_loss=0.3305, pruned_loss=0.0948, over 21871.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3125, pruned_loss=0.08024, over 4275443.21 frames. ], batch size: 124, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 20:46:19,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-21 20:46:40,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=836874.0, ans=0.07 2023-06-21 20:47:11,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. 
limit=15.0 2023-06-21 20:47:23,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=836994.0, ans=0.2 2023-06-21 20:47:56,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-21 20:48:30,508 INFO [train.py:996] (2/4) Epoch 5, batch 17550, loss[loss=0.2193, simple_loss=0.3156, pruned_loss=0.06147, over 21371.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3108, pruned_loss=0.07906, over 4278111.35 frames. ], batch size: 131, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 20:48:34,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.44 vs. limit=15.0 2023-06-21 20:49:00,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=837234.0, ans=0.1 2023-06-21 20:49:08,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=837234.0, ans=0.125 2023-06-21 20:49:11,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=837234.0, ans=0.2 2023-06-21 20:49:50,420 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.518e+02 2.821e+02 3.568e+02 5.525e+02, threshold=5.643e+02, percent-clipped=0.0 2023-06-21 20:50:45,974 INFO [train.py:996] (2/4) Epoch 5, batch 17600, loss[loss=0.2417, simple_loss=0.3171, pruned_loss=0.08314, over 21733.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3124, pruned_loss=0.07865, over 4276693.52 frames. ], batch size: 298, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 20:51:21,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=837534.0, ans=0.0 2023-06-21 20:51:27,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-21 20:52:42,137 INFO [train.py:996] (2/4) Epoch 5, batch 17650, loss[loss=0.2127, simple_loss=0.2918, pruned_loss=0.06683, over 21706.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3108, pruned_loss=0.07868, over 4268393.38 frames. 
], batch size: 391, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 20:53:25,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=837834.0, ans=0.025 2023-06-21 20:53:51,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=837954.0, ans=0.05 2023-06-21 20:54:08,338 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.478e+02 2.944e+02 3.611e+02 6.334e+02, threshold=5.887e+02, percent-clipped=3.0 2023-06-21 20:54:13,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=837954.0, ans=0.0 2023-06-21 20:54:28,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=838014.0, ans=0.125 2023-06-21 20:54:35,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=838014.0, ans=0.1 2023-06-21 20:54:41,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=838014.0, ans=0.125 2023-06-21 20:54:56,280 INFO [train.py:996] (2/4) Epoch 5, batch 17700, loss[loss=0.2354, simple_loss=0.3151, pruned_loss=0.0779, over 20776.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3053, pruned_loss=0.07633, over 4261530.94 frames. ], batch size: 607, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 20:55:02,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-21 20:55:24,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=838134.0, ans=0.5 2023-06-21 20:55:33,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=838134.0, ans=10.0 2023-06-21 20:56:22,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-06-21 20:56:40,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=838254.0, ans=0.1 2023-06-21 20:57:13,447 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:57:30,208 INFO [train.py:996] (2/4) Epoch 5, batch 17750, loss[loss=0.2237, simple_loss=0.3122, pruned_loss=0.0676, over 20733.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3111, pruned_loss=0.07903, over 4259384.96 frames. 
], batch size: 607, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 20:57:33,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=838374.0, ans=0.0 2023-06-21 20:57:35,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=838374.0, ans=6.0 2023-06-21 20:57:39,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=838374.0, ans=0.0 2023-06-21 20:57:42,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=838374.0, ans=0.1 2023-06-21 20:58:05,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=838434.0, ans=0.2 2023-06-21 20:58:39,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-21 20:58:57,743 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.544e+02 3.029e+02 3.560e+02 6.664e+02, threshold=6.058e+02, percent-clipped=1.0 2023-06-21 20:59:12,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=838554.0, ans=0.0 2023-06-21 20:59:28,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=838614.0, ans=0.125 2023-06-21 20:59:49,792 INFO [train.py:996] (2/4) Epoch 5, batch 17800, loss[loss=0.1911, simple_loss=0.2621, pruned_loss=0.0601, over 21299.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3098, pruned_loss=0.07833, over 4260158.20 frames. ], batch size: 176, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 21:00:11,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=838734.0, ans=0.2 2023-06-21 21:01:15,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=838854.0, ans=0.125 2023-06-21 21:01:33,318 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-21 21:01:45,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=838914.0, ans=0.125 2023-06-21 21:01:58,536 INFO [train.py:996] (2/4) Epoch 5, batch 17850, loss[loss=0.2647, simple_loss=0.3476, pruned_loss=0.09086, over 20660.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3123, pruned_loss=0.07966, over 4268921.88 frames. 
], batch size: 607, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 21:02:35,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=839034.0, ans=0.125 2023-06-21 21:02:35,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=839034.0, ans=0.125 2023-06-21 21:03:03,457 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:03:43,551 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.616e+02 2.932e+02 3.373e+02 4.756e+02, threshold=5.865e+02, percent-clipped=0.0 2023-06-21 21:04:31,053 INFO [train.py:996] (2/4) Epoch 5, batch 17900, loss[loss=0.2358, simple_loss=0.3292, pruned_loss=0.07117, over 21868.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3179, pruned_loss=0.0822, over 4261525.84 frames. ], batch size: 316, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 21:04:52,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=839274.0, ans=0.0 2023-06-21 21:04:55,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-21 21:05:16,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=839334.0, ans=0.95 2023-06-21 21:06:12,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=839394.0, ans=0.1 2023-06-21 21:06:18,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=839454.0, ans=0.0 2023-06-21 21:07:04,357 INFO [train.py:996] (2/4) Epoch 5, batch 17950, loss[loss=0.1833, simple_loss=0.2805, pruned_loss=0.04308, over 21795.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3155, pruned_loss=0.07792, over 4262337.62 frames. ], batch size: 351, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 21:08:30,048 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 2.221e+02 2.558e+02 3.024e+02 5.288e+02, threshold=5.115e+02, percent-clipped=0.0 2023-06-21 21:09:04,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=839814.0, ans=0.2 2023-06-21 21:09:18,126 INFO [train.py:996] (2/4) Epoch 5, batch 18000, loss[loss=0.1975, simple_loss=0.2502, pruned_loss=0.07239, over 21317.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3082, pruned_loss=0.07671, over 4265141.86 frames. ], batch size: 551, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 21:09:18,126 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 21:10:07,768 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2683, simple_loss=0.365, pruned_loss=0.08582, over 1796401.00 frames. 2023-06-21 21:10:07,771 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 21:12:14,557 INFO [train.py:996] (2/4) Epoch 5, batch 18050, loss[loss=0.2295, simple_loss=0.2986, pruned_loss=0.08016, over 21729.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3044, pruned_loss=0.07733, over 4259683.60 frames. 
], batch size: 333, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 21:12:16,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=840174.0, ans=0.0 2023-06-21 21:12:19,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=840174.0, ans=10.0 2023-06-21 21:12:46,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=840234.0, ans=0.0 2023-06-21 21:12:50,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=840234.0, ans=0.1 2023-06-21 21:12:56,296 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:13:30,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-21 21:13:37,666 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.366e+02 2.744e+02 3.197e+02 4.866e+02, threshold=5.489e+02, percent-clipped=0.0 2023-06-21 21:14:15,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=840414.0, ans=0.0 2023-06-21 21:14:24,272 INFO [train.py:996] (2/4) Epoch 5, batch 18100, loss[loss=0.2344, simple_loss=0.3319, pruned_loss=0.06842, over 21678.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.31, pruned_loss=0.07935, over 4269886.30 frames. ], batch size: 247, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 21:14:24,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=840474.0, ans=0.2 2023-06-21 21:14:48,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=840474.0, ans=0.2 2023-06-21 21:14:55,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-21 21:15:03,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-21 21:15:34,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=840594.0, ans=0.2 2023-06-21 21:16:32,820 INFO [train.py:996] (2/4) Epoch 5, batch 18150, loss[loss=0.2069, simple_loss=0.2725, pruned_loss=0.07065, over 21742.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3097, pruned_loss=0.07831, over 4257615.78 frames. 
], batch size: 112, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 21:16:56,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=840774.0, ans=0.0 2023-06-21 21:17:19,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=840894.0, ans=0.125 2023-06-21 21:17:52,206 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.463e+02 2.724e+02 3.140e+02 4.706e+02, threshold=5.448e+02, percent-clipped=0.0 2023-06-21 21:17:57,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=840954.0, ans=0.1 2023-06-21 21:18:22,758 INFO [train.py:996] (2/4) Epoch 5, batch 18200, loss[loss=0.1959, simple_loss=0.2644, pruned_loss=0.06373, over 21496.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3031, pruned_loss=0.07782, over 4259629.72 frames. ], batch size: 230, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 21:19:04,330 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5 2023-06-21 21:20:10,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=841314.0, ans=0.125 2023-06-21 21:20:36,979 INFO [train.py:996] (2/4) Epoch 5, batch 18250, loss[loss=0.1728, simple_loss=0.2449, pruned_loss=0.05035, over 21518.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2969, pruned_loss=0.07601, over 4260011.52 frames. ], batch size: 212, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 21:21:37,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=22.5 2023-06-21 21:22:05,472 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 2.405e+02 2.752e+02 3.303e+02 7.064e+02, threshold=5.504e+02, percent-clipped=4.0 2023-06-21 21:22:35,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=841614.0, ans=0.0 2023-06-21 21:22:42,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-21 21:22:45,505 INFO [train.py:996] (2/4) Epoch 5, batch 18300, loss[loss=0.2321, simple_loss=0.3338, pruned_loss=0.0652, over 21433.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2981, pruned_loss=0.07627, over 4263419.94 frames. ], batch size: 211, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 21:23:07,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=841734.0, ans=0.0 2023-06-21 21:24:44,380 INFO [train.py:996] (2/4) Epoch 5, batch 18350, loss[loss=0.2007, simple_loss=0.2687, pruned_loss=0.06634, over 21651.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3016, pruned_loss=0.07569, over 4262735.20 frames. ], batch size: 247, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 21:25:30,501 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.57 vs. 
limit=6.0 2023-06-21 21:25:37,766 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:25:49,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=22.5 2023-06-21 21:25:57,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=842094.0, ans=15.0 2023-06-21 21:26:05,183 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.529e+02 3.098e+02 3.873e+02 6.507e+02, threshold=6.195e+02, percent-clipped=4.0 2023-06-21 21:27:08,524 INFO [train.py:996] (2/4) Epoch 5, batch 18400, loss[loss=0.2373, simple_loss=0.3259, pruned_loss=0.07429, over 21057.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2971, pruned_loss=0.07445, over 4263911.92 frames. ], batch size: 607, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 21:27:44,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=842334.0, ans=0.125 2023-06-21 21:28:03,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=842394.0, ans=10.0 2023-06-21 21:29:21,806 INFO [train.py:996] (2/4) Epoch 5, batch 18450, loss[loss=0.2458, simple_loss=0.3475, pruned_loss=0.07204, over 19863.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2956, pruned_loss=0.0714, over 4265637.71 frames. ], batch size: 702, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 21:30:35,669 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.187e+02 2.513e+02 3.081e+02 4.801e+02, threshold=5.026e+02, percent-clipped=0.0 2023-06-21 21:30:51,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=842814.0, ans=0.2 2023-06-21 21:31:16,296 INFO [train.py:996] (2/4) Epoch 5, batch 18500, loss[loss=0.1995, simple_loss=0.2654, pruned_loss=0.06675, over 21507.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2898, pruned_loss=0.06913, over 4252207.67 frames. ], batch size: 212, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 21:32:33,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=842994.0, ans=0.2 2023-06-21 21:32:37,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=843054.0, ans=0.125 2023-06-21 21:32:39,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=12.0 2023-06-21 21:32:55,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=843054.0, ans=0.125 2023-06-21 21:33:38,476 INFO [train.py:996] (2/4) Epoch 5, batch 18550, loss[loss=0.1895, simple_loss=0.2723, pruned_loss=0.05333, over 21570.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2866, pruned_loss=0.06853, over 4252578.45 frames. 
], batch size: 230, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 21:33:38,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=843174.0, ans=0.0 2023-06-21 21:34:03,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=843174.0, ans=0.125 2023-06-21 21:34:44,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-21 21:35:03,986 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.298e+02 2.531e+02 2.916e+02 4.382e+02, threshold=5.063e+02, percent-clipped=0.0 2023-06-21 21:35:39,105 INFO [train.py:996] (2/4) Epoch 5, batch 18600, loss[loss=0.176, simple_loss=0.2424, pruned_loss=0.05476, over 21853.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2845, pruned_loss=0.06918, over 4238648.53 frames. ], batch size: 107, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 21:35:54,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=22.5 2023-06-21 21:35:55,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.31 vs. limit=12.0 2023-06-21 21:36:55,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=843654.0, ans=0.0 2023-06-21 21:37:38,655 INFO [train.py:996] (2/4) Epoch 5, batch 18650, loss[loss=0.2092, simple_loss=0.2788, pruned_loss=0.06978, over 21287.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2848, pruned_loss=0.07021, over 4235856.38 frames. ], batch size: 131, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 21:37:41,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.05 vs. limit=6.0 2023-06-21 21:38:31,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=843834.0, ans=0.125 2023-06-21 21:38:37,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.19 vs. limit=10.0 2023-06-21 21:39:04,239 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.384e+02 2.688e+02 3.191e+02 3.999e+02, threshold=5.375e+02, percent-clipped=0.0 2023-06-21 21:39:46,844 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-21 21:39:50,080 INFO [train.py:996] (2/4) Epoch 5, batch 18700, loss[loss=0.2018, simple_loss=0.2628, pruned_loss=0.0704, over 21475.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2825, pruned_loss=0.0714, over 4243325.69 frames. ], batch size: 212, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 21:41:10,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=844254.0, ans=0.125 2023-06-21 21:41:58,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=844314.0, ans=0.125 2023-06-21 21:42:07,685 INFO [train.py:996] (2/4) Epoch 5, batch 18750, loss[loss=0.2712, simple_loss=0.3393, pruned_loss=0.1016, over 21901.00 frames. 
], tot_loss[loss=0.2176, simple_loss=0.2861, pruned_loss=0.07461, over 4254115.40 frames. ], batch size: 372, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 21:43:00,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=844434.0, ans=0.125 2023-06-21 21:43:06,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.08 vs. limit=15.0 2023-06-21 21:43:41,301 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 2.502e+02 2.880e+02 3.663e+02 5.516e+02, threshold=5.761e+02, percent-clipped=2.0 2023-06-21 21:44:30,213 INFO [train.py:996] (2/4) Epoch 5, batch 18800, loss[loss=0.2694, simple_loss=0.3502, pruned_loss=0.09424, over 20719.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2927, pruned_loss=0.07604, over 4253720.85 frames. ], batch size: 607, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 21:44:45,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-06-21 21:46:28,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=844914.0, ans=0.0 2023-06-21 21:46:32,376 INFO [train.py:996] (2/4) Epoch 5, batch 18850, loss[loss=0.213, simple_loss=0.3146, pruned_loss=0.05573, over 21269.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2898, pruned_loss=0.07142, over 4259635.84 frames. ], batch size: 548, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 21:47:46,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=845154.0, ans=0.125 2023-06-21 21:48:09,535 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 2.173e+02 2.548e+02 3.090e+02 5.403e+02, threshold=5.096e+02, percent-clipped=0.0 2023-06-21 21:48:45,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=845214.0, ans=0.0 2023-06-21 21:48:47,485 INFO [train.py:996] (2/4) Epoch 5, batch 18900, loss[loss=0.1934, simple_loss=0.2672, pruned_loss=0.05979, over 21369.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2862, pruned_loss=0.07146, over 4258818.15 frames. ], batch size: 211, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 21:48:59,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=845274.0, ans=0.0 2023-06-21 21:49:23,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-21 21:49:37,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=845394.0, ans=10.0 2023-06-21 21:50:18,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=845454.0, ans=0.0 2023-06-21 21:50:18,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=845454.0, ans=0.2 2023-06-21 21:50:50,446 INFO [train.py:996] (2/4) Epoch 5, batch 18950, loss[loss=0.317, simple_loss=0.3943, pruned_loss=0.1198, over 21612.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2884, pruned_loss=0.07379, over 4267679.00 frames. 
], batch size: 508, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 21:51:18,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=845574.0, ans=0.125 2023-06-21 21:52:52,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.414e+02 2.698e+02 3.115e+02 5.010e+02, threshold=5.395e+02, percent-clipped=0.0 2023-06-21 21:53:05,501 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:53:21,635 INFO [train.py:996] (2/4) Epoch 5, batch 19000, loss[loss=0.2899, simple_loss=0.353, pruned_loss=0.1134, over 21749.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2978, pruned_loss=0.07618, over 4267294.19 frames. ], batch size: 441, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 21:53:46,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=845934.0, ans=0.125 2023-06-21 21:54:08,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=845994.0, ans=0.125 2023-06-21 21:55:00,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=846054.0, ans=0.125 2023-06-21 21:55:29,404 INFO [train.py:996] (2/4) Epoch 5, batch 19050, loss[loss=0.2728, simple_loss=0.3435, pruned_loss=0.1011, over 21801.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3042, pruned_loss=0.07994, over 4281214.27 frames. ], batch size: 414, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 21:55:47,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-21 21:56:27,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=846234.0, ans=0.0 2023-06-21 21:56:45,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-21 21:57:00,291 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.693e+02 3.057e+02 3.649e+02 5.381e+02, threshold=6.114e+02, percent-clipped=0.0 2023-06-21 21:57:02,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=846354.0, ans=0.04949747468305833 2023-06-21 21:57:43,805 INFO [train.py:996] (2/4) Epoch 5, batch 19100, loss[loss=0.2223, simple_loss=0.2886, pruned_loss=0.07802, over 21988.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3022, pruned_loss=0.08066, over 4275039.48 frames. ], batch size: 103, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 21:58:10,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=846534.0, ans=0.125 2023-06-21 21:58:35,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-21 21:59:28,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=846714.0, ans=0.0 2023-06-21 22:00:10,247 INFO [train.py:996] (2/4) Epoch 5, batch 19150, loss[loss=0.2529, simple_loss=0.3395, pruned_loss=0.08318, over 21621.00 frames. 
], tot_loss[loss=0.2351, simple_loss=0.3047, pruned_loss=0.08272, over 4272671.04 frames. ], batch size: 263, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 22:00:34,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=846834.0, ans=0.1 2023-06-21 22:01:25,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=846894.0, ans=0.025 2023-06-21 22:01:26,832 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:01:31,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=846954.0, ans=0.2 2023-06-21 22:01:42,314 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.638e+02 2.878e+02 3.184e+02 5.125e+02, threshold=5.755e+02, percent-clipped=0.0 2023-06-21 22:02:02,160 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:02:02,726 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-21 22:02:21,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.75 vs. limit=6.0 2023-06-21 22:02:22,329 INFO [train.py:996] (2/4) Epoch 5, batch 19200, loss[loss=0.2568, simple_loss=0.3515, pruned_loss=0.08102, over 21774.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3138, pruned_loss=0.083, over 4270484.54 frames. ], batch size: 351, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:03:27,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-21 22:03:34,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=847194.0, ans=0.0 2023-06-21 22:04:36,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=847314.0, ans=0.125 2023-06-21 22:04:42,046 INFO [train.py:996] (2/4) Epoch 5, batch 19250, loss[loss=0.1994, simple_loss=0.2849, pruned_loss=0.05699, over 21622.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3153, pruned_loss=0.07829, over 4263164.12 frames. ], batch size: 263, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:05:31,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=847434.0, ans=0.2 2023-06-21 22:05:33,978 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-21 22:05:52,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=847494.0, ans=0.2 2023-06-21 22:06:02,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.16 vs. 
limit=12.0 2023-06-21 22:06:30,475 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.396e+02 2.706e+02 3.299e+02 5.125e+02, threshold=5.412e+02, percent-clipped=0.0 2023-06-21 22:06:41,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=847614.0, ans=0.1 2023-06-21 22:06:51,712 INFO [train.py:996] (2/4) Epoch 5, batch 19300, loss[loss=0.1992, simple_loss=0.2757, pruned_loss=0.0613, over 21312.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3114, pruned_loss=0.07638, over 4271542.96 frames. ], batch size: 159, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:06:56,753 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:07:05,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=847734.0, ans=0.0 2023-06-21 22:07:05,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=847734.0, ans=0.125 2023-06-21 22:07:44,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-21 22:08:58,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=847914.0, ans=0.1 2023-06-21 22:09:03,670 INFO [train.py:996] (2/4) Epoch 5, batch 19350, loss[loss=0.2145, simple_loss=0.3034, pruned_loss=0.06277, over 21604.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3062, pruned_loss=0.07378, over 4276193.00 frames. ], batch size: 389, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:09:13,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=847974.0, ans=0.125 2023-06-21 22:09:23,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=847974.0, ans=0.0 2023-06-21 22:09:52,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=848034.0, ans=0.125 2023-06-21 22:10:11,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-21 22:10:49,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.180e+02 2.502e+02 2.789e+02 3.930e+02, threshold=5.004e+02, percent-clipped=0.0 2023-06-21 22:11:16,305 INFO [train.py:996] (2/4) Epoch 5, batch 19400, loss[loss=0.1728, simple_loss=0.2515, pruned_loss=0.04701, over 21254.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3038, pruned_loss=0.07308, over 4284155.24 frames. ], batch size: 176, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:13:26,767 INFO [train.py:996] (2/4) Epoch 5, batch 19450, loss[loss=0.2022, simple_loss=0.259, pruned_loss=0.07266, over 21668.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2997, pruned_loss=0.07393, over 4290959.52 frames. ], batch size: 282, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:13:27,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.84 vs. 
limit=15.0 2023-06-21 22:13:28,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=848574.0, ans=0.125 2023-06-21 22:14:20,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.73 vs. limit=15.0 2023-06-21 22:14:28,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=848694.0, ans=0.125 2023-06-21 22:15:03,936 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.755e+02 3.312e+02 4.071e+02 7.217e+02, threshold=6.625e+02, percent-clipped=11.0 2023-06-21 22:15:31,727 INFO [train.py:996] (2/4) Epoch 5, batch 19500, loss[loss=0.2625, simple_loss=0.3317, pruned_loss=0.09671, over 21562.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2979, pruned_loss=0.07569, over 4287146.06 frames. ], batch size: 389, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 22:15:49,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=848874.0, ans=0.125 2023-06-21 22:16:32,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=848994.0, ans=0.125 2023-06-21 22:17:07,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=849054.0, ans=0.125 2023-06-21 22:17:21,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=849054.0, ans=0.1 2023-06-21 22:17:34,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=849114.0, ans=0.125 2023-06-21 22:17:47,553 INFO [train.py:996] (2/4) Epoch 5, batch 19550, loss[loss=0.2058, simple_loss=0.3008, pruned_loss=0.0554, over 21158.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2928, pruned_loss=0.07363, over 4289782.26 frames. ], batch size: 548, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 22:18:43,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=849294.0, ans=0.0 2023-06-21 22:19:29,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.481e+02 2.782e+02 3.315e+02 6.873e+02, threshold=5.565e+02, percent-clipped=1.0 2023-06-21 22:19:42,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=849414.0, ans=0.2 2023-06-21 22:19:59,641 INFO [train.py:996] (2/4) Epoch 5, batch 19600, loss[loss=0.2067, simple_loss=0.2807, pruned_loss=0.06634, over 21818.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2943, pruned_loss=0.075, over 4280749.69 frames. ], batch size: 298, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:21:18,439 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:21:56,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=849714.0, ans=0.1 2023-06-21 22:22:13,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. 
limit=15.0 2023-06-21 22:22:29,147 INFO [train.py:996] (2/4) Epoch 5, batch 19650, loss[loss=0.235, simple_loss=0.3009, pruned_loss=0.08454, over 21496.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3002, pruned_loss=0.07888, over 4281781.21 frames. ], batch size: 194, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:22:36,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=849774.0, ans=0.125 2023-06-21 22:23:59,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. limit=10.0 2023-06-21 22:24:12,401 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.670e+02 2.999e+02 3.349e+02 5.989e+02, threshold=5.997e+02, percent-clipped=1.0 2023-06-21 22:24:15,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=15.0 2023-06-21 22:24:57,806 INFO [train.py:996] (2/4) Epoch 5, batch 19700, loss[loss=0.2005, simple_loss=0.2561, pruned_loss=0.07248, over 21131.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3034, pruned_loss=0.07945, over 4280999.87 frames. ], batch size: 143, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:25:12,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=850134.0, ans=0.0 2023-06-21 22:25:21,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=850134.0, ans=0.1 2023-06-21 22:26:39,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=850254.0, ans=0.125 2023-06-21 22:26:49,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.01 vs. limit=22.5 2023-06-21 22:27:10,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-21 22:27:22,344 INFO [train.py:996] (2/4) Epoch 5, batch 19750, loss[loss=0.3364, simple_loss=0.3937, pruned_loss=0.1395, over 21649.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3127, pruned_loss=0.08079, over 4274816.29 frames. ], batch size: 507, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:28:27,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=850494.0, ans=0.125 2023-06-21 22:29:19,493 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.869e+02 3.746e+02 5.027e+02 9.459e+02, threshold=7.491e+02, percent-clipped=12.0 2023-06-21 22:29:30,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=850614.0, ans=0.125 2023-06-21 22:29:39,225 INFO [train.py:996] (2/4) Epoch 5, batch 19800, loss[loss=0.2767, simple_loss=0.361, pruned_loss=0.09626, over 19944.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3123, pruned_loss=0.08131, over 4272731.89 frames. 
], batch size: 702, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:29:52,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=850674.0, ans=0.1 2023-06-21 22:29:53,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=12.0 2023-06-21 22:30:57,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=850794.0, ans=0.125 2023-06-21 22:31:24,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=850854.0, ans=0.2 2023-06-21 22:31:46,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=850914.0, ans=0.125 2023-06-21 22:32:01,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=850914.0, ans=0.1 2023-06-21 22:32:11,663 INFO [train.py:996] (2/4) Epoch 5, batch 19850, loss[loss=0.1831, simple_loss=0.2624, pruned_loss=0.05196, over 21251.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3032, pruned_loss=0.07528, over 4274103.63 frames. ], batch size: 176, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:32:35,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=851034.0, ans=0.0 2023-06-21 22:32:36,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=851034.0, ans=0.1 2023-06-21 22:33:37,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.738e+02 2.104e+02 2.344e+02 2.838e+02 4.436e+02, threshold=4.687e+02, percent-clipped=0.0 2023-06-21 22:34:04,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=851214.0, ans=0.2 2023-06-21 22:34:24,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=851274.0, ans=0.125 2023-06-21 22:34:25,313 INFO [train.py:996] (2/4) Epoch 5, batch 19900, loss[loss=0.1969, simple_loss=0.2689, pruned_loss=0.06244, over 21843.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3037, pruned_loss=0.07301, over 4274026.96 frames. ], batch size: 107, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 22:34:31,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-21 22:34:32,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=851274.0, ans=0.125 2023-06-21 22:34:43,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=851274.0, ans=0.125 2023-06-21 22:34:47,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=851334.0, ans=0.125 2023-06-21 22:35:51,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. 
limit=15.0 2023-06-21 22:36:20,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=851514.0, ans=0.0 2023-06-21 22:36:28,343 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:36:29,307 INFO [train.py:996] (2/4) Epoch 5, batch 19950, loss[loss=0.2288, simple_loss=0.2927, pruned_loss=0.08248, over 21702.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2981, pruned_loss=0.07295, over 4275971.23 frames. ], batch size: 112, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 22:36:30,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-21 22:37:09,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=851634.0, ans=0.0 2023-06-21 22:37:52,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=851694.0, ans=0.07 2023-06-21 22:37:59,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=851754.0, ans=0.2 2023-06-21 22:38:05,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=851754.0, ans=0.125 2023-06-21 22:38:22,474 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.543e+02 2.962e+02 3.700e+02 5.630e+02, threshold=5.923e+02, percent-clipped=5.0 2023-06-21 22:38:52,203 INFO [train.py:996] (2/4) Epoch 5, batch 20000, loss[loss=0.2187, simple_loss=0.2904, pruned_loss=0.07355, over 21458.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3005, pruned_loss=0.07381, over 4279249.87 frames. ], batch size: 131, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 22:38:55,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=851874.0, ans=0.125 2023-06-21 22:39:11,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-21 22:39:34,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=851994.0, ans=0.125 2023-06-21 22:40:15,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=852054.0, ans=0.2 2023-06-21 22:40:51,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=852114.0, ans=0.125 2023-06-21 22:40:53,538 INFO [train.py:996] (2/4) Epoch 5, batch 20050, loss[loss=0.2416, simple_loss=0.3027, pruned_loss=0.09029, over 21562.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3014, pruned_loss=0.07599, over 4282987.27 frames. 
], batch size: 548, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 22:41:11,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=852174.0, ans=0.1 2023-06-21 22:41:23,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=852174.0, ans=0.025 2023-06-21 22:41:40,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=852294.0, ans=0.125 2023-06-21 22:42:07,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=852354.0, ans=0.0 2023-06-21 22:42:14,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=852354.0, ans=0.125 2023-06-21 22:42:24,538 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.589e+02 2.869e+02 3.352e+02 4.558e+02, threshold=5.737e+02, percent-clipped=0.0 2023-06-21 22:43:07,071 INFO [train.py:996] (2/4) Epoch 5, batch 20100, loss[loss=0.2722, simple_loss=0.3622, pruned_loss=0.09113, over 21681.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3043, pruned_loss=0.07856, over 4289966.99 frames. ], batch size: 389, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:43:07,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=852474.0, ans=0.0 2023-06-21 22:44:29,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=852594.0, ans=0.1 2023-06-21 22:45:34,426 INFO [train.py:996] (2/4) Epoch 5, batch 20150, loss[loss=0.3109, simple_loss=0.369, pruned_loss=0.1264, over 21758.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3142, pruned_loss=0.08204, over 4288799.14 frames. ], batch size: 441, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:45:36,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=852774.0, ans=0.125 2023-06-21 22:45:36,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=852774.0, ans=0.125 2023-06-21 22:47:01,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=852954.0, ans=0.2 2023-06-21 22:47:18,098 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.899e+02 3.546e+02 4.462e+02 7.672e+02, threshold=7.092e+02, percent-clipped=5.0 2023-06-21 22:47:33,780 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-21 22:47:53,461 INFO [train.py:996] (2/4) Epoch 5, batch 20200, loss[loss=0.2742, simple_loss=0.3765, pruned_loss=0.08589, over 21238.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3183, pruned_loss=0.08449, over 4282887.98 frames. 
], batch size: 549, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:48:04,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=853074.0, ans=0.125 2023-06-21 22:48:05,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=853074.0, ans=0.2 2023-06-21 22:48:52,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=853134.0, ans=0.125 2023-06-21 22:49:17,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=853254.0, ans=0.125 2023-06-21 22:49:27,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=853254.0, ans=0.125 2023-06-21 22:49:28,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=853254.0, ans=0.2 2023-06-21 22:49:43,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=853314.0, ans=0.0 2023-06-21 22:50:22,435 INFO [train.py:996] (2/4) Epoch 5, batch 20250, loss[loss=0.238, simple_loss=0.3099, pruned_loss=0.08309, over 21303.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3198, pruned_loss=0.08306, over 4283405.34 frames. ], batch size: 159, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:50:28,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=853374.0, ans=0.125 2023-06-21 22:50:44,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=853434.0, ans=0.5 2023-06-21 22:51:24,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-21 22:51:56,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.435e+02 2.804e+02 3.299e+02 5.136e+02, threshold=5.609e+02, percent-clipped=0.0 2023-06-21 22:52:26,428 INFO [train.py:996] (2/4) Epoch 5, batch 20300, loss[loss=0.1985, simple_loss=0.2728, pruned_loss=0.06208, over 21872.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3179, pruned_loss=0.0802, over 4265991.74 frames. ], batch size: 98, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:53:43,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=853854.0, ans=0.2 2023-06-21 22:54:20,957 INFO [train.py:996] (2/4) Epoch 5, batch 20350, loss[loss=0.2513, simple_loss=0.3233, pruned_loss=0.08971, over 21688.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3175, pruned_loss=0.08024, over 4263654.04 frames. 
], batch size: 389, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:54:56,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=854034.0, ans=15.0 2023-06-21 22:56:00,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=854154.0, ans=0.125 2023-06-21 22:56:00,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=854154.0, ans=0.125 2023-06-21 22:56:03,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=854154.0, ans=0.1 2023-06-21 22:56:05,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=854154.0, ans=0.0 2023-06-21 22:56:09,298 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.362e+02 2.783e+02 3.388e+02 6.347e+02, threshold=5.566e+02, percent-clipped=2.0 2023-06-21 22:56:11,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=854214.0, ans=0.125 2023-06-21 22:56:19,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=854214.0, ans=0.125 2023-06-21 22:56:42,089 INFO [train.py:996] (2/4) Epoch 5, batch 20400, loss[loss=0.2743, simple_loss=0.3332, pruned_loss=0.1077, over 21168.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3195, pruned_loss=0.08318, over 4255332.74 frames. ], batch size: 143, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 22:56:53,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=854274.0, ans=0.1 2023-06-21 22:57:13,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=854334.0, ans=0.2 2023-06-21 22:57:18,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.01 vs. limit=12.0 2023-06-21 22:57:53,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=854394.0, ans=0.125 2023-06-21 22:59:00,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.16 vs. limit=22.5 2023-06-21 22:59:03,937 INFO [train.py:996] (2/4) Epoch 5, batch 20450, loss[loss=0.2288, simple_loss=0.2961, pruned_loss=0.08079, over 21681.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3204, pruned_loss=0.08525, over 4256255.55 frames. 
], batch size: 263, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 22:59:25,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=854634.0, ans=0.1 2023-06-21 22:59:49,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=854694.0, ans=0.04949747468305833 2023-06-21 23:00:30,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=854754.0, ans=0.125 2023-06-21 23:00:31,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=854754.0, ans=0.1 2023-06-21 23:00:34,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=22.5 2023-06-21 23:00:35,308 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.525e+02 2.860e+02 3.466e+02 5.175e+02, threshold=5.721e+02, percent-clipped=0.0 2023-06-21 23:00:55,760 INFO [train.py:996] (2/4) Epoch 5, batch 20500, loss[loss=0.2199, simple_loss=0.2971, pruned_loss=0.07136, over 21467.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3159, pruned_loss=0.0852, over 4255806.16 frames. ], batch size: 131, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:01:07,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=854874.0, ans=0.04949747468305833 2023-06-21 23:01:11,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=22.5 2023-06-21 23:01:25,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=854934.0, ans=0.0 2023-06-21 23:01:28,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=854934.0, ans=0.0 2023-06-21 23:02:44,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=855114.0, ans=0.125 2023-06-21 23:03:05,511 INFO [train.py:996] (2/4) Epoch 5, batch 20550, loss[loss=0.2847, simple_loss=0.3388, pruned_loss=0.1153, over 21383.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.308, pruned_loss=0.0834, over 4239446.39 frames. ], batch size: 508, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:04:09,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=855294.0, ans=0.2 2023-06-21 23:04:34,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=855354.0, ans=0.0 2023-06-21 23:04:45,344 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.439e+02 2.815e+02 3.369e+02 5.545e+02, threshold=5.629e+02, percent-clipped=0.0 2023-06-21 23:05:16,247 INFO [train.py:996] (2/4) Epoch 5, batch 20600, loss[loss=0.2544, simple_loss=0.3232, pruned_loss=0.09284, over 21764.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3102, pruned_loss=0.08105, over 4240064.75 frames. 
], batch size: 112, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:05:32,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=855474.0, ans=0.0 2023-06-21 23:05:51,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=855534.0, ans=0.125 2023-06-21 23:06:07,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=855594.0, ans=0.035 2023-06-21 23:06:16,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=855594.0, ans=0.125 2023-06-21 23:06:19,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=855594.0, ans=0.125 2023-06-21 23:06:30,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=855654.0, ans=0.125 2023-06-21 23:06:37,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=855654.0, ans=0.125 2023-06-21 23:07:21,861 INFO [train.py:996] (2/4) Epoch 5, batch 20650, loss[loss=0.1925, simple_loss=0.2378, pruned_loss=0.07359, over 20879.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3058, pruned_loss=0.08168, over 4246393.20 frames. ], batch size: 608, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:07:24,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-21 23:07:26,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=855774.0, ans=0.125 2023-06-21 23:07:41,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=855774.0, ans=0.2 2023-06-21 23:07:42,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=855774.0, ans=0.125 2023-06-21 23:08:08,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=855834.0, ans=0.2 2023-06-21 23:09:13,969 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.383e+02 2.744e+02 3.352e+02 4.969e+02, threshold=5.489e+02, percent-clipped=0.0 2023-06-21 23:09:44,574 INFO [train.py:996] (2/4) Epoch 5, batch 20700, loss[loss=0.1908, simple_loss=0.2687, pruned_loss=0.0564, over 21421.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2995, pruned_loss=0.07879, over 4245544.64 frames. ], batch size: 211, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:09:54,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=856074.0, ans=0.015 2023-06-21 23:10:05,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=856134.0, ans=0.2 2023-06-21 23:10:07,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=856134.0, ans=0.125 2023-06-21 23:10:20,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. 
limit=15.0 2023-06-21 23:10:45,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=856194.0, ans=0.04949747468305833 2023-06-21 23:11:02,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=856254.0, ans=0.125 2023-06-21 23:11:51,755 INFO [train.py:996] (2/4) Epoch 5, batch 20750, loss[loss=0.3679, simple_loss=0.444, pruned_loss=0.1459, over 21488.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3041, pruned_loss=0.07855, over 4247971.47 frames. ], batch size: 507, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:12:30,014 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:12:30,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=856434.0, ans=0.1 2023-06-21 23:13:39,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=856554.0, ans=0.1 2023-06-21 23:13:46,180 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.828e+02 3.457e+02 4.836e+02 8.710e+02, threshold=6.913e+02, percent-clipped=21.0 2023-06-21 23:14:13,086 INFO [train.py:996] (2/4) Epoch 5, batch 20800, loss[loss=0.2074, simple_loss=0.2692, pruned_loss=0.07278, over 21238.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3058, pruned_loss=0.07909, over 4253438.65 frames. ], batch size: 549, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 23:14:34,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=856734.0, ans=0.0 2023-06-21 23:14:41,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=856734.0, ans=0.125 2023-06-21 23:14:57,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=856794.0, ans=0.0 2023-06-21 23:15:39,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=856854.0, ans=0.125 2023-06-21 23:16:24,379 INFO [train.py:996] (2/4) Epoch 5, batch 20850, loss[loss=0.1874, simple_loss=0.2625, pruned_loss=0.05617, over 21794.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2978, pruned_loss=0.07701, over 4251067.85 frames. ], batch size: 282, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 23:16:37,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=856974.0, ans=0.0 2023-06-21 23:16:46,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. 
limit=15.0 2023-06-21 23:17:01,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=857034.0, ans=0.0 2023-06-21 23:17:16,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=857094.0, ans=0.1 2023-06-21 23:17:32,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=857154.0, ans=0.0 2023-06-21 23:18:13,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.467e+02 2.845e+02 3.460e+02 6.189e+02, threshold=5.691e+02, percent-clipped=0.0 2023-06-21 23:18:40,143 INFO [train.py:996] (2/4) Epoch 5, batch 20900, loss[loss=0.2824, simple_loss=0.3512, pruned_loss=0.1068, over 21646.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2993, pruned_loss=0.07801, over 4259364.03 frames. ], batch size: 509, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:19:04,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=857334.0, ans=0.125 2023-06-21 23:19:18,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-21 23:19:22,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-21 23:20:15,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=857514.0, ans=0.125 2023-06-21 23:20:25,862 INFO [train.py:996] (2/4) Epoch 5, batch 20950, loss[loss=0.1938, simple_loss=0.2699, pruned_loss=0.05881, over 21810.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2946, pruned_loss=0.07399, over 4252482.76 frames. ], batch size: 316, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:20:52,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-21 23:21:08,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-21 23:21:45,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=857754.0, ans=0.0 2023-06-21 23:22:08,066 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.456e+02 2.757e+02 3.195e+02 6.346e+02, threshold=5.513e+02, percent-clipped=1.0 2023-06-21 23:22:08,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=857814.0, ans=0.04949747468305833 2023-06-21 23:22:39,127 INFO [train.py:996] (2/4) Epoch 5, batch 21000, loss[loss=0.2179, simple_loss=0.2867, pruned_loss=0.07457, over 21084.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2927, pruned_loss=0.07378, over 4256540.94 frames. ], batch size: 608, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:22:39,128 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 23:23:36,536 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2652, simple_loss=0.3651, pruned_loss=0.08266, over 1796401.00 frames. 
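A note on the recurring "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." lines from optim.py: in every such entry the reported threshold equals the clipping scale (2.0) times the median of the listed grad-norm quartiles, e.g. 2.0 × 2.783e+02 = 5.566e+02 in the batch 20350 entry above. The sketch below is only a hypothetical illustration of that relationship (the function name, the windowing of recent norms, and the reporting format are assumptions, not the actual icefall optim.py implementation).

```python
# Hypothetical sketch, NOT the icefall optim.py code: it only reproduces the
# relationship visible in the log, where threshold = Clipping_scale * median
# of the recent grad-norm quartiles, and percent-clipped counts how many of
# those norms exceeded the threshold.
import torch

def clipping_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    """Summarise a window of recent total gradient norms log-style.

    grad_norms: 1-D float tensor of recent per-step gradient norms (assumed).
    Returns (quartiles, threshold, percent_clipped).
    """
    # min / 25% / 50% / 75% / max, matching the five values printed in the log
    quartiles = torch.quantile(
        grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
    )
    median = quartiles[2]
    threshold = clipping_scale * median           # threshold = scale * median
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return quartiles, threshold, percent_clipped

if __name__ == "__main__":
    # Toy window of norms chosen to resemble the quartiles in the log above.
    norms = torch.tensor([189.0, 243.5, 280.4, 329.9, 513.6])
    q, thr, pct = clipping_report(norms)
    print("grad-norm quartiles", [f"{v:.3e}" for v in q.tolist()],
          f"threshold={thr:.3e}", f"percent-clipped={pct:.1f}")
```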
2023-06-21 23:23:36,537 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-21 23:24:03,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=857934.0, ans=0.125 2023-06-21 23:25:24,733 INFO [train.py:996] (2/4) Epoch 5, batch 21050, loss[loss=0.2124, simple_loss=0.2875, pruned_loss=0.06859, over 21653.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2907, pruned_loss=0.0743, over 4256923.03 frames. ], batch size: 282, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:25:55,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=858234.0, ans=0.2 2023-06-21 23:26:10,962 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-21 23:26:39,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=858354.0, ans=0.125 2023-06-21 23:26:59,620 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.489e+02 2.963e+02 3.543e+02 5.648e+02, threshold=5.927e+02, percent-clipped=1.0 2023-06-21 23:27:34,810 INFO [train.py:996] (2/4) Epoch 5, batch 21100, loss[loss=0.2168, simple_loss=0.2787, pruned_loss=0.07745, over 21583.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2875, pruned_loss=0.07435, over 4264567.98 frames. ], batch size: 332, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:27:55,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=858474.0, ans=0.125 2023-06-21 23:28:00,931 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-21 23:28:08,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=858534.0, ans=0.0 2023-06-21 23:28:09,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=858534.0, ans=0.0 2023-06-21 23:28:17,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=858594.0, ans=0.125 2023-06-21 23:28:18,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=858594.0, ans=0.04949747468305833 2023-06-21 23:28:25,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=858594.0, ans=0.0 2023-06-21 23:29:44,531 INFO [train.py:996] (2/4) Epoch 5, batch 21150, loss[loss=0.1991, simple_loss=0.2642, pruned_loss=0.067, over 21322.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2849, pruned_loss=0.07507, over 4266640.75 frames. ], batch size: 131, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:30:20,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=858834.0, ans=0.2 2023-06-21 23:31:01,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. 
limit=15.0 2023-06-21 23:31:11,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=858954.0, ans=0.125 2023-06-21 23:31:29,329 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.447e+02 2.797e+02 3.334e+02 4.948e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-21 23:31:55,143 INFO [train.py:996] (2/4) Epoch 5, batch 21200, loss[loss=0.2097, simple_loss=0.2756, pruned_loss=0.07192, over 21833.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2816, pruned_loss=0.07454, over 4254944.74 frames. ], batch size: 318, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 23:31:57,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-06-21 23:33:12,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-21 23:33:15,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=859254.0, ans=0.125 2023-06-21 23:33:33,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=859314.0, ans=0.2 2023-06-21 23:34:06,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=859374.0, ans=0.125 2023-06-21 23:34:07,350 INFO [train.py:996] (2/4) Epoch 5, batch 21250, loss[loss=0.2377, simple_loss=0.3139, pruned_loss=0.08081, over 21679.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2802, pruned_loss=0.07438, over 4257988.08 frames. ], batch size: 247, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:34:15,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=859374.0, ans=0.1 2023-06-21 23:35:32,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=859554.0, ans=0.0 2023-06-21 23:35:40,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=859614.0, ans=0.125 2023-06-21 23:35:41,302 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.492e+02 2.834e+02 3.212e+02 4.793e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-21 23:35:43,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=859614.0, ans=0.2 2023-06-21 23:36:01,187 INFO [train.py:996] (2/4) Epoch 5, batch 21300, loss[loss=0.2353, simple_loss=0.3106, pruned_loss=0.08003, over 21813.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2878, pruned_loss=0.07657, over 4258305.86 frames. 
], batch size: 298, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:36:55,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=859794.0, ans=0.0 2023-06-21 23:37:02,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=859794.0, ans=0.125 2023-06-21 23:37:43,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=859914.0, ans=0.125 2023-06-21 23:38:27,904 INFO [train.py:996] (2/4) Epoch 5, batch 21350, loss[loss=0.2284, simple_loss=0.2955, pruned_loss=0.08064, over 21685.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.292, pruned_loss=0.07655, over 4262096.12 frames. ], batch size: 263, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:38:58,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. limit=6.0 2023-06-21 23:39:13,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=860094.0, ans=0.2 2023-06-21 23:39:44,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=860154.0, ans=0.125 2023-06-21 23:39:56,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=860214.0, ans=0.125 2023-06-21 23:39:58,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.438e+02 2.773e+02 3.263e+02 5.694e+02, threshold=5.547e+02, percent-clipped=1.0 2023-06-21 23:40:26,302 INFO [train.py:996] (2/4) Epoch 5, batch 21400, loss[loss=0.253, simple_loss=0.3259, pruned_loss=0.09004, over 21756.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2941, pruned_loss=0.07596, over 4256935.66 frames. ], batch size: 332, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:40:26,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=860274.0, ans=0.025 2023-06-21 23:40:31,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=860274.0, ans=0.0 2023-06-21 23:40:41,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=860274.0, ans=0.125 2023-06-21 23:41:05,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=860334.0, ans=0.125 2023-06-21 23:41:31,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-21 23:41:34,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=860394.0, ans=0.05 2023-06-21 23:41:34,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=860394.0, ans=0.2 2023-06-21 23:41:35,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.96 vs. 
limit=15.0 2023-06-21 23:42:02,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=860454.0, ans=0.0 2023-06-21 23:42:50,929 INFO [train.py:996] (2/4) Epoch 5, batch 21450, loss[loss=0.2187, simple_loss=0.2883, pruned_loss=0.07454, over 21363.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2967, pruned_loss=0.07732, over 4261051.68 frames. ], batch size: 176, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:42:54,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=860574.0, ans=0.125 2023-06-21 23:43:10,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=860634.0, ans=0.1 2023-06-21 23:43:16,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=860634.0, ans=0.125 2023-06-21 23:43:19,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=860634.0, ans=0.0 2023-06-21 23:43:45,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=860694.0, ans=0.1 2023-06-21 23:44:01,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=860694.0, ans=0.0 2023-06-21 23:44:17,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=860754.0, ans=0.125 2023-06-21 23:44:24,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.05 vs. limit=5.0 2023-06-21 23:44:25,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-21 23:44:40,937 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.786e+02 3.330e+02 3.846e+02 5.703e+02, threshold=6.661e+02, percent-clipped=1.0 2023-06-21 23:44:55,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=860814.0, ans=0.1 2023-06-21 23:45:02,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=860814.0, ans=0.025 2023-06-21 23:45:04,777 INFO [train.py:996] (2/4) Epoch 5, batch 21500, loss[loss=0.1954, simple_loss=0.2552, pruned_loss=0.06782, over 21576.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2953, pruned_loss=0.07851, over 4261675.65 frames. ], batch size: 247, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:45:20,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-21 23:46:11,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=860994.0, ans=0.0 2023-06-21 23:47:17,731 INFO [train.py:996] (2/4) Epoch 5, batch 21550, loss[loss=0.1834, simple_loss=0.2514, pruned_loss=0.0577, over 21648.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2901, pruned_loss=0.07657, over 4270472.39 frames. 
], batch size: 282, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:47:32,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=861234.0, ans=0.125 2023-06-21 23:48:35,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=861294.0, ans=0.0 2023-06-21 23:48:38,116 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:49:12,628 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.449e+02 2.734e+02 3.185e+02 5.518e+02, threshold=5.467e+02, percent-clipped=0.0 2023-06-21 23:49:24,738 INFO [train.py:996] (2/4) Epoch 5, batch 21600, loss[loss=0.1937, simple_loss=0.2752, pruned_loss=0.05609, over 21494.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2854, pruned_loss=0.075, over 4269787.57 frames. ], batch size: 230, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:49:29,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=861474.0, ans=0.0 2023-06-21 23:50:28,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=861534.0, ans=0.1 2023-06-21 23:50:31,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=861594.0, ans=0.125 2023-06-21 23:50:38,445 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:50:59,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=861654.0, ans=0.0 2023-06-21 23:51:10,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=861654.0, ans=0.2 2023-06-21 23:51:31,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=861714.0, ans=0.025 2023-06-21 23:51:39,802 INFO [train.py:996] (2/4) Epoch 5, batch 21650, loss[loss=0.2103, simple_loss=0.296, pruned_loss=0.06232, over 21434.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.289, pruned_loss=0.07229, over 4265646.91 frames. ], batch size: 194, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:51:44,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=861774.0, ans=0.1 2023-06-21 23:51:46,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=861774.0, ans=0.125 2023-06-21 23:52:09,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=861834.0, ans=0.125 2023-06-21 23:53:26,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.37 vs. 
limit=15.0 2023-06-21 23:53:40,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=862014.0, ans=0.0 2023-06-21 23:53:43,248 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.320e+02 2.612e+02 3.014e+02 5.606e+02, threshold=5.225e+02, percent-clipped=2.0 2023-06-21 23:53:48,219 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:53:55,253 INFO [train.py:996] (2/4) Epoch 5, batch 21700, loss[loss=0.1657, simple_loss=0.2375, pruned_loss=0.047, over 17064.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2899, pruned_loss=0.07105, over 4264629.49 frames. ], batch size: 67, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:55:41,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=22.5 2023-06-21 23:55:48,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=862314.0, ans=0.125 2023-06-21 23:55:55,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=862374.0, ans=0.125 2023-06-21 23:55:56,744 INFO [train.py:996] (2/4) Epoch 5, batch 21750, loss[loss=0.2256, simple_loss=0.2755, pruned_loss=0.08781, over 21526.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2866, pruned_loss=0.07187, over 4265772.89 frames. ], batch size: 442, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:55:59,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0 2023-06-21 23:56:33,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5 2023-06-21 23:57:20,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=862554.0, ans=0.0 2023-06-21 23:57:21,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=862554.0, ans=0.125 2023-06-21 23:57:52,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=862554.0, ans=0.2 2023-06-21 23:57:58,710 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.550e+02 2.899e+02 3.518e+02 5.551e+02, threshold=5.797e+02, percent-clipped=2.0 2023-06-21 23:57:59,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=862614.0, ans=0.2 2023-06-21 23:58:03,977 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:58:11,053 INFO [train.py:996] (2/4) Epoch 5, batch 21800, loss[loss=0.2542, simple_loss=0.3501, pruned_loss=0.07915, over 21750.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2867, pruned_loss=0.07328, over 4274100.72 frames. 
], batch size: 351, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:58:11,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=862674.0, ans=0.125 2023-06-21 23:58:20,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=862674.0, ans=0.2 2023-06-21 23:58:20,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=862674.0, ans=0.0 2023-06-21 23:58:28,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-21 23:59:26,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=862854.0, ans=0.1 2023-06-21 23:59:53,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-22 00:00:25,046 INFO [train.py:996] (2/4) Epoch 5, batch 21850, loss[loss=0.2066, simple_loss=0.2736, pruned_loss=0.06985, over 16567.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.291, pruned_loss=0.07375, over 4266360.06 frames. ], batch size: 60, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:00:28,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=862974.0, ans=0.0 2023-06-22 00:01:04,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=863034.0, ans=0.0 2023-06-22 00:01:07,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=863034.0, ans=0.125 2023-06-22 00:01:48,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=863094.0, ans=0.125 2023-06-22 00:02:32,165 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.444e+02 2.856e+02 3.585e+02 5.005e+02, threshold=5.712e+02, percent-clipped=0.0 2023-06-22 00:02:34,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=863214.0, ans=0.125 2023-06-22 00:02:43,747 INFO [train.py:996] (2/4) Epoch 5, batch 21900, loss[loss=0.1974, simple_loss=0.2549, pruned_loss=0.07, over 21641.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2915, pruned_loss=0.07498, over 4267060.43 frames. ], batch size: 247, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:03:05,402 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:03:50,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-22 00:03:59,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=863394.0, ans=0.05 2023-06-22 00:04:56,136 INFO [train.py:996] (2/4) Epoch 5, batch 21950, loss[loss=0.1806, simple_loss=0.258, pruned_loss=0.05158, over 21736.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2863, pruned_loss=0.07343, over 4254596.55 frames. 
], batch size: 351, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:05:04,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.96 vs. limit=10.0 2023-06-22 00:05:04,969 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-22 00:05:16,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=863634.0, ans=0.0 2023-06-22 00:05:46,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=863634.0, ans=0.125 2023-06-22 00:05:50,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-22 00:06:04,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.04 vs. limit=10.0 2023-06-22 00:06:45,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.222e+02 2.484e+02 2.757e+02 5.064e+02, threshold=4.969e+02, percent-clipped=0.0 2023-06-22 00:07:03,552 INFO [train.py:996] (2/4) Epoch 5, batch 22000, loss[loss=0.1717, simple_loss=0.2442, pruned_loss=0.04958, over 21609.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.281, pruned_loss=0.07095, over 4256540.02 frames. ], batch size: 247, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:07:04,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=863874.0, ans=0.125 2023-06-22 00:07:08,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=863874.0, ans=0.125 2023-06-22 00:08:52,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=864114.0, ans=0.125 2023-06-22 00:09:13,106 INFO [train.py:996] (2/4) Epoch 5, batch 22050, loss[loss=0.2737, simple_loss=0.3507, pruned_loss=0.09833, over 21740.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2866, pruned_loss=0.07205, over 4258231.92 frames. ], batch size: 351, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:10:09,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-22 00:11:04,344 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-06-22 00:11:10,347 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.782e+02 3.105e+02 3.760e+02 6.556e+02, threshold=6.210e+02, percent-clipped=5.0 2023-06-22 00:11:27,981 INFO [train.py:996] (2/4) Epoch 5, batch 22100, loss[loss=0.223, simple_loss=0.2966, pruned_loss=0.0747, over 21710.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2959, pruned_loss=0.07623, over 4252148.93 frames. ], batch size: 230, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:11:45,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. 
limit=6.0 2023-06-22 00:12:07,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=864534.0, ans=0.125 2023-06-22 00:12:26,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=864594.0, ans=10.0 2023-06-22 00:13:11,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=864654.0, ans=0.125 2023-06-22 00:13:14,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=864654.0, ans=0.5 2023-06-22 00:13:55,552 INFO [train.py:996] (2/4) Epoch 5, batch 22150, loss[loss=0.1944, simple_loss=0.2735, pruned_loss=0.05763, over 21902.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3004, pruned_loss=0.07794, over 4253635.64 frames. ], batch size: 98, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:15:29,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=864954.0, ans=0.125 2023-06-22 00:15:32,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=864954.0, ans=0.2 2023-06-22 00:15:36,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=864954.0, ans=0.125 2023-06-22 00:15:38,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=864954.0, ans=0.0 2023-06-22 00:15:40,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-22 00:15:44,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=865014.0, ans=0.125 2023-06-22 00:15:47,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=865014.0, ans=0.1 2023-06-22 00:15:54,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.802e+02 3.197e+02 3.821e+02 5.658e+02, threshold=6.394e+02, percent-clipped=0.0 2023-06-22 00:16:05,988 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:16:08,471 INFO [train.py:996] (2/4) Epoch 5, batch 22200, loss[loss=0.2217, simple_loss=0.3125, pruned_loss=0.06547, over 21425.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3033, pruned_loss=0.07951, over 4265785.31 frames. ], batch size: 194, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:17:46,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=865254.0, ans=0.05 2023-06-22 00:18:04,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=865254.0, ans=0.125 2023-06-22 00:18:35,914 INFO [train.py:996] (2/4) Epoch 5, batch 22250, loss[loss=0.2155, simple_loss=0.2851, pruned_loss=0.07293, over 21427.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3084, pruned_loss=0.08112, over 4269891.53 frames. 
], batch size: 211, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:20:15,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=865554.0, ans=0.125 2023-06-22 00:20:20,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=865614.0, ans=0.0 2023-06-22 00:20:30,951 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.644e+02 2.877e+02 3.306e+02 4.671e+02, threshold=5.753e+02, percent-clipped=0.0 2023-06-22 00:20:42,625 INFO [train.py:996] (2/4) Epoch 5, batch 22300, loss[loss=0.2417, simple_loss=0.3061, pruned_loss=0.08862, over 21216.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3107, pruned_loss=0.08278, over 4267461.58 frames. ], batch size: 143, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:21:10,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=865674.0, ans=0.125 2023-06-22 00:21:12,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=865674.0, ans=0.2 2023-06-22 00:22:23,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=865794.0, ans=0.1 2023-06-22 00:22:56,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=865914.0, ans=0.1 2023-06-22 00:22:56,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=865914.0, ans=0.1 2023-06-22 00:23:07,775 INFO [train.py:996] (2/4) Epoch 5, batch 22350, loss[loss=0.2148, simple_loss=0.2749, pruned_loss=0.07731, over 21229.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3085, pruned_loss=0.08327, over 4279501.78 frames. ], batch size: 608, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:23:44,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=866034.0, ans=0.125 2023-06-22 00:24:02,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=866034.0, ans=0.0 2023-06-22 00:24:36,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=866154.0, ans=0.025 2023-06-22 00:25:07,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=866214.0, ans=0.2 2023-06-22 00:25:08,178 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.517e+02 2.753e+02 3.146e+02 5.107e+02, threshold=5.506e+02, percent-clipped=0.0 2023-06-22 00:25:40,369 INFO [train.py:996] (2/4) Epoch 5, batch 22400, loss[loss=0.2226, simple_loss=0.2896, pruned_loss=0.07783, over 21628.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3058, pruned_loss=0.08071, over 4284424.93 frames. 
], batch size: 332, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:25:48,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=866274.0, ans=0.2 2023-06-22 00:26:46,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=866394.0, ans=0.125 2023-06-22 00:27:44,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=866574.0, ans=0.125 2023-06-22 00:27:44,993 INFO [train.py:996] (2/4) Epoch 5, batch 22450, loss[loss=0.2535, simple_loss=0.3189, pruned_loss=0.09398, over 20059.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3006, pruned_loss=0.07944, over 4267857.46 frames. ], batch size: 703, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:27:45,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-22 00:28:18,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=866634.0, ans=15.0 2023-06-22 00:29:11,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=866754.0, ans=0.125 2023-06-22 00:29:12,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=866754.0, ans=0.125 2023-06-22 00:29:33,982 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.555e+02 2.922e+02 4.023e+02 6.050e+02, threshold=5.844e+02, percent-clipped=3.0 2023-06-22 00:29:55,179 INFO [train.py:996] (2/4) Epoch 5, batch 22500, loss[loss=0.2961, simple_loss=0.3467, pruned_loss=0.1228, over 21364.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2953, pruned_loss=0.0789, over 4264926.89 frames. ], batch size: 507, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:30:09,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-22 00:31:45,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=867054.0, ans=0.2 2023-06-22 00:32:10,856 INFO [train.py:996] (2/4) Epoch 5, batch 22550, loss[loss=0.223, simple_loss=0.2932, pruned_loss=0.07636, over 21487.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2979, pruned_loss=0.07945, over 4267940.01 frames. ], batch size: 194, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:32:34,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-22 00:34:15,152 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.802e+02 3.271e+02 3.660e+02 5.759e+02, threshold=6.543e+02, percent-clipped=0.0 2023-06-22 00:34:22,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=867414.0, ans=0.125 2023-06-22 00:34:25,683 INFO [train.py:996] (2/4) Epoch 5, batch 22600, loss[loss=0.1978, simple_loss=0.2657, pruned_loss=0.06496, over 21370.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2997, pruned_loss=0.07945, over 4275717.56 frames. 
], batch size: 194, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:35:20,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=867534.0, ans=0.05 2023-06-22 00:35:58,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=867654.0, ans=0.125 2023-06-22 00:36:38,655 INFO [train.py:996] (2/4) Epoch 5, batch 22650, loss[loss=0.2436, simple_loss=0.3306, pruned_loss=0.07836, over 19921.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2966, pruned_loss=0.07855, over 4268314.87 frames. ], batch size: 702, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:37:09,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=867774.0, ans=0.1 2023-06-22 00:38:19,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=868014.0, ans=0.0 2023-06-22 00:38:33,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.499e+02 2.940e+02 3.814e+02 6.401e+02, threshold=5.879e+02, percent-clipped=0.0 2023-06-22 00:38:45,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=868014.0, ans=0.07 2023-06-22 00:38:58,702 INFO [train.py:996] (2/4) Epoch 5, batch 22700, loss[loss=0.1814, simple_loss=0.247, pruned_loss=0.05794, over 21733.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2908, pruned_loss=0.07822, over 4255958.49 frames. ], batch size: 124, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:39:22,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=868074.0, ans=0.125 2023-06-22 00:39:24,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=868134.0, ans=0.0 2023-06-22 00:39:28,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.21 vs. limit=22.5 2023-06-22 00:40:16,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5 2023-06-22 00:40:59,331 INFO [train.py:996] (2/4) Epoch 5, batch 22750, loss[loss=0.2614, simple_loss=0.3266, pruned_loss=0.09806, over 21747.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2918, pruned_loss=0.08001, over 4263253.43 frames. ], batch size: 332, lr: 6.07e-03, grad_scale: 32.0 2023-06-22 00:41:15,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=868374.0, ans=0.0 2023-06-22 00:42:36,255 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-22 00:42:58,632 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.771e+02 3.219e+02 3.779e+02 6.245e+02, threshold=6.438e+02, percent-clipped=2.0 2023-06-22 00:43:15,160 INFO [train.py:996] (2/4) Epoch 5, batch 22800, loss[loss=0.217, simple_loss=0.2775, pruned_loss=0.07822, over 21185.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2974, pruned_loss=0.08271, over 4271763.70 frames. 
], batch size: 608, lr: 6.06e-03, grad_scale: 32.0 2023-06-22 00:43:18,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=868674.0, ans=0.5 2023-06-22 00:44:21,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=868794.0, ans=0.125 2023-06-22 00:44:37,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=868854.0, ans=0.125 2023-06-22 00:45:06,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=868914.0, ans=0.1 2023-06-22 00:45:16,020 INFO [train.py:996] (2/4) Epoch 5, batch 22850, loss[loss=0.2093, simple_loss=0.2729, pruned_loss=0.07284, over 21813.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2937, pruned_loss=0.08155, over 4275341.20 frames. ], batch size: 118, lr: 6.06e-03, grad_scale: 32.0 2023-06-22 00:45:33,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=868974.0, ans=0.0 2023-06-22 00:46:07,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=869034.0, ans=0.125 2023-06-22 00:46:20,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.45 vs. limit=15.0 2023-06-22 00:47:07,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=869154.0, ans=0.125 2023-06-22 00:47:36,104 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.707e+02 3.130e+02 3.568e+02 4.939e+02, threshold=6.260e+02, percent-clipped=0.0 2023-06-22 00:47:59,186 INFO [train.py:996] (2/4) Epoch 5, batch 22900, loss[loss=0.2996, simple_loss=0.4185, pruned_loss=0.09035, over 19803.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2954, pruned_loss=0.08018, over 4263280.85 frames. ], batch size: 702, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:48:07,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=869274.0, ans=0.125 2023-06-22 00:48:09,925 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=15.0 2023-06-22 00:48:41,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=869334.0, ans=0.0 2023-06-22 00:49:18,579 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-22 00:49:36,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=869454.0, ans=0.025 2023-06-22 00:50:33,830 INFO [train.py:996] (2/4) Epoch 5, batch 22950, loss[loss=0.2386, simple_loss=0.354, pruned_loss=0.06162, over 21725.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3089, pruned_loss=0.07864, over 4267296.34 frames. 
], batch size: 332, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:50:34,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=869574.0, ans=0.0 2023-06-22 00:50:52,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=869634.0, ans=0.0 2023-06-22 00:50:59,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=869634.0, ans=0.1 2023-06-22 00:51:25,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=869694.0, ans=0.04949747468305833 2023-06-22 00:52:10,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=869814.0, ans=0.0 2023-06-22 00:52:23,390 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.567e+02 2.952e+02 3.530e+02 5.726e+02, threshold=5.904e+02, percent-clipped=0.0 2023-06-22 00:52:44,565 INFO [train.py:996] (2/4) Epoch 5, batch 23000, loss[loss=0.2322, simple_loss=0.3035, pruned_loss=0.08044, over 21867.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3063, pruned_loss=0.07681, over 4265106.42 frames. ], batch size: 118, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:53:41,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-06-22 00:53:56,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=869994.0, ans=0.125 2023-06-22 00:53:59,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=870054.0, ans=0.025 2023-06-22 00:54:01,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=870054.0, ans=0.125 2023-06-22 00:54:03,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-22 00:55:02,669 INFO [train.py:996] (2/4) Epoch 5, batch 23050, loss[loss=0.2345, simple_loss=0.3034, pruned_loss=0.08286, over 20666.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3094, pruned_loss=0.0798, over 4268942.74 frames. ], batch size: 607, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:55:07,464 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:55:38,757 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5 2023-06-22 00:55:44,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=870234.0, ans=0.0 2023-06-22 00:56:17,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=870294.0, ans=0.125 2023-06-22 00:57:09,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 2.630e+02 3.010e+02 3.445e+02 5.620e+02, threshold=6.019e+02, percent-clipped=0.0 2023-06-22 00:57:18,354 INFO [train.py:996] (2/4) Epoch 5, batch 23100, loss[loss=0.1844, simple_loss=0.2451, pruned_loss=0.06182, over 21583.00 frames. 
], tot_loss[loss=0.2319, simple_loss=0.3042, pruned_loss=0.07983, over 4268770.99 frames. ], batch size: 247, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:59:20,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=870714.0, ans=0.0 2023-06-22 00:59:29,219 INFO [train.py:996] (2/4) Epoch 5, batch 23150, loss[loss=0.2406, simple_loss=0.3108, pruned_loss=0.08525, over 21848.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2974, pruned_loss=0.07862, over 4271738.63 frames. ], batch size: 118, lr: 6.06e-03, grad_scale: 16.0 2023-06-22 00:59:29,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=870774.0, ans=0.125 2023-06-22 00:59:44,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=870774.0, ans=0.0 2023-06-22 01:01:23,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=871014.0, ans=0.1 2023-06-22 01:01:30,214 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.556e+02 2.836e+02 3.479e+02 5.811e+02, threshold=5.672e+02, percent-clipped=0.0 2023-06-22 01:01:30,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=871014.0, ans=0.0 2023-06-22 01:01:32,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=871014.0, ans=0.0 2023-06-22 01:01:38,836 INFO [train.py:996] (2/4) Epoch 5, batch 23200, loss[loss=0.24, simple_loss=0.3055, pruned_loss=0.0872, over 21742.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2967, pruned_loss=0.07942, over 4278906.29 frames. ], batch size: 389, lr: 6.06e-03, grad_scale: 32.0 2023-06-22 01:01:51,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-22 01:02:58,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=871254.0, ans=0.2 2023-06-22 01:03:35,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=871314.0, ans=0.125 2023-06-22 01:03:54,138 INFO [train.py:996] (2/4) Epoch 5, batch 23250, loss[loss=0.2417, simple_loss=0.3119, pruned_loss=0.08575, over 21754.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2977, pruned_loss=0.08046, over 4281808.52 frames. ], batch size: 112, lr: 6.06e-03, grad_scale: 32.0 2023-06-22 01:04:32,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-22 01:04:42,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=871494.0, ans=0.0 2023-06-22 01:05:01,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=871494.0, ans=0.125 2023-06-22 01:05:39,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. 
limit=12.0 2023-06-22 01:05:53,625 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.691e+02 3.046e+02 3.824e+02 6.255e+02, threshold=6.093e+02, percent-clipped=3.0 2023-06-22 01:06:02,703 INFO [train.py:996] (2/4) Epoch 5, batch 23300, loss[loss=0.2887, simple_loss=0.3957, pruned_loss=0.09087, over 21668.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3068, pruned_loss=0.08249, over 4283839.97 frames. ], batch size: 389, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:06:45,374 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:08:37,162 INFO [train.py:996] (2/4) Epoch 5, batch 23350, loss[loss=0.2426, simple_loss=0.3222, pruned_loss=0.08155, over 19933.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3106, pruned_loss=0.08085, over 4274595.42 frames. ], batch size: 702, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:09:07,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.83 vs. limit=15.0 2023-06-22 01:09:09,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=872034.0, ans=0.125 2023-06-22 01:10:07,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=872154.0, ans=10.0 2023-06-22 01:10:15,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=872154.0, ans=0.125 2023-06-22 01:10:31,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.322e+02 2.505e+02 2.999e+02 5.635e+02, threshold=5.010e+02, percent-clipped=0.0 2023-06-22 01:10:48,127 INFO [train.py:996] (2/4) Epoch 5, batch 23400, loss[loss=0.2239, simple_loss=0.2993, pruned_loss=0.07421, over 21827.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3039, pruned_loss=0.07753, over 4276830.83 frames. ], batch size: 282, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:11:34,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=872334.0, ans=0.2 2023-06-22 01:12:18,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-22 01:12:30,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=872454.0, ans=0.125 2023-06-22 01:13:11,270 INFO [train.py:996] (2/4) Epoch 5, batch 23450, loss[loss=0.2345, simple_loss=0.3033, pruned_loss=0.08282, over 21951.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3052, pruned_loss=0.08033, over 4284544.70 frames. ], batch size: 316, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:13:47,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=872634.0, ans=0.0 2023-06-22 01:14:16,366 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=22.5 2023-06-22 01:14:46,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=872754.0, ans=0.07 2023-06-22 01:14:47,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. 
limit=15.0 2023-06-22 01:15:13,229 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.562e+02 2.957e+02 3.488e+02 7.125e+02, threshold=5.914e+02, percent-clipped=6.0 2023-06-22 01:15:35,309 INFO [train.py:996] (2/4) Epoch 5, batch 23500, loss[loss=0.2192, simple_loss=0.2745, pruned_loss=0.08197, over 21252.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.305, pruned_loss=0.08138, over 4279049.88 frames. ], batch size: 608, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:16:43,387 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:17:32,719 INFO [train.py:996] (2/4) Epoch 5, batch 23550, loss[loss=0.2036, simple_loss=0.2648, pruned_loss=0.07119, over 21687.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3012, pruned_loss=0.08088, over 4273003.03 frames. ], batch size: 333, lr: 6.05e-03, grad_scale: 16.0 2023-06-22 01:17:40,006 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-22 01:18:19,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=873234.0, ans=0.0 2023-06-22 01:18:25,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=873234.0, ans=0.125 2023-06-22 01:18:36,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=873294.0, ans=0.0 2023-06-22 01:18:50,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=873354.0, ans=0.0 2023-06-22 01:19:25,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=873414.0, ans=0.125 2023-06-22 01:19:27,628 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.576e+02 2.870e+02 3.585e+02 5.605e+02, threshold=5.739e+02, percent-clipped=0.0 2023-06-22 01:19:45,018 INFO [train.py:996] (2/4) Epoch 5, batch 23600, loss[loss=0.2297, simple_loss=0.3081, pruned_loss=0.07567, over 21746.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3013, pruned_loss=0.08096, over 4276497.75 frames. ], batch size: 332, lr: 6.05e-03, grad_scale: 32.0 2023-06-22 01:20:25,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.10 vs. limit=10.0 2023-06-22 01:20:43,889 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:20:46,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=873594.0, ans=0.125 2023-06-22 01:21:22,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=873654.0, ans=0.125 2023-06-22 01:22:05,162 INFO [train.py:996] (2/4) Epoch 5, batch 23650, loss[loss=0.203, simple_loss=0.2947, pruned_loss=0.05563, over 21264.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3011, pruned_loss=0.07897, over 4281168.33 frames. ], batch size: 548, lr: 6.05e-03, grad_scale: 32.0 2023-06-22 01:22:28,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.30 vs. 
limit=22.5 2023-06-22 01:23:09,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=873894.0, ans=0.125 2023-06-22 01:23:09,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=873894.0, ans=0.125 2023-06-22 01:23:12,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=873894.0, ans=0.0 2023-06-22 01:24:28,599 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.418e+02 2.647e+02 3.193e+02 5.088e+02, threshold=5.293e+02, percent-clipped=0.0 2023-06-22 01:24:41,718 INFO [train.py:996] (2/4) Epoch 5, batch 23700, loss[loss=0.2046, simple_loss=0.2983, pruned_loss=0.05542, over 20696.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3049, pruned_loss=0.07882, over 4282603.81 frames. ], batch size: 607, lr: 6.05e-03, grad_scale: 32.0 2023-06-22 01:25:05,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=874134.0, ans=0.025 2023-06-22 01:25:07,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-22 01:25:13,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=874134.0, ans=0.125 2023-06-22 01:26:18,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=874254.0, ans=0.125 2023-06-22 01:26:40,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=874314.0, ans=0.125 2023-06-22 01:26:55,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-22 01:26:56,482 INFO [train.py:996] (2/4) Epoch 5, batch 23750, loss[loss=0.1913, simple_loss=0.2923, pruned_loss=0.04515, over 21636.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3071, pruned_loss=0.07962, over 4276932.08 frames. ], batch size: 263, lr: 6.05e-03, grad_scale: 32.0 2023-06-22 01:27:12,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=874434.0, ans=0.1 2023-06-22 01:27:17,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=874434.0, ans=0.1 2023-06-22 01:28:14,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=874554.0, ans=0.125 2023-06-22 01:29:02,181 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.409e+02 2.705e+02 3.135e+02 4.675e+02, threshold=5.410e+02, percent-clipped=0.0 2023-06-22 01:29:02,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=874614.0, ans=0.125 2023-06-22 01:29:02,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=874614.0, ans=0.5 2023-06-22 01:29:10,007 INFO [train.py:996] (2/4) Epoch 5, batch 23800, loss[loss=0.2378, simple_loss=0.3115, pruned_loss=0.08206, over 21253.00 frames. 
], tot_loss[loss=0.23, simple_loss=0.3049, pruned_loss=0.07753, over 4270220.78 frames. ], batch size: 159, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:29:24,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=874734.0, ans=0.125 2023-06-22 01:31:10,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=874854.0, ans=0.2 2023-06-22 01:31:11,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=874914.0, ans=0.0 2023-06-22 01:31:28,759 INFO [train.py:996] (2/4) Epoch 5, batch 23850, loss[loss=0.2422, simple_loss=0.317, pruned_loss=0.08372, over 21361.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3145, pruned_loss=0.08041, over 4275429.19 frames. ], batch size: 176, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:32:23,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-22 01:32:47,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=875094.0, ans=0.125 2023-06-22 01:33:10,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=875154.0, ans=0.125 2023-06-22 01:33:39,653 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.789e+02 3.123e+02 3.786e+02 8.749e+02, threshold=6.247e+02, percent-clipped=6.0 2023-06-22 01:33:56,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=875214.0, ans=0.2 2023-06-22 01:34:00,419 INFO [train.py:996] (2/4) Epoch 5, batch 23900, loss[loss=0.2548, simple_loss=0.3307, pruned_loss=0.08949, over 21986.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3213, pruned_loss=0.08257, over 4278418.70 frames. ], batch size: 103, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:34:21,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-22 01:35:54,012 INFO [train.py:996] (2/4) Epoch 5, batch 23950, loss[loss=0.2238, simple_loss=0.2825, pruned_loss=0.08256, over 21699.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3144, pruned_loss=0.08252, over 4270719.32 frames. ], batch size: 247, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:36:05,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=875574.0, ans=0.125 2023-06-22 01:36:05,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.89 vs. 
limit=22.5 2023-06-22 01:36:08,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=875574.0, ans=0.125 2023-06-22 01:37:13,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=875694.0, ans=0.125 2023-06-22 01:37:55,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.517e+02 2.934e+02 3.377e+02 5.286e+02, threshold=5.869e+02, percent-clipped=0.0 2023-06-22 01:38:06,338 INFO [train.py:996] (2/4) Epoch 5, batch 24000, loss[loss=0.2919, simple_loss=0.3547, pruned_loss=0.1146, over 21838.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3166, pruned_loss=0.08585, over 4266578.45 frames. ], batch size: 441, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:38:06,339 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 01:38:46,522 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2672, simple_loss=0.3621, pruned_loss=0.08617, over 1796401.00 frames. 2023-06-22 01:38:46,523 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-22 01:39:37,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=875934.0, ans=0.125 2023-06-22 01:39:46,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=875994.0, ans=0.0 2023-06-22 01:40:22,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=876054.0, ans=0.125 2023-06-22 01:40:47,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=12.0 2023-06-22 01:41:22,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=12.0 2023-06-22 01:41:23,103 INFO [train.py:996] (2/4) Epoch 5, batch 24050, loss[loss=0.1985, simple_loss=0.2861, pruned_loss=0.0554, over 21711.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3175, pruned_loss=0.08554, over 4270451.14 frames. 
], batch size: 247, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:41:55,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=876234.0, ans=0.1 2023-06-22 01:42:46,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=876354.0, ans=0.1 2023-06-22 01:42:54,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=876354.0, ans=0.125 2023-06-22 01:43:11,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=876414.0, ans=0.125 2023-06-22 01:43:11,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=876414.0, ans=0.125 2023-06-22 01:43:23,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=876414.0, ans=0.2 2023-06-22 01:43:24,679 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.535e+02 2.835e+02 3.239e+02 5.691e+02, threshold=5.670e+02, percent-clipped=0.0 2023-06-22 01:43:34,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=876414.0, ans=0.0 2023-06-22 01:43:36,732 INFO [train.py:996] (2/4) Epoch 5, batch 24100, loss[loss=0.2285, simple_loss=0.2992, pruned_loss=0.07886, over 21156.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3174, pruned_loss=0.08391, over 4273601.82 frames. ], batch size: 143, lr: 6.04e-03, grad_scale: 32.0 2023-06-22 01:44:05,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=22.5 2023-06-22 01:44:20,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-22 01:44:20,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-22 01:44:21,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=876594.0, ans=0.07 2023-06-22 01:45:49,885 INFO [train.py:996] (2/4) Epoch 5, batch 24150, loss[loss=0.2279, simple_loss=0.2977, pruned_loss=0.07908, over 21927.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3155, pruned_loss=0.08415, over 4273974.89 frames. ], batch size: 316, lr: 6.04e-03, grad_scale: 16.0 2023-06-22 01:46:22,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=876834.0, ans=0.1 2023-06-22 01:46:22,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=876834.0, ans=0.125 2023-06-22 01:46:26,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=876894.0, ans=0.125 2023-06-22 01:46:53,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=876894.0, ans=0.125 2023-06-22 01:47:26,010 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.23 vs. 
limit=10.0 2023-06-22 01:48:02,121 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.649e+02 3.023e+02 3.593e+02 4.486e+02, threshold=6.046e+02, percent-clipped=0.0 2023-06-22 01:48:06,735 INFO [train.py:996] (2/4) Epoch 5, batch 24200, loss[loss=0.35, simple_loss=0.411, pruned_loss=0.1445, over 21507.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3185, pruned_loss=0.08567, over 4276763.54 frames. ], batch size: 508, lr: 6.04e-03, grad_scale: 16.0 2023-06-22 01:48:16,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=877074.0, ans=0.2 2023-06-22 01:50:13,923 INFO [train.py:996] (2/4) Epoch 5, batch 24250, loss[loss=0.1931, simple_loss=0.2981, pruned_loss=0.04407, over 21857.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3157, pruned_loss=0.07912, over 4277386.11 frames. ], batch size: 371, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 01:52:16,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=877554.0, ans=0.125 2023-06-22 01:52:32,153 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.979e+02 2.289e+02 2.942e+02 4.339e+02, threshold=4.579e+02, percent-clipped=0.0 2023-06-22 01:52:36,816 INFO [train.py:996] (2/4) Epoch 5, batch 24300, loss[loss=0.1955, simple_loss=0.2614, pruned_loss=0.06473, over 21816.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.308, pruned_loss=0.07343, over 4277393.67 frames. ], batch size: 107, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 01:53:03,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=877734.0, ans=0.125 2023-06-22 01:54:56,431 INFO [train.py:996] (2/4) Epoch 5, batch 24350, loss[loss=0.2451, simple_loss=0.308, pruned_loss=0.09112, over 20941.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.304, pruned_loss=0.07373, over 4284489.21 frames. ], batch size: 607, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 01:57:07,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.557e+02 2.877e+02 3.455e+02 4.976e+02, threshold=5.754e+02, percent-clipped=3.0 2023-06-22 01:57:18,435 INFO [train.py:996] (2/4) Epoch 5, batch 24400, loss[loss=0.2338, simple_loss=0.3066, pruned_loss=0.08046, over 21691.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3075, pruned_loss=0.07691, over 4283421.15 frames. ], batch size: 298, lr: 6.03e-03, grad_scale: 32.0 2023-06-22 01:57:44,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=878334.0, ans=0.0 2023-06-22 01:58:05,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=878334.0, ans=0.125 2023-06-22 01:58:12,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=878334.0, ans=0.1 2023-06-22 01:58:52,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=878454.0, ans=0.1 2023-06-22 01:59:40,621 INFO [train.py:996] (2/4) Epoch 5, batch 24450, loss[loss=0.3332, simple_loss=0.4069, pruned_loss=0.1298, over 21458.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3104, pruned_loss=0.0786, over 4281310.15 frames. 
], batch size: 508, lr: 6.03e-03, grad_scale: 32.0 2023-06-22 02:00:14,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=878634.0, ans=0.125 2023-06-22 02:00:59,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=878694.0, ans=0.0 2023-06-22 02:01:02,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=878694.0, ans=0.1 2023-06-22 02:01:10,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=878754.0, ans=0.0 2023-06-22 02:01:41,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=878814.0, ans=0.025 2023-06-22 02:01:46,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.690e+02 3.074e+02 4.098e+02 6.316e+02, threshold=6.149e+02, percent-clipped=3.0 2023-06-22 02:01:47,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=878814.0, ans=0.0 2023-06-22 02:02:00,007 INFO [train.py:996] (2/4) Epoch 5, batch 24500, loss[loss=0.2185, simple_loss=0.2851, pruned_loss=0.07595, over 21219.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3099, pruned_loss=0.07847, over 4278107.62 frames. ], batch size: 608, lr: 6.03e-03, grad_scale: 32.0 2023-06-22 02:02:00,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=878874.0, ans=0.125 2023-06-22 02:02:26,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=878874.0, ans=0.05 2023-06-22 02:03:08,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=878994.0, ans=0.125 2023-06-22 02:03:16,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=878994.0, ans=0.125 2023-06-22 02:03:16,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=878994.0, ans=0.125 2023-06-22 02:03:21,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=878994.0, ans=0.1 2023-06-22 02:03:24,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=879054.0, ans=0.125 2023-06-22 02:03:34,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=879054.0, ans=0.2 2023-06-22 02:04:19,099 INFO [train.py:996] (2/4) Epoch 5, batch 24550, loss[loss=0.2933, simple_loss=0.3622, pruned_loss=0.1122, over 21237.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3147, pruned_loss=0.08239, over 4284234.80 frames. ], batch size: 143, lr: 6.03e-03, grad_scale: 32.0 2023-06-22 02:04:21,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. 
limit=15.0 2023-06-22 02:04:39,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=879174.0, ans=0.0 2023-06-22 02:06:26,333 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=15.0 2023-06-22 02:06:28,116 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.618e+02 2.906e+02 3.324e+02 4.617e+02, threshold=5.812e+02, percent-clipped=0.0 2023-06-22 02:06:31,234 INFO [train.py:996] (2/4) Epoch 5, batch 24600, loss[loss=0.2034, simple_loss=0.2564, pruned_loss=0.07519, over 21149.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3105, pruned_loss=0.08329, over 4276963.35 frames. ], batch size: 143, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 02:07:46,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-22 02:08:19,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=879714.0, ans=0.125 2023-06-22 02:08:40,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.76 vs. limit=15.0 2023-06-22 02:08:40,857 INFO [train.py:996] (2/4) Epoch 5, batch 24650, loss[loss=0.237, simple_loss=0.3635, pruned_loss=0.05518, over 19850.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3033, pruned_loss=0.08137, over 4273677.99 frames. ], batch size: 702, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 02:08:41,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-22 02:09:24,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=879834.0, ans=0.0 2023-06-22 02:09:25,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-22 02:10:31,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-22 02:10:55,274 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.542e+02 2.905e+02 3.356e+02 5.377e+02, threshold=5.810e+02, percent-clipped=0.0 2023-06-22 02:10:58,331 INFO [train.py:996] (2/4) Epoch 5, batch 24700, loss[loss=0.1884, simple_loss=0.2455, pruned_loss=0.06566, over 20667.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3012, pruned_loss=0.07956, over 4261630.97 frames. ], batch size: 607, lr: 6.03e-03, grad_scale: 16.0 2023-06-22 02:11:34,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=880134.0, ans=0.025 2023-06-22 02:11:49,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=880194.0, ans=15.0 2023-06-22 02:11:51,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.34 vs. 
limit=6.0 2023-06-22 02:11:57,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=880194.0, ans=0.125 2023-06-22 02:12:34,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=880314.0, ans=0.125 2023-06-22 02:12:36,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=880314.0, ans=0.125 2023-06-22 02:12:51,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=880314.0, ans=0.2 2023-06-22 02:12:54,958 INFO [train.py:996] (2/4) Epoch 5, batch 24750, loss[loss=0.1789, simple_loss=0.2424, pruned_loss=0.05771, over 21207.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2947, pruned_loss=0.07698, over 4262793.29 frames. ], batch size: 159, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:13:37,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. limit=10.0 2023-06-22 02:13:39,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=880434.0, ans=0.125 2023-06-22 02:13:41,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=880434.0, ans=0.0 2023-06-22 02:13:43,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.37 vs. limit=10.0 2023-06-22 02:13:47,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=880434.0, ans=0.0 2023-06-22 02:13:48,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=880434.0, ans=0.2 2023-06-22 02:14:08,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=880494.0, ans=0.0 2023-06-22 02:14:27,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-22 02:15:04,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.468e+02 2.817e+02 3.299e+02 5.341e+02, threshold=5.634e+02, percent-clipped=0.0 2023-06-22 02:15:06,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-22 02:15:06,737 INFO [train.py:996] (2/4) Epoch 5, batch 24800, loss[loss=0.2191, simple_loss=0.2851, pruned_loss=0.07654, over 21391.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2894, pruned_loss=0.07587, over 4271108.91 frames. ], batch size: 194, lr: 6.02e-03, grad_scale: 32.0 2023-06-22 02:16:45,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=880854.0, ans=0.2 2023-06-22 02:17:22,833 INFO [train.py:996] (2/4) Epoch 5, batch 24850, loss[loss=0.2939, simple_loss=0.3597, pruned_loss=0.1141, over 21544.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2895, pruned_loss=0.07748, over 4270314.85 frames. 
], batch size: 471, lr: 6.02e-03, grad_scale: 32.0 2023-06-22 02:17:29,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=880974.0, ans=0.1 2023-06-22 02:18:27,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=881094.0, ans=0.0 2023-06-22 02:19:27,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-22 02:19:35,606 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.607e+02 3.027e+02 3.615e+02 6.896e+02, threshold=6.053e+02, percent-clipped=2.0 2023-06-22 02:19:38,480 INFO [train.py:996] (2/4) Epoch 5, batch 24900, loss[loss=0.2529, simple_loss=0.3297, pruned_loss=0.08806, over 21484.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2921, pruned_loss=0.07796, over 4275232.76 frames. ], batch size: 131, lr: 6.02e-03, grad_scale: 32.0 2023-06-22 02:20:27,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-22 02:20:32,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-22 02:21:00,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=881394.0, ans=0.1 2023-06-22 02:21:02,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.09 vs. limit=22.5 2023-06-22 02:21:05,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=881454.0, ans=0.125 2023-06-22 02:22:16,531 INFO [train.py:996] (2/4) Epoch 5, batch 24950, loss[loss=0.278, simple_loss=0.3489, pruned_loss=0.1036, over 21388.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3012, pruned_loss=0.08262, over 4277307.99 frames. ], batch size: 159, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:22:23,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=881574.0, ans=0.0 2023-06-22 02:22:54,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-22 02:24:34,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.886e+02 3.310e+02 3.841e+02 8.420e+02, threshold=6.620e+02, percent-clipped=6.0 2023-06-22 02:24:35,700 INFO [train.py:996] (2/4) Epoch 5, batch 25000, loss[loss=0.2316, simple_loss=0.3056, pruned_loss=0.07881, over 20744.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3091, pruned_loss=0.0852, over 4280515.81 frames. 
], batch size: 607, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:24:36,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=881874.0, ans=0.0 2023-06-22 02:24:42,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=881874.0, ans=0.04949747468305833 2023-06-22 02:25:33,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=881994.0, ans=0.0 2023-06-22 02:26:43,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=882174.0, ans=0.125 2023-06-22 02:26:44,210 INFO [train.py:996] (2/4) Epoch 5, batch 25050, loss[loss=0.23, simple_loss=0.284, pruned_loss=0.08798, over 21865.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3019, pruned_loss=0.08393, over 4278236.25 frames. ], batch size: 373, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:26:45,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-22 02:27:18,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=882234.0, ans=0.125 2023-06-22 02:27:21,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=882294.0, ans=0.125 2023-06-22 02:28:36,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=882414.0, ans=0.125 2023-06-22 02:28:50,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=882414.0, ans=0.1 2023-06-22 02:28:57,817 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.524e+02 2.786e+02 3.350e+02 6.215e+02, threshold=5.572e+02, percent-clipped=0.0 2023-06-22 02:28:59,709 INFO [train.py:996] (2/4) Epoch 5, batch 25100, loss[loss=0.2269, simple_loss=0.2837, pruned_loss=0.08498, over 21688.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.295, pruned_loss=0.08242, over 4280003.08 frames. ], batch size: 417, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:29:00,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=882474.0, ans=0.1 2023-06-22 02:29:54,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=882594.0, ans=0.125 2023-06-22 02:30:25,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=882654.0, ans=0.125 2023-06-22 02:30:28,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=882714.0, ans=0.1 2023-06-22 02:31:07,267 INFO [train.py:996] (2/4) Epoch 5, batch 25150, loss[loss=0.226, simple_loss=0.3087, pruned_loss=0.07166, over 21916.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2992, pruned_loss=0.08, over 4261645.18 frames. 
], batch size: 316, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:31:25,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=882834.0, ans=0.125 2023-06-22 02:33:00,568 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.215e+02 2.479e+02 2.821e+02 4.157e+02, threshold=4.958e+02, percent-clipped=0.0 2023-06-22 02:33:02,328 INFO [train.py:996] (2/4) Epoch 5, batch 25200, loss[loss=0.1912, simple_loss=0.2695, pruned_loss=0.05646, over 21363.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2993, pruned_loss=0.07784, over 4268701.70 frames. ], batch size: 131, lr: 6.02e-03, grad_scale: 32.0 2023-06-22 02:33:14,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=883074.0, ans=0.0 2023-06-22 02:33:19,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=883074.0, ans=0.125 2023-06-22 02:33:20,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=883074.0, ans=0.125 2023-06-22 02:33:50,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=883194.0, ans=0.125 2023-06-22 02:34:15,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=883194.0, ans=0.2 2023-06-22 02:34:18,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-22 02:35:17,803 INFO [train.py:996] (2/4) Epoch 5, batch 25250, loss[loss=0.2411, simple_loss=0.2978, pruned_loss=0.09224, over 21521.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2969, pruned_loss=0.07598, over 4252323.38 frames. ], batch size: 414, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:35:41,877 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:35:43,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2023-06-22 02:37:20,504 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.367e+02 2.631e+02 2.996e+02 4.981e+02, threshold=5.263e+02, percent-clipped=1.0 2023-06-22 02:37:27,882 INFO [train.py:996] (2/4) Epoch 5, batch 25300, loss[loss=0.2372, simple_loss=0.3094, pruned_loss=0.08246, over 21311.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2946, pruned_loss=0.07464, over 4259568.23 frames. 
], batch size: 176, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:37:52,234 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:37:52,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=883734.0, ans=0.125 2023-06-22 02:37:56,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=883734.0, ans=0.0 2023-06-22 02:39:39,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=883914.0, ans=0.125 2023-06-22 02:39:44,738 INFO [train.py:996] (2/4) Epoch 5, batch 25350, loss[loss=0.2256, simple_loss=0.3062, pruned_loss=0.07257, over 21465.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2992, pruned_loss=0.07497, over 4260083.27 frames. ], batch size: 471, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:40:05,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-22 02:41:19,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=884154.0, ans=0.125 2023-06-22 02:41:31,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=884214.0, ans=0.125 2023-06-22 02:41:59,338 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 2.361e+02 2.646e+02 3.091e+02 5.117e+02, threshold=5.293e+02, percent-clipped=0.0 2023-06-22 02:41:59,362 INFO [train.py:996] (2/4) Epoch 5, batch 25400, loss[loss=0.22, simple_loss=0.2882, pruned_loss=0.07589, over 21606.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2941, pruned_loss=0.07405, over 4265891.25 frames. ], batch size: 298, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:42:06,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-22 02:42:14,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=884334.0, ans=0.125 2023-06-22 02:43:05,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=884394.0, ans=0.1 2023-06-22 02:43:08,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=884394.0, ans=0.0 2023-06-22 02:44:14,162 INFO [train.py:996] (2/4) Epoch 5, batch 25450, loss[loss=0.2028, simple_loss=0.2896, pruned_loss=0.05801, over 21269.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2957, pruned_loss=0.0755, over 4267018.60 frames. ], batch size: 176, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:44:14,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=884574.0, ans=0.0 2023-06-22 02:44:51,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-22 02:44:51,527 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. 
limit=15.0 2023-06-22 02:44:57,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=884694.0, ans=0.1 2023-06-22 02:46:12,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.23 vs. limit=22.5 2023-06-22 02:46:30,115 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.236e+02 2.528e+02 3.001e+02 4.958e+02, threshold=5.055e+02, percent-clipped=0.0 2023-06-22 02:46:30,138 INFO [train.py:996] (2/4) Epoch 5, batch 25500, loss[loss=0.1805, simple_loss=0.2678, pruned_loss=0.04661, over 21415.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2945, pruned_loss=0.07182, over 4252816.43 frames. ], batch size: 194, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:47:17,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=884934.0, ans=0.1 2023-06-22 02:47:26,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=884994.0, ans=0.1 2023-06-22 02:47:47,757 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-22 02:48:42,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=885114.0, ans=0.125 2023-06-22 02:48:47,572 INFO [train.py:996] (2/4) Epoch 5, batch 25550, loss[loss=0.2891, simple_loss=0.3755, pruned_loss=0.1013, over 21566.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3033, pruned_loss=0.07318, over 4247320.28 frames. ], batch size: 471, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:50:59,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-06-22 02:51:01,402 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.415e+02 2.727e+02 3.146e+02 6.002e+02, threshold=5.455e+02, percent-clipped=2.0 2023-06-22 02:51:01,425 INFO [train.py:996] (2/4) Epoch 5, batch 25600, loss[loss=0.2451, simple_loss=0.3163, pruned_loss=0.0869, over 21946.00 frames. ], tot_loss[loss=0.228, simple_loss=0.307, pruned_loss=0.07451, over 4249993.13 frames. ], batch size: 316, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:52:40,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.14 vs. limit=15.0 2023-06-22 02:53:11,095 INFO [train.py:996] (2/4) Epoch 5, batch 25650, loss[loss=0.1968, simple_loss=0.2693, pruned_loss=0.06213, over 21800.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3079, pruned_loss=0.07751, over 4251506.20 frames. 
], batch size: 317, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:54:36,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=885954.0, ans=0.125 2023-06-22 02:54:38,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=885954.0, ans=0.125 2023-06-22 02:54:38,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=885954.0, ans=0.0 2023-06-22 02:54:48,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=885954.0, ans=0.125 2023-06-22 02:54:51,581 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=15.0 2023-06-22 02:55:23,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-22 02:55:24,465 INFO [train.py:996] (2/4) Epoch 5, batch 25700, loss[loss=0.2085, simple_loss=0.2746, pruned_loss=0.07118, over 21494.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3034, pruned_loss=0.07807, over 4251799.71 frames. ], batch size: 212, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:55:40,589 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.593e+02 2.983e+02 3.503e+02 5.289e+02, threshold=5.966e+02, percent-clipped=0.0 2023-06-22 02:57:21,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=886314.0, ans=0.05 2023-06-22 02:57:30,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=886314.0, ans=0.1 2023-06-22 02:57:32,865 INFO [train.py:996] (2/4) Epoch 5, batch 25750, loss[loss=0.3759, simple_loss=0.4538, pruned_loss=0.149, over 21469.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3105, pruned_loss=0.08169, over 4258185.56 frames. ], batch size: 471, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 02:58:41,527 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:59:12,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-22 02:59:21,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-22 02:59:22,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=886554.0, ans=0.125 2023-06-22 02:59:25,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=886554.0, ans=0.1 2023-06-22 02:59:48,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-22 03:00:20,180 INFO [train.py:996] (2/4) Epoch 5, batch 25800, loss[loss=0.2336, simple_loss=0.3102, pruned_loss=0.07843, over 21625.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3226, pruned_loss=0.08638, over 4259958.57 frames. 
], batch size: 263, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:00:21,877 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.816e+02 3.413e+02 4.279e+02 8.490e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-22 03:01:02,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=886734.0, ans=0.125 2023-06-22 03:01:03,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=886734.0, ans=0.125 2023-06-22 03:01:25,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=886794.0, ans=0.0 2023-06-22 03:01:25,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=886794.0, ans=0.0 2023-06-22 03:01:37,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-22 03:01:38,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=886794.0, ans=0.0 2023-06-22 03:02:08,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.24 vs. limit=15.0 2023-06-22 03:02:41,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=886974.0, ans=0.0 2023-06-22 03:02:42,721 INFO [train.py:996] (2/4) Epoch 5, batch 25850, loss[loss=0.2583, simple_loss=0.3218, pruned_loss=0.09742, over 21783.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3231, pruned_loss=0.08575, over 4264377.67 frames. ], batch size: 441, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:03:05,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=886974.0, ans=0.125 2023-06-22 03:03:46,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=887094.0, ans=0.2 2023-06-22 03:04:04,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=887154.0, ans=0.125 2023-06-22 03:04:16,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=887154.0, ans=0.07 2023-06-22 03:04:48,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=887214.0, ans=0.0 2023-06-22 03:04:58,322 INFO [train.py:996] (2/4) Epoch 5, batch 25900, loss[loss=0.2791, simple_loss=0.3626, pruned_loss=0.09782, over 21691.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3233, pruned_loss=0.0858, over 4269914.19 frames. ], batch size: 247, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:04:59,692 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.721e+02 3.071e+02 3.513e+02 5.338e+02, threshold=6.142e+02, percent-clipped=0.0 2023-06-22 03:05:35,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=887334.0, ans=0.2 2023-06-22 03:06:55,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.97 vs. 
limit=15.0 2023-06-22 03:07:08,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=887514.0, ans=0.125 2023-06-22 03:07:09,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=887514.0, ans=0.125 2023-06-22 03:07:18,014 INFO [train.py:996] (2/4) Epoch 5, batch 25950, loss[loss=0.2636, simple_loss=0.326, pruned_loss=0.1006, over 21320.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3291, pruned_loss=0.08876, over 4275546.37 frames. ], batch size: 549, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:07:19,894 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:07:24,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=887574.0, ans=0.125 2023-06-22 03:07:43,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-22 03:07:55,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=887634.0, ans=0.0 2023-06-22 03:09:14,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=887754.0, ans=0.0 2023-06-22 03:09:32,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=887814.0, ans=0.125 2023-06-22 03:09:40,577 INFO [train.py:996] (2/4) Epoch 5, batch 26000, loss[loss=0.2184, simple_loss=0.302, pruned_loss=0.06735, over 21821.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3276, pruned_loss=0.08644, over 4271112.87 frames. ], batch size: 282, lr: 6.00e-03, grad_scale: 32.0 2023-06-22 03:09:42,002 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.564e+02 2.907e+02 3.379e+02 5.318e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-22 03:09:48,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=887874.0, ans=0.2 2023-06-22 03:09:49,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=887874.0, ans=0.1 2023-06-22 03:10:25,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=887934.0, ans=0.2 2023-06-22 03:11:41,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=888114.0, ans=10.0 2023-06-22 03:11:46,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=888174.0, ans=0.125 2023-06-22 03:11:47,528 INFO [train.py:996] (2/4) Epoch 5, batch 26050, loss[loss=0.2315, simple_loss=0.2967, pruned_loss=0.08314, over 21421.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3277, pruned_loss=0.08799, over 4271432.55 frames. 
], batch size: 211, lr: 6.00e-03, grad_scale: 32.0 2023-06-22 03:11:47,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=888174.0, ans=0.125 2023-06-22 03:11:55,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=888174.0, ans=0.0 2023-06-22 03:12:12,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=888234.0, ans=0.125 2023-06-22 03:12:35,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=888234.0, ans=0.0 2023-06-22 03:13:09,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=888354.0, ans=0.2 2023-06-22 03:13:46,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=888414.0, ans=0.2 2023-06-22 03:13:58,833 INFO [train.py:996] (2/4) Epoch 5, batch 26100, loss[loss=0.2344, simple_loss=0.2973, pruned_loss=0.08578, over 21961.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3232, pruned_loss=0.0873, over 4276053.75 frames. ], batch size: 333, lr: 6.00e-03, grad_scale: 32.0 2023-06-22 03:14:00,256 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.634e+02 2.953e+02 3.214e+02 4.969e+02, threshold=5.905e+02, percent-clipped=0.0 2023-06-22 03:14:06,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-22 03:14:24,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-22 03:14:26,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=888534.0, ans=0.125 2023-06-22 03:15:14,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=888594.0, ans=0.125 2023-06-22 03:15:21,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=888654.0, ans=0.125 2023-06-22 03:15:58,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-22 03:16:09,563 INFO [train.py:996] (2/4) Epoch 5, batch 26150, loss[loss=0.2678, simple_loss=0.3449, pruned_loss=0.09533, over 21293.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3213, pruned_loss=0.08717, over 4280988.49 frames. 
], batch size: 143, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:16:42,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=888834.0, ans=0.125 2023-06-22 03:17:30,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=888894.0, ans=0.125 2023-06-22 03:18:10,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=889014.0, ans=0.125 2023-06-22 03:18:37,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=889014.0, ans=0.2 2023-06-22 03:18:41,270 INFO [train.py:996] (2/4) Epoch 5, batch 26200, loss[loss=0.2123, simple_loss=0.3166, pruned_loss=0.05401, over 21290.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3211, pruned_loss=0.08468, over 4281699.76 frames. ], batch size: 548, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:18:48,979 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.441e+02 2.915e+02 3.551e+02 5.779e+02, threshold=5.831e+02, percent-clipped=0.0 2023-06-22 03:19:58,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=889194.0, ans=22.5 2023-06-22 03:19:58,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=22.5 2023-06-22 03:20:15,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=889254.0, ans=0.0 2023-06-22 03:20:22,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=889254.0, ans=0.025 2023-06-22 03:21:09,475 INFO [train.py:996] (2/4) Epoch 5, batch 26250, loss[loss=0.228, simple_loss=0.3137, pruned_loss=0.07112, over 21840.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3235, pruned_loss=0.08333, over 4278338.91 frames. ], batch size: 332, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:21:28,841 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:23:11,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=889614.0, ans=0.0 2023-06-22 03:23:13,120 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:23:20,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=889614.0, ans=0.0 2023-06-22 03:23:26,708 INFO [train.py:996] (2/4) Epoch 5, batch 26300, loss[loss=0.2469, simple_loss=0.3116, pruned_loss=0.09108, over 21752.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3212, pruned_loss=0.08456, over 4288109.04 frames. ], batch size: 389, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:23:37,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.568e+02 2.838e+02 3.219e+02 5.714e+02, threshold=5.676e+02, percent-clipped=0.0 2023-06-22 03:24:51,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=889854.0, ans=0.0 2023-06-22 03:25:53,849 INFO [train.py:996] (2/4) Epoch 5, batch 26350, loss[loss=0.2458, simple_loss=0.3182, pruned_loss=0.08671, over 21579.00 frames. 
], tot_loss[loss=0.2459, simple_loss=0.32, pruned_loss=0.08593, over 4292890.83 frames. ], batch size: 263, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:26:59,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=890094.0, ans=0.125 2023-06-22 03:27:09,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=890154.0, ans=0.0 2023-06-22 03:27:49,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=890214.0, ans=0.2 2023-06-22 03:27:49,269 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:27:52,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-22 03:27:53,342 INFO [train.py:996] (2/4) Epoch 5, batch 26400, loss[loss=0.2217, simple_loss=0.2775, pruned_loss=0.08292, over 21242.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3134, pruned_loss=0.08522, over 4286471.49 frames. ], batch size: 159, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:27:56,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.457e+02 2.775e+02 3.263e+02 6.072e+02, threshold=5.551e+02, percent-clipped=1.0 2023-06-22 03:28:31,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=890334.0, ans=0.125 2023-06-22 03:28:33,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=890334.0, ans=0.0 2023-06-22 03:29:28,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=890454.0, ans=0.2 2023-06-22 03:29:45,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=22.5 2023-06-22 03:30:28,627 INFO [train.py:996] (2/4) Epoch 5, batch 26450, loss[loss=0.2536, simple_loss=0.353, pruned_loss=0.07711, over 21684.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3163, pruned_loss=0.08478, over 4285489.55 frames. ], batch size: 298, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:31:09,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-22 03:31:41,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=890754.0, ans=0.125 2023-06-22 03:31:50,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-22 03:32:37,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=890814.0, ans=0.0 2023-06-22 03:32:41,196 INFO [train.py:996] (2/4) Epoch 5, batch 26500, loss[loss=0.2696, simple_loss=0.3512, pruned_loss=0.09402, over 21651.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3178, pruned_loss=0.08386, over 4278178.80 frames. 
], batch size: 441, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:32:44,224 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 2.731e+02 3.364e+02 4.078e+02 7.843e+02, threshold=6.728e+02, percent-clipped=9.0 2023-06-22 03:32:56,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=890874.0, ans=0.0 2023-06-22 03:32:56,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=890874.0, ans=0.1 2023-06-22 03:33:18,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-22 03:33:43,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=890934.0, ans=0.2 2023-06-22 03:34:14,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=891054.0, ans=0.2 2023-06-22 03:34:39,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=891054.0, ans=0.04949747468305833 2023-06-22 03:35:11,484 INFO [train.py:996] (2/4) Epoch 5, batch 26550, loss[loss=0.2017, simple_loss=0.3077, pruned_loss=0.04782, over 21104.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3132, pruned_loss=0.08079, over 4262319.43 frames. ], batch size: 548, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:36:12,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-22 03:36:32,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=891294.0, ans=0.125 2023-06-22 03:36:53,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-22 03:37:09,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=891414.0, ans=0.0 2023-06-22 03:37:16,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=22.5 2023-06-22 03:37:18,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=891414.0, ans=0.125 2023-06-22 03:37:18,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=891414.0, ans=0.2 2023-06-22 03:37:43,562 INFO [train.py:996] (2/4) Epoch 5, batch 26600, loss[loss=0.2236, simple_loss=0.2973, pruned_loss=0.07494, over 21493.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3117, pruned_loss=0.07811, over 4257827.78 frames. 
], batch size: 389, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:37:46,432 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 2.463e+02 2.685e+02 3.046e+02 4.735e+02, threshold=5.371e+02, percent-clipped=0.0 2023-06-22 03:38:27,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=891594.0, ans=0.125 2023-06-22 03:39:43,218 INFO [train.py:996] (2/4) Epoch 5, batch 26650, loss[loss=0.1967, simple_loss=0.2823, pruned_loss=0.05551, over 21591.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3042, pruned_loss=0.07649, over 4251573.39 frames. ], batch size: 442, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:40:36,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=891894.0, ans=0.125 2023-06-22 03:40:36,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=891894.0, ans=0.125 2023-06-22 03:41:38,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=892014.0, ans=0.125 2023-06-22 03:41:53,006 INFO [train.py:996] (2/4) Epoch 5, batch 26700, loss[loss=0.2243, simple_loss=0.2783, pruned_loss=0.08518, over 20005.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2982, pruned_loss=0.07445, over 4256016.22 frames. ], batch size: 702, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:42:06,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 2.008e+02 2.325e+02 2.767e+02 5.375e+02, threshold=4.650e+02, percent-clipped=1.0 2023-06-22 03:44:06,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=892314.0, ans=0.125 2023-06-22 03:44:14,042 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-22 03:44:15,904 INFO [train.py:996] (2/4) Epoch 5, batch 26750, loss[loss=0.2356, simple_loss=0.3174, pruned_loss=0.07689, over 21456.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2966, pruned_loss=0.07258, over 4263757.27 frames. ], batch size: 211, lr: 5.98e-03, grad_scale: 8.0 2023-06-22 03:44:46,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=892374.0, ans=0.05 2023-06-22 03:44:48,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=892434.0, ans=0.2 2023-06-22 03:44:52,330 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-22 03:45:00,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-22 03:45:51,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=892554.0, ans=0.0 2023-06-22 03:46:46,163 INFO [train.py:996] (2/4) Epoch 5, batch 26800, loss[loss=0.2392, simple_loss=0.3107, pruned_loss=0.08383, over 21618.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3032, pruned_loss=0.07623, over 4267969.40 frames. 
], batch size: 263, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:46:52,280 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.542e+02 2.962e+02 3.442e+02 4.612e+02, threshold=5.925e+02, percent-clipped=0.0 2023-06-22 03:47:05,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=892734.0, ans=0.125 2023-06-22 03:47:29,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=892734.0, ans=0.0 2023-06-22 03:47:55,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=892854.0, ans=0.125 2023-06-22 03:48:45,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=892914.0, ans=0.07 2023-06-22 03:48:55,005 INFO [train.py:996] (2/4) Epoch 5, batch 26850, loss[loss=0.1956, simple_loss=0.2563, pruned_loss=0.0675, over 21264.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3056, pruned_loss=0.07964, over 4262756.61 frames. ], batch size: 176, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:49:27,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=893034.0, ans=0.0 2023-06-22 03:50:00,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=893094.0, ans=0.125 2023-06-22 03:50:48,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=893214.0, ans=0.125 2023-06-22 03:51:03,006 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:51:10,814 INFO [train.py:996] (2/4) Epoch 5, batch 26900, loss[loss=0.207, simple_loss=0.2685, pruned_loss=0.07277, over 21535.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2972, pruned_loss=0.07872, over 4258678.08 frames. ], batch size: 391, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:51:16,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. limit=8.0 2023-06-22 03:51:22,791 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.626e+02 2.882e+02 3.361e+02 7.434e+02, threshold=5.764e+02, percent-clipped=1.0 2023-06-22 03:52:20,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=893394.0, ans=0.0 2023-06-22 03:53:02,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=893514.0, ans=0.125 2023-06-22 03:53:16,932 INFO [train.py:996] (2/4) Epoch 5, batch 26950, loss[loss=0.2362, simple_loss=0.3177, pruned_loss=0.07735, over 21504.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2969, pruned_loss=0.07858, over 4257415.46 frames. ], batch size: 212, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:54:01,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=893634.0, ans=0.125 2023-06-22 03:54:20,049 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. 
limit=10.0 2023-06-22 03:54:27,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=893694.0, ans=0.2 2023-06-22 03:54:32,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=893694.0, ans=15.0 2023-06-22 03:55:45,146 INFO [train.py:996] (2/4) Epoch 5, batch 27000, loss[loss=0.2393, simple_loss=0.3316, pruned_loss=0.07344, over 21572.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2976, pruned_loss=0.07678, over 4263688.46 frames. ], batch size: 442, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:55:45,146 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 03:56:32,880 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2499, simple_loss=0.3437, pruned_loss=0.07804, over 1796401.00 frames. 2023-06-22 03:56:32,881 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-22 03:56:45,963 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.322e+02 2.675e+02 3.569e+02 6.901e+02, threshold=5.350e+02, percent-clipped=2.0 2023-06-22 03:56:58,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=893934.0, ans=0.125 2023-06-22 03:57:30,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=893994.0, ans=0.125 2023-06-22 03:57:30,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-22 03:57:34,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=893994.0, ans=0.0 2023-06-22 03:58:09,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=894054.0, ans=0.125 2023-06-22 03:58:47,686 INFO [train.py:996] (2/4) Epoch 5, batch 27050, loss[loss=0.2866, simple_loss=0.3452, pruned_loss=0.114, over 21629.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2995, pruned_loss=0.07384, over 4270176.59 frames. ], batch size: 471, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:59:15,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=894234.0, ans=0.0 2023-06-22 03:59:50,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=894294.0, ans=0.125 2023-06-22 04:00:15,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-22 04:00:32,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=894414.0, ans=0.0 2023-06-22 04:00:52,684 INFO [train.py:996] (2/4) Epoch 5, batch 27100, loss[loss=0.237, simple_loss=0.3086, pruned_loss=0.08273, over 21842.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3021, pruned_loss=0.07423, over 4272782.27 frames. 
], batch size: 124, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 04:00:58,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 2.373e+02 2.647e+02 3.210e+02 5.471e+02, threshold=5.294e+02, percent-clipped=1.0 2023-06-22 04:01:03,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-22 04:01:24,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=894534.0, ans=0.125 2023-06-22 04:02:02,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=894594.0, ans=0.07 2023-06-22 04:03:12,692 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-22 04:03:16,616 INFO [train.py:996] (2/4) Epoch 5, batch 27150, loss[loss=0.2048, simple_loss=0.3195, pruned_loss=0.04502, over 20092.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3106, pruned_loss=0.07659, over 4274763.39 frames. ], batch size: 703, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 04:03:17,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-22 04:03:26,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=894774.0, ans=0.125 2023-06-22 04:03:27,534 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:04:27,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=894894.0, ans=0.2 2023-06-22 04:04:32,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=894894.0, ans=0.1 2023-06-22 04:04:46,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=894954.0, ans=0.0 2023-06-22 04:05:41,577 INFO [train.py:996] (2/4) Epoch 5, batch 27200, loss[loss=0.247, simple_loss=0.3194, pruned_loss=0.08729, over 21440.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3188, pruned_loss=0.07965, over 4276369.41 frames. ], batch size: 211, lr: 5.98e-03, grad_scale: 32.0 2023-06-22 04:05:47,605 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.848e+02 3.372e+02 3.981e+02 6.685e+02, threshold=6.744e+02, percent-clipped=3.0 2023-06-22 04:07:28,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-06-22 04:08:00,934 INFO [train.py:996] (2/4) Epoch 5, batch 27250, loss[loss=0.2638, simple_loss=0.3306, pruned_loss=0.09855, over 21357.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3228, pruned_loss=0.08363, over 4281576.64 frames. ], batch size: 143, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:08:32,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=895434.0, ans=0.0 2023-06-22 04:09:20,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. 
limit=15.0 2023-06-22 04:09:41,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-22 04:10:32,066 INFO [train.py:996] (2/4) Epoch 5, batch 27300, loss[loss=0.2442, simple_loss=0.3268, pruned_loss=0.08078, over 21750.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3254, pruned_loss=0.08496, over 4280488.44 frames. ], batch size: 247, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:10:52,687 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.750e+02 3.166e+02 3.741e+02 6.824e+02, threshold=6.331e+02, percent-clipped=1.0 2023-06-22 04:11:21,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=895734.0, ans=0.09899494936611666 2023-06-22 04:11:28,724 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:12:43,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=895914.0, ans=0.07 2023-06-22 04:13:20,061 INFO [train.py:996] (2/4) Epoch 5, batch 27350, loss[loss=0.2164, simple_loss=0.2993, pruned_loss=0.06676, over 21400.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3281, pruned_loss=0.08607, over 4282517.74 frames. ], batch size: 194, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:15:01,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.03 vs. limit=10.0 2023-06-22 04:15:27,416 INFO [train.py:996] (2/4) Epoch 5, batch 27400, loss[loss=0.2335, simple_loss=0.295, pruned_loss=0.08599, over 21756.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3221, pruned_loss=0.08501, over 4285869.92 frames. ], batch size: 351, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:15:45,253 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.543e+02 2.846e+02 3.223e+02 4.913e+02, threshold=5.692e+02, percent-clipped=0.0 2023-06-22 04:16:08,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-06-22 04:16:14,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-22 04:16:17,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-22 04:17:23,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=896514.0, ans=0.125 2023-06-22 04:17:48,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=896574.0, ans=0.125 2023-06-22 04:17:49,325 INFO [train.py:996] (2/4) Epoch 5, batch 27450, loss[loss=0.2603, simple_loss=0.3431, pruned_loss=0.08879, over 21660.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3158, pruned_loss=0.08318, over 4285731.47 frames. 
], batch size: 414, lr: 5.97e-03, grad_scale: 16.0 2023-06-22 04:17:54,151 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:19:40,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=896814.0, ans=0.0 2023-06-22 04:20:11,228 INFO [train.py:996] (2/4) Epoch 5, batch 27500, loss[loss=0.2156, simple_loss=0.293, pruned_loss=0.06906, over 21620.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3144, pruned_loss=0.08385, over 4288638.43 frames. ], batch size: 231, lr: 5.97e-03, grad_scale: 16.0 2023-06-22 04:20:18,388 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.586e+02 3.048e+02 3.349e+02 5.042e+02, threshold=6.096e+02, percent-clipped=0.0 2023-06-22 04:21:06,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=896994.0, ans=0.125 2023-06-22 04:22:18,501 INFO [train.py:996] (2/4) Epoch 5, batch 27550, loss[loss=0.1926, simple_loss=0.2631, pruned_loss=0.06106, over 21721.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.308, pruned_loss=0.0798, over 4285778.73 frames. ], batch size: 124, lr: 5.97e-03, grad_scale: 16.0 2023-06-22 04:22:27,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=22.5 2023-06-22 04:23:25,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.95 vs. limit=15.0 2023-06-22 04:24:12,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=22.5 2023-06-22 04:24:27,636 INFO [train.py:996] (2/4) Epoch 5, batch 27600, loss[loss=0.211, simple_loss=0.2736, pruned_loss=0.07421, over 21181.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3008, pruned_loss=0.07837, over 4285681.96 frames. ], batch size: 144, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:24:28,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-22 04:24:45,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-22 04:24:46,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.414e+02 2.661e+02 3.209e+02 4.551e+02, threshold=5.321e+02, percent-clipped=0.0 2023-06-22 04:25:09,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=897534.0, ans=0.2 2023-06-22 04:25:25,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=897534.0, ans=0.5 2023-06-22 04:26:07,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-22 04:26:34,748 INFO [train.py:996] (2/4) Epoch 5, batch 27650, loss[loss=0.2305, simple_loss=0.3073, pruned_loss=0.0769, over 21637.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2958, pruned_loss=0.0779, over 4283349.56 frames. 
], batch size: 389, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:27:03,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=897834.0, ans=0.125 2023-06-22 04:28:26,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=898014.0, ans=0.0 2023-06-22 04:28:50,431 INFO [train.py:996] (2/4) Epoch 5, batch 27700, loss[loss=0.2072, simple_loss=0.2841, pruned_loss=0.06516, over 16635.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2959, pruned_loss=0.07644, over 4281802.09 frames. ], batch size: 62, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:29:15,742 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.480e+02 2.768e+02 3.290e+02 5.245e+02, threshold=5.535e+02, percent-clipped=0.0 2023-06-22 04:29:54,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=898134.0, ans=0.125 2023-06-22 04:30:00,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=898194.0, ans=0.5 2023-06-22 04:30:08,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=898194.0, ans=0.0 2023-06-22 04:30:21,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=898254.0, ans=0.2 2023-06-22 04:30:24,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=898254.0, ans=0.1 2023-06-22 04:30:39,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=898314.0, ans=0.0 2023-06-22 04:31:16,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=898374.0, ans=0.0 2023-06-22 04:31:17,208 INFO [train.py:996] (2/4) Epoch 5, batch 27750, loss[loss=0.2076, simple_loss=0.2889, pruned_loss=0.06315, over 21911.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.299, pruned_loss=0.07567, over 4286896.76 frames. ], batch size: 316, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:31:17,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=898374.0, ans=10.0 2023-06-22 04:31:17,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=898374.0, ans=0.125 2023-06-22 04:31:36,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=898374.0, ans=0.125 2023-06-22 04:31:41,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.71 vs. 
limit=6.0 2023-06-22 04:31:49,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=898434.0, ans=0.2 2023-06-22 04:32:01,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=898434.0, ans=0.0 2023-06-22 04:32:46,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=898554.0, ans=0.125 2023-06-22 04:33:33,551 INFO [train.py:996] (2/4) Epoch 5, batch 27800, loss[loss=0.2289, simple_loss=0.3007, pruned_loss=0.0786, over 21856.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2979, pruned_loss=0.07636, over 4289239.89 frames. ], batch size: 332, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:33:40,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=898674.0, ans=0.1 2023-06-22 04:33:41,766 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.543e+02 3.042e+02 3.729e+02 6.528e+02, threshold=6.084e+02, percent-clipped=2.0 2023-06-22 04:34:51,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=898794.0, ans=0.1 2023-06-22 04:35:28,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=898914.0, ans=0.125 2023-06-22 04:35:43,856 INFO [train.py:996] (2/4) Epoch 5, batch 27850, loss[loss=0.2378, simple_loss=0.3228, pruned_loss=0.07638, over 21788.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2972, pruned_loss=0.07779, over 4290702.94 frames. ], batch size: 298, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:36:33,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=899034.0, ans=0.0 2023-06-22 04:38:21,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=899214.0, ans=0.04949747468305833 2023-06-22 04:38:27,002 INFO [train.py:996] (2/4) Epoch 5, batch 27900, loss[loss=0.3161, simple_loss=0.3942, pruned_loss=0.119, over 21482.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3077, pruned_loss=0.07953, over 4286544.59 frames. ], batch size: 471, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:38:46,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.475e+02 2.737e+02 3.192e+02 5.724e+02, threshold=5.474e+02, percent-clipped=0.0 2023-06-22 04:40:51,508 INFO [train.py:996] (2/4) Epoch 5, batch 27950, loss[loss=0.2187, simple_loss=0.3018, pruned_loss=0.06778, over 21643.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3072, pruned_loss=0.07593, over 4284200.95 frames. ], batch size: 263, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:40:58,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=899574.0, ans=0.125 2023-06-22 04:42:43,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=899754.0, ans=0.125 2023-06-22 04:43:07,333 INFO [train.py:996] (2/4) Epoch 5, batch 28000, loss[loss=0.2144, simple_loss=0.2795, pruned_loss=0.07467, over 21714.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3057, pruned_loss=0.0743, over 4289847.57 frames. 
], batch size: 230, lr: 5.96e-03, grad_scale: 32.0 2023-06-22 04:43:30,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=899874.0, ans=0.0 2023-06-22 04:43:31,379 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.728e+02 2.277e+02 2.720e+02 3.265e+02 5.503e+02, threshold=5.441e+02, percent-clipped=1.0 2023-06-22 04:43:41,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=899934.0, ans=0.1 2023-06-22 04:44:02,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-22 04:44:22,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=899994.0, ans=0.125 2023-06-22 04:45:16,187 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:45:23,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=900114.0, ans=0.125 2023-06-22 04:45:36,024 INFO [train.py:996] (2/4) Epoch 5, batch 28050, loss[loss=0.2078, simple_loss=0.2672, pruned_loss=0.07417, over 21770.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3036, pruned_loss=0.07592, over 4292984.21 frames. ], batch size: 118, lr: 5.96e-03, grad_scale: 32.0 2023-06-22 04:45:52,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=900174.0, ans=0.125 2023-06-22 04:47:09,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=900354.0, ans=0.125 2023-06-22 04:47:57,166 INFO [train.py:996] (2/4) Epoch 5, batch 28100, loss[loss=0.2063, simple_loss=0.2693, pruned_loss=0.07165, over 21643.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3028, pruned_loss=0.0766, over 4293385.32 frames. ], batch size: 282, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:48:04,172 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=22.5 2023-06-22 04:48:16,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.684e+02 3.220e+02 3.866e+02 7.727e+02, threshold=6.440e+02, percent-clipped=6.0 2023-06-22 04:48:52,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=900534.0, ans=0.05 2023-06-22 04:48:54,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=900534.0, ans=0.125 2023-06-22 04:48:57,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-22 04:49:04,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=900594.0, ans=0.125 2023-06-22 04:49:32,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=900654.0, ans=0.125 2023-06-22 04:50:04,223 INFO [train.py:996] (2/4) Epoch 5, batch 28150, loss[loss=0.1942, simple_loss=0.2572, pruned_loss=0.06565, over 21498.00 frames. 
], tot_loss[loss=0.2259, simple_loss=0.2983, pruned_loss=0.07678, over 4284374.57 frames. ], batch size: 212, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:51:20,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=900894.0, ans=0.0 2023-06-22 04:52:06,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=901014.0, ans=0.125 2023-06-22 04:52:25,420 INFO [train.py:996] (2/4) Epoch 5, batch 28200, loss[loss=0.2343, simple_loss=0.3005, pruned_loss=0.08407, over 21900.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2974, pruned_loss=0.07811, over 4281588.74 frames. ], batch size: 372, lr: 5.96e-03, grad_scale: 16.0 2023-06-22 04:52:35,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.760e+02 3.168e+02 3.960e+02 6.976e+02, threshold=6.335e+02, percent-clipped=2.0 2023-06-22 04:52:38,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2023-06-22 04:52:46,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=901134.0, ans=0.2 2023-06-22 04:53:12,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=901194.0, ans=0.125 2023-06-22 04:53:39,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=901194.0, ans=0.125 2023-06-22 04:53:47,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=901254.0, ans=0.125 2023-06-22 04:54:10,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=901254.0, ans=0.1 2023-06-22 04:54:34,810 INFO [train.py:996] (2/4) Epoch 5, batch 28250, loss[loss=0.2048, simple_loss=0.2722, pruned_loss=0.06872, over 21811.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3005, pruned_loss=0.0813, over 4285030.92 frames. ], batch size: 118, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 04:55:28,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. limit=6.0 2023-06-22 04:55:36,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=901434.0, ans=0.05 2023-06-22 04:56:25,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=901554.0, ans=0.125 2023-06-22 04:56:53,970 INFO [train.py:996] (2/4) Epoch 5, batch 28300, loss[loss=0.1659, simple_loss=0.2426, pruned_loss=0.04454, over 21215.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2984, pruned_loss=0.07836, over 4270726.81 frames. ], batch size: 176, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 04:57:26,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.516e+02 2.928e+02 3.510e+02 5.631e+02, threshold=5.856e+02, percent-clipped=0.0 2023-06-22 04:57:37,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. 
limit=12.0 2023-06-22 04:58:20,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-22 04:58:46,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=901854.0, ans=0.2 2023-06-22 04:58:59,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=901914.0, ans=0.125 2023-06-22 04:59:26,094 INFO [train.py:996] (2/4) Epoch 5, batch 28350, loss[loss=0.1908, simple_loss=0.2526, pruned_loss=0.06445, over 21844.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2933, pruned_loss=0.0723, over 4272843.86 frames. ], batch size: 98, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 04:59:31,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-22 05:00:08,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=902034.0, ans=0.1 2023-06-22 05:01:03,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=902214.0, ans=0.125 2023-06-22 05:01:04,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=902214.0, ans=0.2 2023-06-22 05:01:31,921 INFO [train.py:996] (2/4) Epoch 5, batch 28400, loss[loss=0.2268, simple_loss=0.2779, pruned_loss=0.0878, over 21830.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2899, pruned_loss=0.07342, over 4267937.64 frames. ], batch size: 98, lr: 5.95e-03, grad_scale: 32.0 2023-06-22 05:02:02,165 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.345e+02 2.616e+02 3.191e+02 6.472e+02, threshold=5.233e+02, percent-clipped=2.0 2023-06-22 05:02:23,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-06-22 05:03:13,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=902454.0, ans=0.125 2023-06-22 05:03:50,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=902514.0, ans=0.2 2023-06-22 05:03:50,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=902514.0, ans=0.2 2023-06-22 05:03:51,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=902514.0, ans=0.0 2023-06-22 05:03:53,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=22.5 2023-06-22 05:03:56,947 INFO [train.py:996] (2/4) Epoch 5, batch 28450, loss[loss=0.2351, simple_loss=0.3045, pruned_loss=0.08291, over 21868.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2961, pruned_loss=0.07713, over 4260326.22 frames. 
], batch size: 351, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:04:13,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=902574.0, ans=0.125 2023-06-22 05:05:00,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=902694.0, ans=0.125 2023-06-22 05:05:19,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=902754.0, ans=0.1 2023-06-22 05:05:20,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=902754.0, ans=0.125 2023-06-22 05:06:04,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=902814.0, ans=0.05 2023-06-22 05:06:15,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=15.0 2023-06-22 05:06:25,820 INFO [train.py:996] (2/4) Epoch 5, batch 28500, loss[loss=0.2345, simple_loss=0.3043, pruned_loss=0.08237, over 21938.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2989, pruned_loss=0.07955, over 4272061.89 frames. ], batch size: 316, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:06:38,318 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.555e+02 2.878e+02 3.269e+02 4.287e+02, threshold=5.756e+02, percent-clipped=0.0 2023-06-22 05:07:10,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=902994.0, ans=0.09899494936611666 2023-06-22 05:07:20,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=902994.0, ans=0.0 2023-06-22 05:07:54,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=903054.0, ans=0.125 2023-06-22 05:07:58,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=903054.0, ans=0.125 2023-06-22 05:08:11,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=903114.0, ans=0.0 2023-06-22 05:08:46,889 INFO [train.py:996] (2/4) Epoch 5, batch 28550, loss[loss=0.2549, simple_loss=0.3423, pruned_loss=0.08372, over 21566.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3079, pruned_loss=0.08227, over 4275179.40 frames. ], batch size: 230, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:08:59,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-22 05:09:02,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-22 05:09:02,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=22.5 2023-06-22 05:09:51,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. 
limit=15.0 2023-06-22 05:11:03,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=903414.0, ans=0.125 2023-06-22 05:11:12,579 INFO [train.py:996] (2/4) Epoch 5, batch 28600, loss[loss=0.2642, simple_loss=0.3352, pruned_loss=0.09661, over 21368.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3148, pruned_loss=0.08435, over 4274729.09 frames. ], batch size: 549, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:11:24,651 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.734e+02 3.060e+02 3.584e+02 6.352e+02, threshold=6.121e+02, percent-clipped=1.0 2023-06-22 05:11:25,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0 2023-06-22 05:11:26,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=903534.0, ans=0.125 2023-06-22 05:13:22,528 INFO [train.py:996] (2/4) Epoch 5, batch 28650, loss[loss=0.2091, simple_loss=0.2658, pruned_loss=0.0762, over 21580.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3082, pruned_loss=0.08318, over 4275394.94 frames. ], batch size: 415, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:14:17,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.82 vs. limit=15.0 2023-06-22 05:15:43,992 INFO [train.py:996] (2/4) Epoch 5, batch 28700, loss[loss=0.2279, simple_loss=0.2983, pruned_loss=0.07869, over 21658.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3064, pruned_loss=0.0837, over 4276280.15 frames. ], batch size: 263, lr: 5.95e-03, grad_scale: 16.0 2023-06-22 05:16:00,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=904074.0, ans=0.125 2023-06-22 05:16:01,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.516e+02 2.747e+02 3.102e+02 4.785e+02, threshold=5.493e+02, percent-clipped=0.0 2023-06-22 05:16:06,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-22 05:16:18,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-22 05:18:03,004 INFO [train.py:996] (2/4) Epoch 5, batch 28750, loss[loss=0.2215, simple_loss=0.3119, pruned_loss=0.0656, over 21850.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3065, pruned_loss=0.08414, over 4286312.36 frames. ], batch size: 351, lr: 5.94e-03, grad_scale: 16.0 2023-06-22 05:19:20,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=904494.0, ans=0.2 2023-06-22 05:20:32,365 INFO [train.py:996] (2/4) Epoch 5, batch 28800, loss[loss=0.261, simple_loss=0.3333, pruned_loss=0.09437, over 21758.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3108, pruned_loss=0.08469, over 4283989.64 frames. 
], batch size: 332, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:20:50,084 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.736e+02 3.061e+02 3.498e+02 5.651e+02, threshold=6.121e+02, percent-clipped=1.0 2023-06-22 05:21:21,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=904734.0, ans=0.2 2023-06-22 05:21:55,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=904854.0, ans=0.1 2023-06-22 05:22:18,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0 2023-06-22 05:23:02,430 INFO [train.py:996] (2/4) Epoch 5, batch 28850, loss[loss=0.1995, simple_loss=0.2585, pruned_loss=0.07022, over 20213.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.312, pruned_loss=0.08567, over 4284219.88 frames. ], batch size: 702, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:23:25,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=905034.0, ans=0.1 2023-06-22 05:25:27,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=905214.0, ans=0.125 2023-06-22 05:25:35,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=905214.0, ans=0.125 2023-06-22 05:25:42,883 INFO [train.py:996] (2/4) Epoch 5, batch 28900, loss[loss=0.2419, simple_loss=0.3185, pruned_loss=0.08264, over 21437.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3149, pruned_loss=0.08749, over 4283245.72 frames. ], batch size: 131, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:25:55,352 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.574e+02 2.990e+02 3.468e+02 6.193e+02, threshold=5.980e+02, percent-clipped=0.0 2023-06-22 05:26:29,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=905334.0, ans=0.0 2023-06-22 05:26:52,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.05 vs. limit=15.0 2023-06-22 05:26:59,491 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-06-22 05:27:30,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=905454.0, ans=0.0 2023-06-22 05:27:43,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=905514.0, ans=0.2 2023-06-22 05:27:53,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=905574.0, ans=0.2 2023-06-22 05:27:54,369 INFO [train.py:996] (2/4) Epoch 5, batch 28950, loss[loss=0.2336, simple_loss=0.347, pruned_loss=0.06013, over 20762.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3177, pruned_loss=0.08723, over 4274511.39 frames. 
], batch size: 607, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:28:09,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=905574.0, ans=0.0 2023-06-22 05:29:02,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=905694.0, ans=0.1 2023-06-22 05:29:05,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=905694.0, ans=0.1 2023-06-22 05:29:45,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=905754.0, ans=0.1 2023-06-22 05:29:52,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=905814.0, ans=0.125 2023-06-22 05:30:19,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=905814.0, ans=0.125 2023-06-22 05:30:22,095 INFO [train.py:996] (2/4) Epoch 5, batch 29000, loss[loss=0.2535, simple_loss=0.3305, pruned_loss=0.08821, over 21389.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3211, pruned_loss=0.08628, over 4269843.93 frames. ], batch size: 548, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:30:46,531 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.550e+02 2.882e+02 3.345e+02 5.949e+02, threshold=5.765e+02, percent-clipped=1.0 2023-06-22 05:30:46,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=905874.0, ans=0.2 2023-06-22 05:30:55,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=905934.0, ans=0.125 2023-06-22 05:31:31,603 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=12.0 2023-06-22 05:31:46,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=15.0 2023-06-22 05:32:44,729 INFO [train.py:996] (2/4) Epoch 5, batch 29050, loss[loss=0.2483, simple_loss=0.2938, pruned_loss=0.1014, over 20235.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3192, pruned_loss=0.08734, over 4278619.92 frames. ], batch size: 707, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:33:37,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.02 vs. limit=6.0 2023-06-22 05:33:38,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. 
limit=15.0 2023-06-22 05:33:45,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=906294.0, ans=0.125 2023-06-22 05:34:16,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=906354.0, ans=0.125 2023-06-22 05:34:45,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=906414.0, ans=0.0 2023-06-22 05:34:50,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.93 vs. limit=15.0 2023-06-22 05:34:59,592 INFO [train.py:996] (2/4) Epoch 5, batch 29100, loss[loss=0.2246, simple_loss=0.2799, pruned_loss=0.08472, over 21536.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3098, pruned_loss=0.08425, over 4274257.19 frames. ], batch size: 441, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:35:37,014 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.541e+02 2.908e+02 3.289e+02 5.672e+02, threshold=5.815e+02, percent-clipped=0.0 2023-06-22 05:35:57,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=906534.0, ans=0.0 2023-06-22 05:36:23,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=906594.0, ans=0.04949747468305833 2023-06-22 05:36:30,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=22.5 2023-06-22 05:36:32,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=906654.0, ans=0.0 2023-06-22 05:37:19,653 INFO [train.py:996] (2/4) Epoch 5, batch 29150, loss[loss=0.2382, simple_loss=0.3157, pruned_loss=0.08034, over 21769.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3084, pruned_loss=0.08275, over 4271745.80 frames. ], batch size: 316, lr: 5.94e-03, grad_scale: 16.0 2023-06-22 05:37:26,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=8.0 2023-06-22 05:37:44,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=906774.0, ans=0.0 2023-06-22 05:39:14,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=907014.0, ans=0.125 2023-06-22 05:39:23,824 INFO [train.py:996] (2/4) Epoch 5, batch 29200, loss[loss=0.1953, simple_loss=0.2692, pruned_loss=0.06069, over 21858.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3043, pruned_loss=0.08202, over 4263019.29 frames. 
], batch size: 125, lr: 5.94e-03, grad_scale: 32.0 2023-06-22 05:40:04,047 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.662e+02 3.109e+02 4.025e+02 6.896e+02, threshold=6.219e+02, percent-clipped=4.0 2023-06-22 05:40:30,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=907134.0, ans=0.125 2023-06-22 05:40:51,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=907254.0, ans=0.125 2023-06-22 05:40:55,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=907254.0, ans=0.125 2023-06-22 05:41:38,140 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-22 05:41:47,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=907314.0, ans=0.125 2023-06-22 05:41:49,350 INFO [train.py:996] (2/4) Epoch 5, batch 29250, loss[loss=0.2471, simple_loss=0.3251, pruned_loss=0.08458, over 21406.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3021, pruned_loss=0.07939, over 4267849.29 frames. ], batch size: 471, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:42:24,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=907434.0, ans=0.2 2023-06-22 05:42:33,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=907434.0, ans=0.1 2023-06-22 05:43:12,355 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:43:29,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=907614.0, ans=10.0 2023-06-22 05:43:29,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-22 05:44:06,203 INFO [train.py:996] (2/4) Epoch 5, batch 29300, loss[loss=0.2235, simple_loss=0.2921, pruned_loss=0.07746, over 21791.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3041, pruned_loss=0.07876, over 4268422.94 frames. ], batch size: 351, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:44:37,883 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.436e+02 2.713e+02 3.173e+02 4.858e+02, threshold=5.427e+02, percent-clipped=0.0 2023-06-22 05:44:49,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=907734.0, ans=0.125 2023-06-22 05:45:11,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=907854.0, ans=0.0 2023-06-22 05:46:10,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=907914.0, ans=0.0 2023-06-22 05:46:14,526 INFO [train.py:996] (2/4) Epoch 5, batch 29350, loss[loss=0.1988, simple_loss=0.2643, pruned_loss=0.06663, over 20045.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3, pruned_loss=0.07745, over 4268776.23 frames. 
], batch size: 702, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:46:18,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=907974.0, ans=0.0 2023-06-22 05:47:46,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=908154.0, ans=0.05 2023-06-22 05:48:40,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=908214.0, ans=0.04949747468305833 2023-06-22 05:48:40,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=908214.0, ans=0.125 2023-06-22 05:48:48,335 INFO [train.py:996] (2/4) Epoch 5, batch 29400, loss[loss=0.2546, simple_loss=0.3421, pruned_loss=0.08355, over 21679.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3004, pruned_loss=0.07582, over 4266197.94 frames. ], batch size: 415, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:49:08,059 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.428e+02 2.742e+02 3.233e+02 5.582e+02, threshold=5.484e+02, percent-clipped=1.0 2023-06-22 05:51:06,630 INFO [train.py:996] (2/4) Epoch 5, batch 29450, loss[loss=0.2191, simple_loss=0.3048, pruned_loss=0.06666, over 21480.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2987, pruned_loss=0.07512, over 4264102.33 frames. ], batch size: 131, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:51:42,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-22 05:51:44,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-22 05:52:10,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=908694.0, ans=0.2 2023-06-22 05:52:44,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=908754.0, ans=0.125 2023-06-22 05:52:45,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=908754.0, ans=0.0 2023-06-22 05:53:20,893 INFO [train.py:996] (2/4) Epoch 5, batch 29500, loss[loss=0.2075, simple_loss=0.2724, pruned_loss=0.07131, over 21586.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.303, pruned_loss=0.07873, over 4272642.18 frames. ], batch size: 212, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:53:27,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=908874.0, ans=0.0 2023-06-22 05:53:33,969 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.755e+02 3.048e+02 3.720e+02 7.452e+02, threshold=6.096e+02, percent-clipped=3.0 2023-06-22 05:54:02,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=908934.0, ans=0.0 2023-06-22 05:55:39,407 INFO [train.py:996] (2/4) Epoch 5, batch 29550, loss[loss=0.1945, simple_loss=0.2546, pruned_loss=0.0672, over 21285.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3021, pruned_loss=0.08019, over 4281666.07 frames. 
], batch size: 608, lr: 5.93e-03, grad_scale: 16.0 2023-06-22 05:56:17,398 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.89 vs. limit=15.0 2023-06-22 05:56:19,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=909234.0, ans=0.125 2023-06-22 05:56:23,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-22 05:56:36,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=909294.0, ans=0.1 2023-06-22 05:58:01,446 INFO [train.py:996] (2/4) Epoch 5, batch 29600, loss[loss=0.3599, simple_loss=0.4223, pruned_loss=0.1488, over 21557.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3097, pruned_loss=0.08282, over 4290433.89 frames. ], batch size: 508, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 05:58:29,145 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.694e+02 3.024e+02 3.458e+02 5.010e+02, threshold=6.047e+02, percent-clipped=0.0 2023-06-22 05:59:06,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=909594.0, ans=0.1 2023-06-22 05:59:16,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=909594.0, ans=10.0 2023-06-22 05:59:23,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=909594.0, ans=0.0 2023-06-22 05:59:43,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=909654.0, ans=0.0 2023-06-22 06:00:01,726 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:00:16,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=909714.0, ans=0.2 2023-06-22 06:00:26,809 INFO [train.py:996] (2/4) Epoch 5, batch 29650, loss[loss=0.2041, simple_loss=0.2719, pruned_loss=0.06809, over 21797.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3067, pruned_loss=0.0793, over 4283006.46 frames. ], batch size: 247, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 06:00:35,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=909774.0, ans=0.0 2023-06-22 06:00:54,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=909834.0, ans=0.125 2023-06-22 06:01:23,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=909894.0, ans=0.1 2023-06-22 06:01:57,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=909954.0, ans=0.2 2023-06-22 06:02:32,779 INFO [train.py:996] (2/4) Epoch 5, batch 29700, loss[loss=0.2417, simple_loss=0.3408, pruned_loss=0.07133, over 21435.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3087, pruned_loss=0.07938, over 4284150.86 frames. 
], batch size: 194, lr: 5.93e-03, grad_scale: 32.0 2023-06-22 06:02:33,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=910074.0, ans=0.0 2023-06-22 06:03:02,112 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.342e+02 2.589e+02 3.104e+02 5.027e+02, threshold=5.177e+02, percent-clipped=0.0 2023-06-22 06:03:07,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=910134.0, ans=0.2 2023-06-22 06:04:25,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-06-22 06:04:28,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=910314.0, ans=0.125 2023-06-22 06:04:36,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=910314.0, ans=0.125 2023-06-22 06:04:48,302 INFO [train.py:996] (2/4) Epoch 5, batch 29750, loss[loss=0.2256, simple_loss=0.316, pruned_loss=0.0676, over 21422.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3121, pruned_loss=0.07857, over 4279743.34 frames. ], batch size: 194, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:05:14,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=910434.0, ans=0.125 2023-06-22 06:05:15,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=910434.0, ans=0.125 2023-06-22 06:05:15,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=910434.0, ans=0.0 2023-06-22 06:05:21,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=910434.0, ans=0.2 2023-06-22 06:05:53,204 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:06:48,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=910614.0, ans=0.0 2023-06-22 06:07:10,295 INFO [train.py:996] (2/4) Epoch 5, batch 29800, loss[loss=0.2173, simple_loss=0.2997, pruned_loss=0.06743, over 21654.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3129, pruned_loss=0.07941, over 4285416.38 frames. ], batch size: 230, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:07:12,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.97 vs. limit=15.0 2023-06-22 06:07:40,155 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.449e+02 2.699e+02 3.078e+02 3.997e+02, threshold=5.399e+02, percent-clipped=0.0 2023-06-22 06:07:47,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=910734.0, ans=0.125 2023-06-22 06:08:59,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-22 06:09:21,739 INFO [train.py:996] (2/4) Epoch 5, batch 29850, loss[loss=0.2251, simple_loss=0.3028, pruned_loss=0.07374, over 21516.00 frames. 
], tot_loss[loss=0.2333, simple_loss=0.3098, pruned_loss=0.07841, over 4290919.27 frames. ], batch size: 131, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:09:42,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=910974.0, ans=0.1 2023-06-22 06:10:03,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-22 06:10:04,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=911034.0, ans=0.07 2023-06-22 06:10:08,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-06-22 06:10:14,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=911034.0, ans=0.0 2023-06-22 06:10:22,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=911094.0, ans=0.125 2023-06-22 06:10:34,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-22 06:11:46,337 INFO [train.py:996] (2/4) Epoch 5, batch 29900, loss[loss=0.2414, simple_loss=0.3137, pruned_loss=0.08451, over 17254.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3084, pruned_loss=0.0797, over 4290812.80 frames. ], batch size: 60, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:12:18,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.539e+02 3.015e+02 3.692e+02 5.996e+02, threshold=6.029e+02, percent-clipped=2.0 2023-06-22 06:12:18,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=911334.0, ans=0.0 2023-06-22 06:14:06,423 INFO [train.py:996] (2/4) Epoch 5, batch 29950, loss[loss=0.2852, simple_loss=0.3567, pruned_loss=0.1068, over 21826.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3115, pruned_loss=0.0829, over 4290623.75 frames. ], batch size: 124, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:16:11,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=911814.0, ans=0.0 2023-06-22 06:16:12,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.54 vs. limit=15.0 2023-06-22 06:16:33,822 INFO [train.py:996] (2/4) Epoch 5, batch 30000, loss[loss=0.1804, simple_loss=0.2425, pruned_loss=0.05912, over 16432.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3132, pruned_loss=0.08287, over 4281679.76 frames. ], batch size: 61, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:16:33,822 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 06:17:11,593 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.7977, 2.3508, 4.0888, 2.3695], device='cuda:2') 2023-06-22 06:17:18,838 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2496, simple_loss=0.3465, pruned_loss=0.07629, over 1796401.00 frames. 
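Note on the loss figures reported in the entries above: each batch and validation entry lists loss, simple_loss and pruned_loss per frame, and throughout this section the combined figure is consistent with loss = 0.5 * simple_loss + pruned_loss, i.e. the simple (linear-combiner) transducer loss weighted by 0.5 plus the full pruned transducer loss. A minimal sketch, assuming that linear combination (the scale values are inferred from the numbers in the log, not quoted from the recipe):

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5,
                  pruned_loss_scale: float = 1.0) -> float:
    # Per-frame loss as it appears in the tot_loss[...] entries (assumed combination).
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

# Epoch 5, batch 28550 above: 0.5 * 0.3079 + 1.0 * 0.08227 = 0.2362 (reported 0.2362).
assert abs(combined_loss(0.3079, 0.08227) - 0.2362) < 1e-3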
2023-06-22 06:17:18,839 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-22 06:17:45,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=911934.0, ans=0.0 2023-06-22 06:17:54,407 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.727e+02 3.282e+02 4.075e+02 7.165e+02, threshold=6.565e+02, percent-clipped=2.0 2023-06-22 06:17:54,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=911934.0, ans=0.0 2023-06-22 06:18:49,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=912054.0, ans=0.125 2023-06-22 06:20:02,387 INFO [train.py:996] (2/4) Epoch 5, batch 30050, loss[loss=0.2423, simple_loss=0.3367, pruned_loss=0.07395, over 21607.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3167, pruned_loss=0.07963, over 4277895.59 frames. ], batch size: 263, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:21:28,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=912354.0, ans=0.125 2023-06-22 06:22:03,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=912474.0, ans=0.07 2023-06-22 06:22:04,889 INFO [train.py:996] (2/4) Epoch 5, batch 30100, loss[loss=0.2192, simple_loss=0.2787, pruned_loss=0.0799, over 21553.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.315, pruned_loss=0.07884, over 4270852.51 frames. ], batch size: 247, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:22:11,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=912474.0, ans=0.0 2023-06-22 06:22:21,792 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.782e+02 3.150e+02 3.648e+02 6.507e+02, threshold=6.300e+02, percent-clipped=0.0 2023-06-22 06:23:55,597 INFO [train.py:996] (2/4) Epoch 5, batch 30150, loss[loss=0.2521, simple_loss=0.323, pruned_loss=0.09061, over 21692.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3109, pruned_loss=0.08074, over 4264114.25 frames. ], batch size: 332, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:25:08,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=912894.0, ans=15.0 2023-06-22 06:26:30,434 INFO [train.py:996] (2/4) Epoch 5, batch 30200, loss[loss=0.2427, simple_loss=0.3234, pruned_loss=0.081, over 21780.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3129, pruned_loss=0.08008, over 4265693.63 frames. 
], batch size: 247, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:26:59,506 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.559e+02 2.949e+02 3.535e+02 6.486e+02, threshold=5.897e+02, percent-clipped=1.0 2023-06-22 06:27:45,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=913194.0, ans=0.125 2023-06-22 06:28:09,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=913254.0, ans=0.125 2023-06-22 06:28:15,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=913254.0, ans=0.09899494936611666 2023-06-22 06:28:25,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=913314.0, ans=0.125 2023-06-22 06:28:49,076 INFO [train.py:996] (2/4) Epoch 5, batch 30250, loss[loss=0.2545, simple_loss=0.3539, pruned_loss=0.07754, over 21462.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3218, pruned_loss=0.08267, over 4269728.93 frames. ], batch size: 211, lr: 5.92e-03, grad_scale: 32.0 2023-06-22 06:31:15,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=913614.0, ans=0.125 2023-06-22 06:31:21,057 INFO [train.py:996] (2/4) Epoch 5, batch 30300, loss[loss=0.19, simple_loss=0.2591, pruned_loss=0.06051, over 21350.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3186, pruned_loss=0.08203, over 4271525.92 frames. ], batch size: 177, lr: 5.91e-03, grad_scale: 32.0 2023-06-22 06:31:59,443 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.693e+02 3.085e+02 3.904e+02 6.808e+02, threshold=6.171e+02, percent-clipped=2.0 2023-06-22 06:32:55,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=913854.0, ans=0.0 2023-06-22 06:33:44,904 INFO [train.py:996] (2/4) Epoch 5, batch 30350, loss[loss=0.3123, simple_loss=0.3953, pruned_loss=0.1146, over 21544.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3182, pruned_loss=0.08318, over 4264550.47 frames. ], batch size: 473, lr: 5.91e-03, grad_scale: 32.0 2023-06-22 06:34:12,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=913974.0, ans=15.0 2023-06-22 06:34:16,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=914034.0, ans=0.0 2023-06-22 06:34:57,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-22 06:36:02,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=914214.0, ans=0.125 2023-06-22 06:36:23,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-22 06:36:44,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=914274.0, ans=0.05 2023-06-22 06:36:45,045 INFO [train.py:996] (2/4) Epoch 5, batch 30400, loss[loss=0.2315, simple_loss=0.2796, pruned_loss=0.09174, over 20136.00 frames. 
], tot_loss[loss=0.2372, simple_loss=0.3116, pruned_loss=0.08142, over 4254480.03 frames. ], batch size: 702, lr: 5.91e-03, grad_scale: 32.0 2023-06-22 06:37:53,817 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.876e+02 3.400e+02 4.558e+02 7.300e+02, threshold=6.801e+02, percent-clipped=4.0 2023-06-22 06:37:55,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=914334.0, ans=0.0 2023-06-22 06:38:41,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=914394.0, ans=0.125 2023-06-22 06:38:42,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=914394.0, ans=0.125 2023-06-22 06:40:23,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=914514.0, ans=0.125 2023-06-22 06:40:24,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=914514.0, ans=0.1 2023-06-22 06:40:25,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-22 06:40:57,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=914514.0, ans=0.0 2023-06-22 06:40:57,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=914514.0, ans=0.125 2023-06-22 06:41:03,398 INFO [train.py:996] (2/4) Epoch 5, batch 30450, loss[loss=0.3073, simple_loss=0.4076, pruned_loss=0.1036, over 19883.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3137, pruned_loss=0.0823, over 4196896.00 frames. ], batch size: 702, lr: 5.91e-03, grad_scale: 16.0 2023-06-22 06:42:28,114 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:42:30,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=914634.0, ans=0.0 2023-06-22 06:43:08,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-22 06:43:50,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=914694.0, ans=0.125 2023-06-22 06:47:12,981 INFO [train.py:996] (2/4) Epoch 6, batch 0, loss[loss=0.2199, simple_loss=0.2965, pruned_loss=0.07164, over 21735.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2965, pruned_loss=0.07164, over 21735.00 frames. ], batch size: 124, lr: 5.35e-03, grad_scale: 32.0 2023-06-22 06:47:12,981 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 06:48:06,649 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2383, simple_loss=0.345, pruned_loss=0.06584, over 1796401.00 frames. 
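Note on the optim.py entries ("Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..."): the five numbers read as the min/25%/median/75%/max of recently observed gradient norms, and in every such entry in this section the printed threshold equals 2.0 times the median (e.g. 2.878e+02 -> 5.756e+02, 6.285e+02 -> 1.257e+03). A small sketch under that assumption; the buffer semantics and the exact definition of percent-clipped are guesses, not taken from this log:

from typing import Sequence

def clipping_summary(recent_grad_norms: Sequence[float], clipping_scale: float = 2.0):
    # Quartiles (min, 25%, median, 75%, max) of the recently observed gradient norms.
    norms = sorted(recent_grad_norms)
    n = len(norms)
    quartiles = [norms[min(n - 1, round(q * (n - 1)))] for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
    threshold = clipping_scale * quartiles[2]  # threshold = Clipping_scale * median
    percent_clipped = 100.0 * sum(g > threshold for g in norms) / n
    return quartiles, threshold, percent_clipped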
2023-06-22 06:48:06,650 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-22 06:48:13,383 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:48:20,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=914838.0, ans=0.125 2023-06-22 06:48:50,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=914898.0, ans=0.125 2023-06-22 06:48:52,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 5.006e+02 6.285e+02 8.348e+02 2.118e+03, threshold=1.257e+03, percent-clipped=42.0 2023-06-22 06:48:57,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=914958.0, ans=0.0 2023-06-22 06:49:24,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=915018.0, ans=0.2 2023-06-22 06:49:46,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=915078.0, ans=0.125 2023-06-22 06:50:15,984 INFO [train.py:996] (2/4) Epoch 6, batch 50, loss[loss=0.2029, simple_loss=0.2892, pruned_loss=0.0583, over 21364.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3196, pruned_loss=0.0815, over 964838.42 frames. ], batch size: 194, lr: 5.35e-03, grad_scale: 16.0 2023-06-22 06:50:20,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=915138.0, ans=0.2 2023-06-22 06:50:26,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=915138.0, ans=0.1 2023-06-22 06:51:21,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=915318.0, ans=0.0 2023-06-22 06:51:53,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=915378.0, ans=0.1 2023-06-22 06:52:24,634 INFO [train.py:996] (2/4) Epoch 6, batch 100, loss[loss=0.2505, simple_loss=0.343, pruned_loss=0.07895, over 21748.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3321, pruned_loss=0.08375, over 1688946.96 frames. ], batch size: 332, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 06:53:10,359 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 2.323e+02 2.634e+02 3.053e+02 4.648e+02, threshold=5.268e+02, percent-clipped=0.0 2023-06-22 06:53:11,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=915558.0, ans=0.125 2023-06-22 06:53:29,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=915618.0, ans=0.1 2023-06-22 06:54:19,712 INFO [train.py:996] (2/4) Epoch 6, batch 150, loss[loss=0.2466, simple_loss=0.3446, pruned_loss=0.07426, over 21798.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3315, pruned_loss=0.08179, over 2256278.63 frames. 
], batch size: 332, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 06:56:01,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=915918.0, ans=0.125 2023-06-22 06:56:43,945 INFO [train.py:996] (2/4) Epoch 6, batch 200, loss[loss=0.2852, simple_loss=0.3515, pruned_loss=0.1094, over 21798.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3295, pruned_loss=0.08251, over 2702145.21 frames. ], batch size: 441, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 06:56:48,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=916038.0, ans=0.125 2023-06-22 06:57:20,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=916098.0, ans=0.0 2023-06-22 06:57:27,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=916098.0, ans=0.125 2023-06-22 06:57:42,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.601e+02 2.986e+02 3.624e+02 6.597e+02, threshold=5.972e+02, percent-clipped=3.0 2023-06-22 06:58:53,832 INFO [train.py:996] (2/4) Epoch 6, batch 250, loss[loss=0.2374, simple_loss=0.3152, pruned_loss=0.07978, over 21515.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3241, pruned_loss=0.07996, over 3046810.57 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 06:58:55,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=916338.0, ans=0.0 2023-06-22 07:00:20,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-22 07:00:22,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=916518.0, ans=0.2 2023-06-22 07:00:52,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916578.0, ans=0.1 2023-06-22 07:01:15,610 INFO [train.py:996] (2/4) Epoch 6, batch 300, loss[loss=0.2323, simple_loss=0.3052, pruned_loss=0.07965, over 21600.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3184, pruned_loss=0.0791, over 3315569.69 frames. ], batch size: 263, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 07:02:07,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=916698.0, ans=0.125 2023-06-22 07:02:11,254 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.629e+02 3.080e+02 3.512e+02 4.991e+02, threshold=6.161e+02, percent-clipped=0.0 2023-06-22 07:02:32,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=916818.0, ans=0.0 2023-06-22 07:03:26,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=916878.0, ans=0.0 2023-06-22 07:03:36,292 INFO [train.py:996] (2/4) Epoch 6, batch 350, loss[loss=0.1996, simple_loss=0.2664, pruned_loss=0.06641, over 21833.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3124, pruned_loss=0.07912, over 3524855.47 frames. 
], batch size: 352, lr: 5.34e-03, grad_scale: 16.0 2023-06-22 07:03:57,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=916938.0, ans=0.125 2023-06-22 07:04:08,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=15.0 2023-06-22 07:04:24,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=917058.0, ans=0.0 2023-06-22 07:04:45,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=917118.0, ans=0.05 2023-06-22 07:05:31,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=917178.0, ans=0.05 2023-06-22 07:05:42,665 INFO [train.py:996] (2/4) Epoch 6, batch 400, loss[loss=0.2009, simple_loss=0.2697, pruned_loss=0.06604, over 21635.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.306, pruned_loss=0.07778, over 3687412.62 frames. ], batch size: 298, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:06:08,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=917238.0, ans=0.125 2023-06-22 07:06:41,522 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.619e+02 3.011e+02 3.416e+02 5.139e+02, threshold=6.021e+02, percent-clipped=0.0 2023-06-22 07:06:48,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=917358.0, ans=0.0 2023-06-22 07:07:55,473 INFO [train.py:996] (2/4) Epoch 6, batch 450, loss[loss=0.2246, simple_loss=0.2806, pruned_loss=0.08428, over 21673.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3028, pruned_loss=0.07683, over 3823011.46 frames. ], batch size: 417, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:07:57,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=917538.0, ans=0.1 2023-06-22 07:08:12,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=917538.0, ans=0.0 2023-06-22 07:08:31,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=917598.0, ans=0.07 2023-06-22 07:08:34,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-22 07:08:54,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=917598.0, ans=0.125 2023-06-22 07:09:14,986 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=22.5 2023-06-22 07:09:50,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-22 07:10:16,303 INFO [train.py:996] (2/4) Epoch 6, batch 500, loss[loss=0.2392, simple_loss=0.3534, pruned_loss=0.06248, over 19816.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3048, pruned_loss=0.07571, over 3929782.34 frames. 
], batch size: 703, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:10:49,173 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.18 vs. limit=12.0 2023-06-22 07:10:57,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=917898.0, ans=0.0 2023-06-22 07:11:01,586 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.514e+02 2.945e+02 3.485e+02 5.759e+02, threshold=5.890e+02, percent-clipped=0.0 2023-06-22 07:11:46,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=918018.0, ans=0.1 2023-06-22 07:12:12,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=918078.0, ans=0.0 2023-06-22 07:12:28,422 INFO [train.py:996] (2/4) Epoch 6, batch 550, loss[loss=0.2986, simple_loss=0.3969, pruned_loss=0.1002, over 21542.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3103, pruned_loss=0.07651, over 4005236.73 frames. ], batch size: 471, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:13:17,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=22.5 2023-06-22 07:13:32,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=918258.0, ans=0.0 2023-06-22 07:13:44,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=918318.0, ans=0.0 2023-06-22 07:14:06,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5 2023-06-22 07:14:34,560 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:14:35,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.80 vs. limit=22.5 2023-06-22 07:14:38,194 INFO [train.py:996] (2/4) Epoch 6, batch 600, loss[loss=0.2141, simple_loss=0.2763, pruned_loss=0.07593, over 21997.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3119, pruned_loss=0.07653, over 4070114.11 frames. ], batch size: 103, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:15:22,304 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.705e+02 3.324e+02 3.965e+02 6.330e+02, threshold=6.647e+02, percent-clipped=3.0 2023-06-22 07:16:49,489 INFO [train.py:996] (2/4) Epoch 6, batch 650, loss[loss=0.2117, simple_loss=0.276, pruned_loss=0.07376, over 21854.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3129, pruned_loss=0.07734, over 4124592.28 frames. ], batch size: 107, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:16:58,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=918738.0, ans=0.07 2023-06-22 07:17:29,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=8.0 2023-06-22 07:18:14,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=918918.0, ans=0.0 2023-06-22 07:18:53,911 INFO [train.py:996] (2/4) Epoch 6, batch 700, loss[loss=0.2425, simple_loss=0.3225, pruned_loss=0.08127, over 21797.00 frames. 
], tot_loss[loss=0.2334, simple_loss=0.312, pruned_loss=0.07742, over 4162035.06 frames. ], batch size: 107, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:19:46,552 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.473e+02 2.771e+02 3.367e+02 4.695e+02, threshold=5.542e+02, percent-clipped=0.0 2023-06-22 07:20:25,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=919218.0, ans=0.125 2023-06-22 07:20:45,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=919278.0, ans=0.07 2023-06-22 07:21:03,761 INFO [train.py:996] (2/4) Epoch 6, batch 750, loss[loss=0.2375, simple_loss=0.2934, pruned_loss=0.09085, over 21985.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3151, pruned_loss=0.07903, over 4191219.85 frames. ], batch size: 103, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:21:43,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.48 vs. limit=10.0 2023-06-22 07:22:03,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=919458.0, ans=0.125 2023-06-22 07:23:10,561 INFO [train.py:996] (2/4) Epoch 6, batch 800, loss[loss=0.2011, simple_loss=0.2779, pruned_loss=0.06217, over 21682.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3101, pruned_loss=0.07899, over 4210684.00 frames. ], batch size: 263, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:24:08,544 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.537e+02 3.024e+02 3.645e+02 6.511e+02, threshold=6.048e+02, percent-clipped=3.0 2023-06-22 07:24:17,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=919758.0, ans=0.2 2023-06-22 07:24:25,599 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.30 vs. limit=15.0 2023-06-22 07:24:28,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=919818.0, ans=0.09899494936611666 2023-06-22 07:24:28,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=8.0 2023-06-22 07:25:19,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=919878.0, ans=0.0 2023-06-22 07:25:23,374 INFO [train.py:996] (2/4) Epoch 6, batch 850, loss[loss=0.2449, simple_loss=0.3687, pruned_loss=0.06049, over 19723.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3076, pruned_loss=0.07888, over 4230507.18 frames. ], batch size: 703, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:26:04,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=919998.0, ans=0.09899494936611666 2023-06-22 07:26:36,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. 
limit=15.0 2023-06-22 07:26:38,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=920058.0, ans=0.125 2023-06-22 07:26:59,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=920118.0, ans=0.1 2023-06-22 07:27:43,055 INFO [train.py:996] (2/4) Epoch 6, batch 900, loss[loss=0.2169, simple_loss=0.2832, pruned_loss=0.07534, over 21471.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3045, pruned_loss=0.07796, over 4248176.28 frames. ], batch size: 194, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:28:24,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.591e+02 2.994e+02 3.530e+02 5.655e+02, threshold=5.988e+02, percent-clipped=0.0 2023-06-22 07:28:30,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=920358.0, ans=0.125 2023-06-22 07:29:48,823 INFO [train.py:996] (2/4) Epoch 6, batch 950, loss[loss=0.2357, simple_loss=0.3111, pruned_loss=0.08018, over 21877.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.301, pruned_loss=0.07706, over 4260853.64 frames. ], batch size: 107, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:29:49,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=920538.0, ans=0.125 2023-06-22 07:30:31,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=22.5 2023-06-22 07:31:35,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-22 07:32:07,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=920838.0, ans=0.02 2023-06-22 07:32:08,536 INFO [train.py:996] (2/4) Epoch 6, batch 1000, loss[loss=0.2394, simple_loss=0.3312, pruned_loss=0.07384, over 21628.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2996, pruned_loss=0.07694, over 4270078.92 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:32:15,111 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:32:18,094 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:32:24,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=920898.0, ans=0.2 2023-06-22 07:33:00,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=920958.0, ans=0.0 2023-06-22 07:33:01,791 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.550e+02 2.798e+02 3.235e+02 6.072e+02, threshold=5.596e+02, percent-clipped=1.0 2023-06-22 07:33:39,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=15.0 2023-06-22 07:34:20,078 INFO [train.py:996] (2/4) Epoch 6, batch 1050, loss[loss=0.2042, simple_loss=0.2912, pruned_loss=0.05862, over 21742.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3027, pruned_loss=0.07815, over 4279929.74 frames. 
], batch size: 247, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:34:29,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=921138.0, ans=0.1 2023-06-22 07:34:33,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=921198.0, ans=0.125 2023-06-22 07:35:28,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.72 vs. limit=15.0 2023-06-22 07:36:21,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=921378.0, ans=0.2 2023-06-22 07:36:22,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=921378.0, ans=0.125 2023-06-22 07:36:23,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=921378.0, ans=0.5 2023-06-22 07:36:26,863 INFO [train.py:996] (2/4) Epoch 6, batch 1100, loss[loss=0.2273, simple_loss=0.3097, pruned_loss=0.07249, over 21531.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3043, pruned_loss=0.07823, over 4280617.68 frames. ], batch size: 471, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:37:12,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=921498.0, ans=0.125 2023-06-22 07:37:18,783 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.723e+02 3.096e+02 3.940e+02 7.393e+02, threshold=6.192e+02, percent-clipped=9.0 2023-06-22 07:38:44,053 INFO [train.py:996] (2/4) Epoch 6, batch 1150, loss[loss=0.2262, simple_loss=0.2998, pruned_loss=0.07631, over 21301.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3026, pruned_loss=0.07648, over 4285331.17 frames. ], batch size: 143, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:39:47,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=921798.0, ans=0.1 2023-06-22 07:40:45,640 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:41:01,470 INFO [train.py:996] (2/4) Epoch 6, batch 1200, loss[loss=0.2488, simple_loss=0.3473, pruned_loss=0.07515, over 21649.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3033, pruned_loss=0.0767, over 4280157.64 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:41:31,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=922038.0, ans=0.125 2023-06-22 07:41:31,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=922038.0, ans=0.2 2023-06-22 07:41:33,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. 
limit=15.0 2023-06-22 07:41:45,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=922098.0, ans=0.025 2023-06-22 07:42:14,467 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 2.527e+02 2.979e+02 3.746e+02 6.173e+02, threshold=5.958e+02, percent-clipped=0.0 2023-06-22 07:42:16,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=922158.0, ans=0.07 2023-06-22 07:43:11,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=922278.0, ans=0.09899494936611666 2023-06-22 07:43:23,982 INFO [train.py:996] (2/4) Epoch 6, batch 1250, loss[loss=0.2283, simple_loss=0.3114, pruned_loss=0.07258, over 21677.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3054, pruned_loss=0.07838, over 4287135.53 frames. ], batch size: 389, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:43:32,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=922338.0, ans=0.0 2023-06-22 07:43:47,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=922338.0, ans=0.0 2023-06-22 07:44:33,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=922458.0, ans=0.125 2023-06-22 07:45:07,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=922518.0, ans=0.0 2023-06-22 07:45:12,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=922518.0, ans=0.0 2023-06-22 07:45:34,451 INFO [train.py:996] (2/4) Epoch 6, batch 1300, loss[loss=0.2335, simple_loss=0.3013, pruned_loss=0.0829, over 21359.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3062, pruned_loss=0.07866, over 4293375.22 frames. ], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:46:05,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.22 vs. 
limit=6.0 2023-06-22 07:46:17,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=922698.0, ans=0.0 2023-06-22 07:46:54,074 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.638e+02 3.096e+02 3.825e+02 7.395e+02, threshold=6.191e+02, percent-clipped=3.0 2023-06-22 07:46:56,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=922758.0, ans=0.0 2023-06-22 07:47:02,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=922758.0, ans=0.07 2023-06-22 07:47:29,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=922878.0, ans=0.025 2023-06-22 07:47:29,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=922878.0, ans=0.0 2023-06-22 07:47:33,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=922878.0, ans=0.025 2023-06-22 07:47:59,659 INFO [train.py:996] (2/4) Epoch 6, batch 1350, loss[loss=0.2282, simple_loss=0.2876, pruned_loss=0.08438, over 21327.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3067, pruned_loss=0.0794, over 4291585.93 frames. ], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:48:13,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=922938.0, ans=6.0 2023-06-22 07:49:26,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=923058.0, ans=0.0 2023-06-22 07:49:29,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=22.5 2023-06-22 07:49:30,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=923058.0, ans=0.0 2023-06-22 07:49:40,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=923118.0, ans=0.0 2023-06-22 07:50:01,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=923178.0, ans=0.0 2023-06-22 07:50:03,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=923238.0, ans=0.05 2023-06-22 07:50:04,093 INFO [train.py:996] (2/4) Epoch 6, batch 1400, loss[loss=0.2418, simple_loss=0.3119, pruned_loss=0.08588, over 21315.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3052, pruned_loss=0.07893, over 4293678.24 frames. 
], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:50:26,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=923238.0, ans=0.1 2023-06-22 07:50:50,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=923298.0, ans=0.125 2023-06-22 07:51:10,361 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.497e+02 2.735e+02 3.069e+02 5.769e+02, threshold=5.470e+02, percent-clipped=0.0 2023-06-22 07:51:22,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=923358.0, ans=0.125 2023-06-22 07:51:28,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=923418.0, ans=0.0 2023-06-22 07:51:29,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=923418.0, ans=0.125 2023-06-22 07:51:46,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-22 07:52:19,698 INFO [train.py:996] (2/4) Epoch 6, batch 1450, loss[loss=0.2333, simple_loss=0.3067, pruned_loss=0.08, over 21384.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3063, pruned_loss=0.0796, over 4290911.07 frames. ], batch size: 549, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:52:23,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-22 07:53:39,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=923718.0, ans=0.125 2023-06-22 07:53:41,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-22 07:54:26,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=923838.0, ans=0.125 2023-06-22 07:54:36,508 INFO [train.py:996] (2/4) Epoch 6, batch 1500, loss[loss=0.2364, simple_loss=0.2992, pruned_loss=0.08675, over 21615.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3091, pruned_loss=0.08136, over 4295979.45 frames. 
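
Note on the loss fields: throughout this excerpt the logged loss is consistent with loss ≈ 0.5 * simple_loss + pruned_loss (for the batch 1500 running average just above, 0.5 * 0.3091 + 0.08136 ≈ 0.2359), i.e. a down-weighted "simple" transducer loss plus the pruned transducer loss. A minimal sketch of that combination; the 0.5 weight is inferred from the printed numbers, not read out of the recipe:

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        """Total loss as it appears to be logged here: a down-weighted
        simple loss plus the full pruned loss.  The 0.5 weight is
        inferred from the logged values, not from the configuration."""
        return simple_loss_scale * simple_loss + pruned_loss

    # Reproduces the running average for Epoch 6, batch 1500:
    print(combined_loss(0.3091, 0.08136))  # ~0.2359
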
], batch size: 548, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:54:39,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=923838.0, ans=0.2 2023-06-22 07:55:03,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=923898.0, ans=0.0 2023-06-22 07:55:24,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=923898.0, ans=0.0 2023-06-22 07:55:31,685 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.593e+02 2.969e+02 3.439e+02 4.928e+02, threshold=5.939e+02, percent-clipped=0.0 2023-06-22 07:55:59,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=924018.0, ans=0.125 2023-06-22 07:56:17,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=924078.0, ans=0.2 2023-06-22 07:56:20,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=924078.0, ans=0.1 2023-06-22 07:56:39,027 INFO [train.py:996] (2/4) Epoch 6, batch 1550, loss[loss=0.2216, simple_loss=0.2946, pruned_loss=0.07429, over 21866.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3051, pruned_loss=0.07862, over 4301294.13 frames. ], batch size: 351, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:57:23,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=924198.0, ans=0.125 2023-06-22 07:59:11,421 INFO [train.py:996] (2/4) Epoch 6, batch 1600, loss[loss=0.2218, simple_loss=0.2968, pruned_loss=0.07339, over 20039.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3028, pruned_loss=0.0782, over 4299322.92 frames. ], batch size: 702, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 08:00:03,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=924498.0, ans=0.2 2023-06-22 08:00:09,253 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.447e+02 2.889e+02 3.506e+02 5.752e+02, threshold=5.778e+02, percent-clipped=0.0 2023-06-22 08:00:09,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=924558.0, ans=0.2 2023-06-22 08:01:19,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=924678.0, ans=0.1 2023-06-22 08:01:23,864 INFO [train.py:996] (2/4) Epoch 6, batch 1650, loss[loss=0.2822, simple_loss=0.3335, pruned_loss=0.1155, over 21607.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3025, pruned_loss=0.07837, over 4287659.94 frames. ], batch size: 471, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:01:46,338 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:02:28,139 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:02:28,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=924858.0, ans=0.1 2023-06-22 08:02:28,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.23 vs. 
limit=22.5 2023-06-22 08:03:42,429 INFO [train.py:996] (2/4) Epoch 6, batch 1700, loss[loss=0.187, simple_loss=0.2824, pruned_loss=0.04583, over 21647.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3043, pruned_loss=0.07894, over 4286813.99 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:04:39,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=925158.0, ans=0.125 2023-06-22 08:04:40,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=925158.0, ans=0.0 2023-06-22 08:04:42,724 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.575e+02 2.876e+02 3.379e+02 6.371e+02, threshold=5.752e+02, percent-clipped=1.0 2023-06-22 08:05:03,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=925158.0, ans=0.0 2023-06-22 08:06:13,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=925338.0, ans=0.04949747468305833 2023-06-22 08:06:14,620 INFO [train.py:996] (2/4) Epoch 6, batch 1750, loss[loss=0.1508, simple_loss=0.22, pruned_loss=0.04083, over 21377.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3044, pruned_loss=0.07674, over 4288823.31 frames. ], batch size: 131, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:07:03,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=925398.0, ans=0.2 2023-06-22 08:08:00,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=925518.0, ans=10.0 2023-06-22 08:08:12,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=925518.0, ans=0.125 2023-06-22 08:08:35,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=925578.0, ans=0.125 2023-06-22 08:08:43,949 INFO [train.py:996] (2/4) Epoch 6, batch 1800, loss[loss=0.2183, simple_loss=0.2961, pruned_loss=0.07024, over 21737.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3025, pruned_loss=0.07431, over 4292322.62 frames. ], batch size: 298, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:09:38,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.513e+02 3.137e+02 3.734e+02 6.683e+02, threshold=6.274e+02, percent-clipped=3.0 2023-06-22 08:09:39,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=925758.0, ans=0.1 2023-06-22 08:10:18,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925818.0, ans=0.1 2023-06-22 08:10:20,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. 
limit=15.0 2023-06-22 08:10:38,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=925878.0, ans=0.125 2023-06-22 08:10:54,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=925878.0, ans=0.125 2023-06-22 08:10:58,598 INFO [train.py:996] (2/4) Epoch 6, batch 1850, loss[loss=0.2175, simple_loss=0.313, pruned_loss=0.06103, over 21753.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3053, pruned_loss=0.07366, over 4293696.66 frames. ], batch size: 351, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:11:02,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=925938.0, ans=0.125 2023-06-22 08:11:03,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=925938.0, ans=0.125 2023-06-22 08:11:15,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=925998.0, ans=0.125 2023-06-22 08:11:45,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-22 08:11:46,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=926058.0, ans=0.125 2023-06-22 08:12:00,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=926058.0, ans=0.0 2023-06-22 08:12:11,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=926058.0, ans=0.125 2023-06-22 08:13:07,276 INFO [train.py:996] (2/4) Epoch 6, batch 1900, loss[loss=0.2415, simple_loss=0.3153, pruned_loss=0.08381, over 21193.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3037, pruned_loss=0.07341, over 4293664.90 frames. ], batch size: 548, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:13:59,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.371e+02 2.637e+02 3.288e+02 5.530e+02, threshold=5.274e+02, percent-clipped=0.0 2023-06-22 08:14:05,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.06 vs. limit=10.0 2023-06-22 08:15:01,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=926478.0, ans=0.1 2023-06-22 08:15:01,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=926478.0, ans=0.125 2023-06-22 08:15:08,554 INFO [train.py:996] (2/4) Epoch 6, batch 1950, loss[loss=0.215, simple_loss=0.3138, pruned_loss=0.05811, over 21698.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3008, pruned_loss=0.07238, over 4291527.35 frames. 
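
Note on the optim.py lines: each "Clipping_scale=2.0, grad-norm quartiles ..." entry reports five quartile points of recently observed gradient norms, the resulting clipping threshold, and the percentage of recent batches that were clipped. In every such entry in this excerpt the threshold equals Clipping_scale times the middle quartile (the median), e.g. 2.0 * 2.637e+02 = 5.274e+02 in the entry above, so the clipping threshold appears to track twice the median gradient norm. A short sketch of that arithmetic; it illustrates the relationship visible in the log, not the optimizer's actual implementation:

    def clipping_threshold(quartiles, clipping_scale=2.0):
        """Threshold as it appears to be derived in these log lines:
        clipping_scale times the median of the printed grad-norm
        quartiles.  Inferred from the log, not copied from optim.py."""
        median = sorted(quartiles)[len(quartiles) // 2]
        return clipping_scale * median

    # Entry above: quartiles 1.839e+02 ... 5.530e+02 -> threshold 5.274e+02
    print(clipping_threshold([183.9, 237.1, 263.7, 328.8, 553.0]))  # 527.4
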
], batch size: 298, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:15:53,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=926598.0, ans=0.2 2023-06-22 08:16:24,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=926658.0, ans=0.09899494936611666 2023-06-22 08:17:25,658 INFO [train.py:996] (2/4) Epoch 6, batch 2000, loss[loss=0.1846, simple_loss=0.2603, pruned_loss=0.05445, over 21294.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2971, pruned_loss=0.07045, over 4293619.36 frames. ], batch size: 159, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:17:59,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=926898.0, ans=0.0 2023-06-22 08:18:12,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=926898.0, ans=0.035 2023-06-22 08:18:41,769 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.517e+02 3.003e+02 3.680e+02 6.988e+02, threshold=6.006e+02, percent-clipped=2.0 2023-06-22 08:19:08,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=927018.0, ans=0.07 2023-06-22 08:19:40,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=927138.0, ans=0.125 2023-06-22 08:19:41,270 INFO [train.py:996] (2/4) Epoch 6, batch 2050, loss[loss=0.1787, simple_loss=0.2494, pruned_loss=0.05394, over 21530.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2972, pruned_loss=0.07077, over 4292253.80 frames. ], batch size: 195, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:21:09,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=927258.0, ans=0.09899494936611666 2023-06-22 08:21:27,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=927318.0, ans=0.5 2023-06-22 08:21:53,269 INFO [train.py:996] (2/4) Epoch 6, batch 2100, loss[loss=0.2167, simple_loss=0.2973, pruned_loss=0.068, over 21657.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3, pruned_loss=0.073, over 4289683.09 frames. ], batch size: 263, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:22:17,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-22 08:23:02,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.488e+02 2.797e+02 3.181e+02 4.805e+02, threshold=5.593e+02, percent-clipped=0.0 2023-06-22 08:23:07,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=927558.0, ans=0.1 2023-06-22 08:23:21,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=927618.0, ans=0.0 2023-06-22 08:23:55,916 INFO [train.py:996] (2/4) Epoch 6, batch 2150, loss[loss=0.2373, simple_loss=0.3191, pruned_loss=0.07771, over 21374.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3016, pruned_loss=0.07524, over 4287645.47 frames. 
], batch size: 211, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:24:11,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=927738.0, ans=0.125 2023-06-22 08:24:34,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=927798.0, ans=0.125 2023-06-22 08:24:40,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=927798.0, ans=0.0 2023-06-22 08:24:41,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=927798.0, ans=0.0 2023-06-22 08:24:57,792 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:26:29,702 INFO [train.py:996] (2/4) Epoch 6, batch 2200, loss[loss=0.2144, simple_loss=0.2704, pruned_loss=0.07914, over 21291.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3034, pruned_loss=0.07628, over 4281163.79 frames. ], batch size: 608, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:26:52,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-22 08:26:57,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=928098.0, ans=0.125 2023-06-22 08:27:04,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=928098.0, ans=0.125 2023-06-22 08:27:19,791 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.524e+02 2.938e+02 3.360e+02 6.065e+02, threshold=5.877e+02, percent-clipped=1.0 2023-06-22 08:27:51,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=22.5 2023-06-22 08:28:32,323 INFO [train.py:996] (2/4) Epoch 6, batch 2250, loss[loss=0.187, simple_loss=0.2513, pruned_loss=0.06132, over 21764.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3008, pruned_loss=0.07469, over 4277261.15 frames. ], batch size: 112, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:29:26,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=928458.0, ans=0.1 2023-06-22 08:30:01,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-22 08:30:01,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=928518.0, ans=0.2 2023-06-22 08:30:19,470 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:30:23,223 INFO [train.py:996] (2/4) Epoch 6, batch 2300, loss[loss=0.2194, simple_loss=0.2873, pruned_loss=0.0757, over 22019.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2978, pruned_loss=0.07474, over 4279861.38 frames. 
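
Note on the ScheduledFloat lines: these record per-module hyperparameters (dropout probabilities, bypass/skip rates, balancer probabilities and bounds) whose values are looked up from a schedule keyed on batch_count; the "batch_count=..., ans=..." pair is the current point on that schedule. By this stage (batch_count ≈ 928k) most of them appear to have settled at their final values, e.g. dropout_p at 0.1, the various skip rates at 0.0 and balancer probs at 0.125. A minimal sketch of such a piecewise-linear schedule, assuming (batch_count, value) breakpoints; the breakpoints below are illustrative, not the recipe's actual schedules:

    def scheduled_float(batch_count, points):
        """Piecewise-linear schedule over batch_count.  `points` is a
        sorted list of (batch_count, value) breakpoints; outside the
        range the boundary value is held.  Breakpoints are illustrative."""
        if batch_count <= points[0][0]:
            return points[0][1]
        if batch_count >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

    # A skip rate that decays from 0.5 early in training to 0.0, matching
    # the ans=0.0 values logged around batch_count ~ 928k above.
    print(scheduled_float(928_000, [(0, 0.5), (20_000, 0.05), (50_000, 0.0)]))  # 0.0
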
], batch size: 103, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:30:50,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=928638.0, ans=0.0 2023-06-22 08:31:25,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=928758.0, ans=0.04949747468305833 2023-06-22 08:31:28,780 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.418e+02 2.777e+02 3.405e+02 6.239e+02, threshold=5.554e+02, percent-clipped=2.0 2023-06-22 08:32:01,008 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:32:02,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=928818.0, ans=0.0 2023-06-22 08:32:04,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.04 vs. limit=15.0 2023-06-22 08:32:16,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=928878.0, ans=0.125 2023-06-22 08:32:24,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=928878.0, ans=0.1 2023-06-22 08:32:31,783 INFO [train.py:996] (2/4) Epoch 6, batch 2350, loss[loss=0.2217, simple_loss=0.3033, pruned_loss=0.07001, over 20692.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2956, pruned_loss=0.07466, over 4264802.64 frames. ], batch size: 607, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:32:35,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=928938.0, ans=0.125 2023-06-22 08:32:57,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=928938.0, ans=0.125 2023-06-22 08:33:47,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=929058.0, ans=0.2 2023-06-22 08:34:50,007 INFO [train.py:996] (2/4) Epoch 6, batch 2400, loss[loss=0.1955, simple_loss=0.258, pruned_loss=0.06648, over 21559.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2993, pruned_loss=0.07646, over 4260615.02 frames. ], batch size: 263, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:36:02,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.593e+02 2.927e+02 3.472e+02 6.319e+02, threshold=5.855e+02, percent-clipped=5.0 2023-06-22 08:36:49,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=929478.0, ans=0.125 2023-06-22 08:37:11,337 INFO [train.py:996] (2/4) Epoch 6, batch 2450, loss[loss=0.2495, simple_loss=0.3271, pruned_loss=0.08599, over 21765.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3037, pruned_loss=0.07861, over 4260509.67 frames. ], batch size: 113, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:37:15,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.51 vs. limit=8.0 2023-06-22 08:37:43,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. 
limit=22.5 2023-06-22 08:37:45,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-22 08:39:13,826 INFO [train.py:996] (2/4) Epoch 6, batch 2500, loss[loss=0.2228, simple_loss=0.3017, pruned_loss=0.07195, over 21550.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.304, pruned_loss=0.07939, over 4266757.97 frames. ], batch size: 414, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:39:52,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=929898.0, ans=0.125 2023-06-22 08:40:23,085 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.580e+02 2.958e+02 3.428e+02 5.178e+02, threshold=5.916e+02, percent-clipped=0.0 2023-06-22 08:41:01,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=930078.0, ans=0.125 2023-06-22 08:41:26,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=930078.0, ans=0.2 2023-06-22 08:41:28,667 INFO [train.py:996] (2/4) Epoch 6, batch 2550, loss[loss=0.2031, simple_loss=0.2831, pruned_loss=0.06153, over 21393.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3012, pruned_loss=0.07778, over 4271642.92 frames. ], batch size: 131, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:41:33,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-06-22 08:41:48,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=930138.0, ans=0.0 2023-06-22 08:42:21,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=930198.0, ans=0.125 2023-06-22 08:43:09,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=930318.0, ans=0.0 2023-06-22 08:43:51,034 INFO [train.py:996] (2/4) Epoch 6, batch 2600, loss[loss=0.1917, simple_loss=0.2642, pruned_loss=0.05959, over 21590.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3007, pruned_loss=0.07753, over 4263372.78 frames. ], batch size: 247, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:45:08,326 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.524e+02 2.947e+02 3.277e+02 5.096e+02, threshold=5.894e+02, percent-clipped=0.0 2023-06-22 08:45:26,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=930618.0, ans=0.95 2023-06-22 08:46:10,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=930738.0, ans=0.125 2023-06-22 08:46:11,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.63 vs. limit=12.0 2023-06-22 08:46:11,835 INFO [train.py:996] (2/4) Epoch 6, batch 2650, loss[loss=0.2368, simple_loss=0.3173, pruned_loss=0.07816, over 21473.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3019, pruned_loss=0.07734, over 4264170.87 frames. 
], batch size: 131, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:46:32,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=930738.0, ans=0.0 2023-06-22 08:46:39,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=930738.0, ans=0.125 2023-06-22 08:46:41,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=930738.0, ans=0.1 2023-06-22 08:46:42,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=930738.0, ans=0.0 2023-06-22 08:46:42,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=930738.0, ans=0.0 2023-06-22 08:48:18,339 INFO [train.py:996] (2/4) Epoch 6, batch 2700, loss[loss=0.2742, simple_loss=0.3438, pruned_loss=0.1023, over 21555.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3011, pruned_loss=0.07745, over 4255886.27 frames. ], batch size: 471, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:48:33,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-22 08:48:38,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=15.0 2023-06-22 08:49:04,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=931098.0, ans=0.0 2023-06-22 08:49:35,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.647e+02 2.951e+02 3.422e+02 5.387e+02, threshold=5.902e+02, percent-clipped=0.0 2023-06-22 08:49:42,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=931158.0, ans=0.125 2023-06-22 08:49:57,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=931218.0, ans=0.125 2023-06-22 08:50:26,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=931278.0, ans=0.5 2023-06-22 08:50:35,159 INFO [train.py:996] (2/4) Epoch 6, batch 2750, loss[loss=0.2527, simple_loss=0.3265, pruned_loss=0.08944, over 21693.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2998, pruned_loss=0.07735, over 4253674.59 frames. ], batch size: 441, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:51:32,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931398.0, ans=0.1 2023-06-22 08:51:49,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=931458.0, ans=0.05 2023-06-22 08:51:52,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=931458.0, ans=0.125 2023-06-22 08:53:00,969 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-22 08:53:01,320 INFO [train.py:996] (2/4) Epoch 6, batch 2800, loss[loss=0.2584, simple_loss=0.3398, pruned_loss=0.08844, over 21768.00 frames. 
], tot_loss[loss=0.23, simple_loss=0.3039, pruned_loss=0.07805, over 4258570.18 frames. ], batch size: 298, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:54:11,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.698e+02 3.040e+02 3.430e+02 5.325e+02, threshold=6.080e+02, percent-clipped=0.0 2023-06-22 08:54:43,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=931818.0, ans=0.0 2023-06-22 08:55:00,749 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:55:14,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931878.0, ans=0.1 2023-06-22 08:55:25,870 INFO [train.py:996] (2/4) Epoch 6, batch 2850, loss[loss=0.2004, simple_loss=0.2588, pruned_loss=0.07099, over 21273.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3076, pruned_loss=0.0801, over 4257014.24 frames. ], batch size: 549, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:56:01,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=931998.0, ans=0.125 2023-06-22 08:56:17,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932058.0, ans=0.1 2023-06-22 08:56:18,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=932058.0, ans=0.125 2023-06-22 08:56:43,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932118.0, ans=0.1 2023-06-22 08:57:12,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=932118.0, ans=0.125 2023-06-22 08:57:34,744 INFO [train.py:996] (2/4) Epoch 6, batch 2900, loss[loss=0.2583, simple_loss=0.3221, pruned_loss=0.09725, over 22056.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3043, pruned_loss=0.07947, over 4266738.11 frames. ], batch size: 119, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:58:08,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=932238.0, ans=0.05 2023-06-22 08:58:20,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=932298.0, ans=0.125 2023-06-22 08:58:38,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.660e+02 3.154e+02 3.986e+02 8.685e+02, threshold=6.308e+02, percent-clipped=2.0 2023-06-22 08:59:24,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932478.0, ans=0.1 2023-06-22 08:59:46,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=932538.0, ans=0.125 2023-06-22 08:59:51,142 INFO [train.py:996] (2/4) Epoch 6, batch 2950, loss[loss=0.2622, simple_loss=0.3475, pruned_loss=0.08846, over 21656.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3052, pruned_loss=0.07977, over 4271521.27 frames. ], batch size: 263, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 09:00:21,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. 
limit=12.0 2023-06-22 09:00:58,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-22 09:01:10,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=932658.0, ans=0.125 2023-06-22 09:01:11,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-22 09:02:09,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.96 vs. limit=5.0 2023-06-22 09:02:10,246 INFO [train.py:996] (2/4) Epoch 6, batch 3000, loss[loss=0.2814, simple_loss=0.3629, pruned_loss=0.09994, over 21468.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3092, pruned_loss=0.07991, over 4275700.05 frames. ], batch size: 131, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:02:10,249 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 09:03:08,557 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2509, simple_loss=0.3421, pruned_loss=0.07991, over 1796401.00 frames. 2023-06-22 09:03:08,558 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-22 09:03:17,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=932838.0, ans=0.0 2023-06-22 09:03:57,561 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.599e+02 2.976e+02 3.379e+02 5.904e+02, threshold=5.951e+02, percent-clipped=0.0 2023-06-22 09:04:29,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=933018.0, ans=0.125 2023-06-22 09:04:32,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=933018.0, ans=0.2 2023-06-22 09:05:27,298 INFO [train.py:996] (2/4) Epoch 6, batch 3050, loss[loss=0.1798, simple_loss=0.2555, pruned_loss=0.05211, over 21455.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3099, pruned_loss=0.07911, over 4280367.64 frames. ], batch size: 194, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:05:51,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=933198.0, ans=0.1 2023-06-22 09:07:39,457 INFO [train.py:996] (2/4) Epoch 6, batch 3100, loss[loss=0.2194, simple_loss=0.3076, pruned_loss=0.0656, over 21604.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3097, pruned_loss=0.07795, over 4284701.29 frames. ], batch size: 230, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:08:19,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=6.0 2023-06-22 09:08:44,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.677e+02 3.194e+02 3.911e+02 7.241e+02, threshold=6.388e+02, percent-clipped=3.0 2023-06-22 09:10:02,627 INFO [train.py:996] (2/4) Epoch 6, batch 3150, loss[loss=0.2391, simple_loss=0.3111, pruned_loss=0.0836, over 21594.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3104, pruned_loss=0.07791, over 4279294.21 frames. 
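
Note on the Whitening lines: each one compares a measured whitening metric for a module's activations against that module's configured limit (e.g. metric=10.72 vs. limit=15.0 above). The Whiten modules act as regularizers that push the channel covariance of the activations toward an isotropic ("white") covariance, and the metric measures how far from isotropic it currently is. A hedged sketch of one such isotropy measure, the mean squared covariance eigenvalue divided by the squared mean eigenvalue, which is 1.0 for a perfectly white covariance; this is an illustrative choice, not necessarily the exact formula computed in scaling.py:

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        """Illustrative isotropy measure for activations x of shape
        (num_frames, num_channels): 1.0 when the channel covariance is a
        multiple of the identity, larger the more anisotropic it is.
        Not necessarily the metric used by the Whiten module."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]            # (C, C) channel covariance
        eigs = torch.linalg.eigvalsh(cov)       # real, non-negative eigenvalues
        return (eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20)

    # White noise stays close to 1, well below limits like 15.0.
    print(whitening_metric(torch.randn(1000, 256)))
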
], batch size: 230, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:11:05,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=933858.0, ans=0.125 2023-06-22 09:11:44,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=933858.0, ans=0.025 2023-06-22 09:11:48,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=933918.0, ans=0.0 2023-06-22 09:12:30,956 INFO [train.py:996] (2/4) Epoch 6, batch 3200, loss[loss=0.2636, simple_loss=0.3455, pruned_loss=0.09085, over 21756.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3123, pruned_loss=0.07864, over 4274732.39 frames. ], batch size: 441, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:12:35,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=934038.0, ans=0.125 2023-06-22 09:12:37,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=934038.0, ans=0.1 2023-06-22 09:13:53,861 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.451e+02 2.830e+02 3.191e+02 4.381e+02, threshold=5.660e+02, percent-clipped=0.0 2023-06-22 09:13:57,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-22 09:14:08,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=934218.0, ans=0.0 2023-06-22 09:14:44,665 INFO [train.py:996] (2/4) Epoch 6, batch 3250, loss[loss=0.2169, simple_loss=0.2788, pruned_loss=0.07748, over 21744.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3149, pruned_loss=0.08101, over 4280239.74 frames. ], batch size: 282, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:14:47,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-22 09:15:34,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=934398.0, ans=0.125 2023-06-22 09:16:15,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=934458.0, ans=0.1 2023-06-22 09:16:21,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=934518.0, ans=0.0 2023-06-22 09:17:13,740 INFO [train.py:996] (2/4) Epoch 6, batch 3300, loss[loss=0.2072, simple_loss=0.2897, pruned_loss=0.06239, over 21388.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3076, pruned_loss=0.08035, over 4279840.97 frames. 
], batch size: 194, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:18:13,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=934758.0, ans=0.0 2023-06-22 09:18:24,230 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.651e+02 2.919e+02 3.305e+02 7.329e+02, threshold=5.839e+02, percent-clipped=1.0 2023-06-22 09:18:24,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=934758.0, ans=0.0 2023-06-22 09:18:30,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0 2023-06-22 09:19:38,758 INFO [train.py:996] (2/4) Epoch 6, batch 3350, loss[loss=0.2135, simple_loss=0.2874, pruned_loss=0.06977, over 21836.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3101, pruned_loss=0.08029, over 4277330.69 frames. ], batch size: 247, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:20:06,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-22 09:20:45,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=935058.0, ans=0.1 2023-06-22 09:21:12,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=935118.0, ans=0.2 2023-06-22 09:21:20,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=935118.0, ans=0.125 2023-06-22 09:21:40,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.13 vs. limit=22.5 2023-06-22 09:21:45,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=935178.0, ans=0.1 2023-06-22 09:21:52,353 INFO [train.py:996] (2/4) Epoch 6, batch 3400, loss[loss=0.2095, simple_loss=0.2762, pruned_loss=0.07145, over 21244.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3093, pruned_loss=0.0812, over 4275477.99 frames. ], batch size: 176, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:21:52,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=935238.0, ans=0.125 2023-06-22 09:21:59,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-22 09:22:43,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-22 09:22:59,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-22 09:23:02,958 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.660e+02 3.094e+02 4.133e+02 6.206e+02, threshold=6.188e+02, percent-clipped=2.0 2023-06-22 09:23:43,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=935418.0, ans=0.0 2023-06-22 09:24:19,147 INFO [train.py:996] (2/4) Epoch 6, batch 3450, loss[loss=0.2119, simple_loss=0.2811, pruned_loss=0.07135, over 21941.00 frames. 
], tot_loss[loss=0.2329, simple_loss=0.3056, pruned_loss=0.08004, over 4275076.45 frames. ], batch size: 113, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:26:30,614 INFO [train.py:996] (2/4) Epoch 6, batch 3500, loss[loss=0.2483, simple_loss=0.3247, pruned_loss=0.08591, over 21277.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3113, pruned_loss=0.08288, over 4257123.75 frames. ], batch size: 143, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:26:35,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935838.0, ans=0.1 2023-06-22 09:26:55,565 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-22 09:27:33,649 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.716e+02 3.159e+02 3.539e+02 5.891e+02, threshold=6.318e+02, percent-clipped=0.0 2023-06-22 09:27:46,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=935958.0, ans=0.0 2023-06-22 09:27:51,491 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.33 vs. limit=6.0 2023-06-22 09:28:21,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=936078.0, ans=0.125 2023-06-22 09:28:37,241 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-22 09:28:44,936 INFO [train.py:996] (2/4) Epoch 6, batch 3550, loss[loss=0.2103, simple_loss=0.2847, pruned_loss=0.06791, over 21726.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3141, pruned_loss=0.08417, over 4259500.85 frames. ], batch size: 351, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:29:16,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=936198.0, ans=0.125 2023-06-22 09:30:47,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=936378.0, ans=0.0 2023-06-22 09:30:48,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=936378.0, ans=0.0 2023-06-22 09:30:51,132 INFO [train.py:996] (2/4) Epoch 6, batch 3600, loss[loss=0.243, simple_loss=0.3084, pruned_loss=0.0888, over 21676.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3102, pruned_loss=0.08346, over 4262537.68 frames. 
], batch size: 351, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:31:02,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=936438.0, ans=0.0 2023-06-22 09:31:21,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=936498.0, ans=0.0 2023-06-22 09:32:15,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.586e+02 3.004e+02 3.454e+02 6.703e+02, threshold=6.007e+02, percent-clipped=1.0 2023-06-22 09:32:18,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=936558.0, ans=0.125 2023-06-22 09:32:32,888 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-22 09:32:34,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=936618.0, ans=0.2 2023-06-22 09:33:23,817 INFO [train.py:996] (2/4) Epoch 6, batch 3650, loss[loss=0.2362, simple_loss=0.3219, pruned_loss=0.07522, over 21657.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3113, pruned_loss=0.08351, over 4270857.24 frames. ], batch size: 389, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:34:12,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=936858.0, ans=0.2 2023-06-22 09:35:24,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=936978.0, ans=0.0 2023-06-22 09:35:30,884 INFO [train.py:996] (2/4) Epoch 6, batch 3700, loss[loss=0.2901, simple_loss=0.3689, pruned_loss=0.1057, over 21361.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3113, pruned_loss=0.08295, over 4272275.10 frames. ], batch size: 549, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:36:06,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-22 09:36:35,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.535e+02 2.948e+02 3.482e+02 5.680e+02, threshold=5.896e+02, percent-clipped=0.0 2023-06-22 09:36:53,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=937218.0, ans=0.125 2023-06-22 09:37:50,984 INFO [train.py:996] (2/4) Epoch 6, batch 3750, loss[loss=0.2044, simple_loss=0.2855, pruned_loss=0.06169, over 21849.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3096, pruned_loss=0.08242, over 4280114.28 frames. 
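
Note on grad_scale: the value printed with each batch (32.0 here, 16.0 in some earlier batches of this excerpt) appears to be the dynamic loss-scaling factor used for fp16 mixed-precision training. The loss is multiplied by this factor before the backward pass so that small gradients do not underflow in half precision; the factor is reduced when inf/nan gradients are detected and grown again after a run of stable steps, which is why it moves between 16 and 32 here. A minimal sketch using PyTorch's standard torch.cuda.amp machinery (requires a CUDA device); this shows the generic pattern, not necessarily the recipe's exact training loop:

    import torch

    model = torch.nn.Linear(80, 10).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()      # maintains the dynamic scale

    features = torch.randn(8, 80, device="cuda")
    targets = torch.randint(0, 10, (8,), device="cuda")

    with torch.cuda.amp.autocast():           # fp16 forward pass
        loss = torch.nn.functional.cross_entropy(model(features), targets)

    scaler.scale(loss).backward()             # scaled loss -> scaled grads
    scaler.step(optimizer)                    # unscales; skips step on inf/nan
    scaler.update()                           # grows/shrinks the scale factor
    print(scaler.get_scale())                 # analogous to the logged grad_scale
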
], batch size: 351, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:38:37,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=937398.0, ans=0.125 2023-06-22 09:38:50,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=937458.0, ans=0.125 2023-06-22 09:38:55,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=937458.0, ans=0.125 2023-06-22 09:39:27,424 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:39:43,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.19 vs. limit=15.0 2023-06-22 09:40:13,718 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2023-06-22 09:40:15,567 INFO [train.py:996] (2/4) Epoch 6, batch 3800, loss[loss=0.2131, simple_loss=0.2885, pruned_loss=0.06885, over 21785.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3066, pruned_loss=0.08012, over 4279334.17 frames. ], batch size: 247, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:41:16,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 2.491e+02 2.732e+02 3.396e+02 7.408e+02, threshold=5.464e+02, percent-clipped=3.0 2023-06-22 09:42:24,410 INFO [train.py:996] (2/4) Epoch 6, batch 3850, loss[loss=0.3033, simple_loss=0.4054, pruned_loss=0.1006, over 19980.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3058, pruned_loss=0.08059, over 4281624.53 frames. ], batch size: 702, lr: 5.28e-03, grad_scale: 16.0 2023-06-22 09:42:59,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=937998.0, ans=0.0 2023-06-22 09:43:25,872 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-22 09:44:13,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=938118.0, ans=0.125 2023-06-22 09:44:20,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=938118.0, ans=0.0 2023-06-22 09:44:26,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=938178.0, ans=15.0 2023-06-22 09:44:44,220 INFO [train.py:996] (2/4) Epoch 6, batch 3900, loss[loss=0.241, simple_loss=0.3091, pruned_loss=0.0865, over 21827.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3026, pruned_loss=0.08044, over 4276466.08 frames. 
], batch size: 107, lr: 5.28e-03, grad_scale: 16.0 2023-06-22 09:44:54,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=938238.0, ans=0.0 2023-06-22 09:45:34,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=938358.0, ans=0.1 2023-06-22 09:45:53,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.660e+02 3.085e+02 3.718e+02 7.121e+02, threshold=6.170e+02, percent-clipped=3.0 2023-06-22 09:45:54,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=938358.0, ans=0.1 2023-06-22 09:46:34,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=938478.0, ans=0.125 2023-06-22 09:47:02,443 INFO [train.py:996] (2/4) Epoch 6, batch 3950, loss[loss=0.2148, simple_loss=0.2592, pruned_loss=0.08514, over 20169.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3037, pruned_loss=0.08009, over 4277115.97 frames. ], batch size: 703, lr: 5.28e-03, grad_scale: 16.0 2023-06-22 09:47:20,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=938538.0, ans=0.125 2023-06-22 09:47:21,944 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:49:05,508 INFO [train.py:996] (2/4) Epoch 6, batch 4000, loss[loss=0.2409, simple_loss=0.3063, pruned_loss=0.0878, over 20613.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2964, pruned_loss=0.07647, over 4276450.54 frames. ], batch size: 607, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:49:23,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=938838.0, ans=0.0 2023-06-22 09:49:46,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=938898.0, ans=0.2 2023-06-22 09:50:15,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.398e+02 2.701e+02 3.226e+02 5.808e+02, threshold=5.402e+02, percent-clipped=0.0 2023-06-22 09:51:00,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=939078.0, ans=0.125 2023-06-22 09:51:10,098 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:51:27,241 INFO [train.py:996] (2/4) Epoch 6, batch 4050, loss[loss=0.2525, simple_loss=0.3109, pruned_loss=0.09706, over 21549.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2962, pruned_loss=0.07516, over 4277835.21 frames. ], batch size: 441, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:51:39,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=939138.0, ans=0.2 2023-06-22 09:52:36,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=939258.0, ans=0.1 2023-06-22 09:53:14,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.48 vs. 
limit=15.0 2023-06-22 09:53:57,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=939438.0, ans=0.2 2023-06-22 09:53:58,554 INFO [train.py:996] (2/4) Epoch 6, batch 4100, loss[loss=0.2078, simple_loss=0.2945, pruned_loss=0.06054, over 21607.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2977, pruned_loss=0.07413, over 4275130.96 frames. ], batch size: 230, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:55:09,873 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.299e+02 2.709e+02 3.070e+02 5.765e+02, threshold=5.418e+02, percent-clipped=2.0 2023-06-22 09:56:10,772 INFO [train.py:996] (2/4) Epoch 6, batch 4150, loss[loss=0.1688, simple_loss=0.2675, pruned_loss=0.03505, over 21542.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2966, pruned_loss=0.07134, over 4280337.06 frames. ], batch size: 212, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:56:28,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=939738.0, ans=0.125 2023-06-22 09:56:42,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=939798.0, ans=0.125 2023-06-22 09:56:53,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-22 09:56:59,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=939858.0, ans=0.0 2023-06-22 09:58:24,480 INFO [train.py:996] (2/4) Epoch 6, batch 4200, loss[loss=0.1959, simple_loss=0.2743, pruned_loss=0.05876, over 21559.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2966, pruned_loss=0.07127, over 4275458.06 frames. ], batch size: 195, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 09:58:55,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-22 09:59:01,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=940098.0, ans=0.2 2023-06-22 09:59:27,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=940158.0, ans=0.125 2023-06-22 09:59:32,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.356e+02 2.693e+02 3.335e+02 6.713e+02, threshold=5.385e+02, percent-clipped=2.0 2023-06-22 10:00:52,417 INFO [train.py:996] (2/4) Epoch 6, batch 4250, loss[loss=0.2625, simple_loss=0.3386, pruned_loss=0.09318, over 21718.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3048, pruned_loss=0.0739, over 4266317.45 frames. ], batch size: 351, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:01:00,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=940338.0, ans=0.125 2023-06-22 10:01:32,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=940398.0, ans=0.125 2023-06-22 10:03:03,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=940578.0, ans=0.0 2023-06-22 10:03:09,329 INFO [train.py:996] (2/4) Epoch 6, batch 4300, loss[loss=0.2525, simple_loss=0.3516, pruned_loss=0.07669, over 21637.00 frames. 
], tot_loss[loss=0.232, simple_loss=0.3118, pruned_loss=0.0761, over 4273642.90 frames. ], batch size: 414, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:03:09,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=940638.0, ans=0.125 2023-06-22 10:03:57,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940698.0, ans=0.1 2023-06-22 10:04:07,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=940758.0, ans=0.0 2023-06-22 10:04:07,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=940758.0, ans=0.0 2023-06-22 10:04:43,301 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.946e+02 3.391e+02 4.074e+02 6.738e+02, threshold=6.781e+02, percent-clipped=7.0 2023-06-22 10:05:01,042 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=15.0 2023-06-22 10:05:35,243 INFO [train.py:996] (2/4) Epoch 6, batch 4350, loss[loss=0.2188, simple_loss=0.2855, pruned_loss=0.07608, over 21885.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3092, pruned_loss=0.07514, over 4274436.60 frames. ], batch size: 107, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:05:49,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=940938.0, ans=0.125 2023-06-22 10:06:07,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=940998.0, ans=0.0 2023-06-22 10:06:11,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=940998.0, ans=0.0 2023-06-22 10:06:51,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-22 10:06:53,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=941118.0, ans=0.1 2023-06-22 10:07:05,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-06-22 10:07:38,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.17 vs. limit=22.5 2023-06-22 10:07:52,272 INFO [train.py:996] (2/4) Epoch 6, batch 4400, loss[loss=0.2032, simple_loss=0.2717, pruned_loss=0.06739, over 21200.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3047, pruned_loss=0.07485, over 4275327.50 frames. ], batch size: 548, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:07:53,311 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.60 vs. 
limit=6.0 2023-06-22 10:08:06,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=941238.0, ans=0.125 2023-06-22 10:09:18,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.562e+02 2.802e+02 3.458e+02 5.737e+02, threshold=5.605e+02, percent-clipped=0.0 2023-06-22 10:10:17,113 INFO [train.py:996] (2/4) Epoch 6, batch 4450, loss[loss=0.345, simple_loss=0.4322, pruned_loss=0.1289, over 21523.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3128, pruned_loss=0.07653, over 4271875.86 frames. ], batch size: 471, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:11:44,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=941718.0, ans=0.0 2023-06-22 10:12:15,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0 2023-06-22 10:12:37,933 INFO [train.py:996] (2/4) Epoch 6, batch 4500, loss[loss=0.2312, simple_loss=0.3188, pruned_loss=0.07178, over 21682.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.314, pruned_loss=0.07878, over 4281118.55 frames. ], batch size: 263, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:13:10,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=941898.0, ans=0.0 2023-06-22 10:13:17,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=941898.0, ans=0.125 2023-06-22 10:13:48,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=941958.0, ans=0.025 2023-06-22 10:13:51,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-22 10:13:55,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.436e+02 2.759e+02 3.510e+02 5.897e+02, threshold=5.518e+02, percent-clipped=2.0 2023-06-22 10:14:16,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=942018.0, ans=0.125 2023-06-22 10:14:23,341 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-22 10:15:15,313 INFO [train.py:996] (2/4) Epoch 6, batch 4550, loss[loss=0.3249, simple_loss=0.3758, pruned_loss=0.137, over 21321.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3173, pruned_loss=0.07932, over 4279542.64 frames. ], batch size: 507, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:15:22,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-22 10:15:41,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.34 vs. 
limit=6.0 2023-06-22 10:15:57,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=942258.0, ans=0.125 2023-06-22 10:16:54,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=942378.0, ans=0.0 2023-06-22 10:17:15,164 INFO [train.py:996] (2/4) Epoch 6, batch 4600, loss[loss=0.2441, simple_loss=0.3321, pruned_loss=0.07803, over 21611.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.32, pruned_loss=0.08112, over 4282358.45 frames. ], batch size: 414, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:18:38,666 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.628e+02 3.061e+02 3.498e+02 7.398e+02, threshold=6.122e+02, percent-clipped=1.0 2023-06-22 10:19:41,639 INFO [train.py:996] (2/4) Epoch 6, batch 4650, loss[loss=0.1727, simple_loss=0.2481, pruned_loss=0.04862, over 21753.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.314, pruned_loss=0.07981, over 4287682.12 frames. ], batch size: 298, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:19:48,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=942738.0, ans=0.125 2023-06-22 10:20:23,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=942858.0, ans=0.125 2023-06-22 10:20:24,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=942858.0, ans=0.05 2023-06-22 10:20:24,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=942858.0, ans=0.95 2023-06-22 10:21:22,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=942978.0, ans=0.125 2023-06-22 10:21:32,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=942978.0, ans=0.0 2023-06-22 10:21:47,938 INFO [train.py:996] (2/4) Epoch 6, batch 4700, loss[loss=0.2124, simple_loss=0.2699, pruned_loss=0.07744, over 21265.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3037, pruned_loss=0.07678, over 4283115.26 frames. ], batch size: 144, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:22:17,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=943038.0, ans=0.07 2023-06-22 10:22:40,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-22 10:22:57,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.395e+02 2.699e+02 3.102e+02 5.296e+02, threshold=5.398e+02, percent-clipped=0.0 2023-06-22 10:23:12,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=943218.0, ans=0.1 2023-06-22 10:23:44,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=943278.0, ans=0.125 2023-06-22 10:23:59,398 INFO [train.py:996] (2/4) Epoch 6, batch 4750, loss[loss=0.2343, simple_loss=0.2984, pruned_loss=0.08508, over 21859.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2977, pruned_loss=0.07605, over 4286411.85 frames. 
], batch size: 351, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:24:12,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=943338.0, ans=0.125 2023-06-22 10:25:32,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-22 10:26:16,876 INFO [train.py:996] (2/4) Epoch 6, batch 4800, loss[loss=0.2395, simple_loss=0.3452, pruned_loss=0.0669, over 21682.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2985, pruned_loss=0.07706, over 4293797.51 frames. ], batch size: 414, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:27:30,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=943758.0, ans=0.125 2023-06-22 10:27:31,094 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.729e+02 2.951e+02 3.440e+02 4.423e+02, threshold=5.901e+02, percent-clipped=0.0 2023-06-22 10:27:59,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=943818.0, ans=0.125 2023-06-22 10:28:18,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=943878.0, ans=0.0 2023-06-22 10:28:26,340 INFO [train.py:996] (2/4) Epoch 6, batch 4850, loss[loss=0.2597, simple_loss=0.3256, pruned_loss=0.09687, over 21667.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2984, pruned_loss=0.07658, over 4294377.84 frames. ], batch size: 441, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:28:28,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-22 10:28:31,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=943938.0, ans=0.125 2023-06-22 10:28:50,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=943938.0, ans=10.0 2023-06-22 10:29:22,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=944058.0, ans=0.5 2023-06-22 10:30:38,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=944178.0, ans=0.0 2023-06-22 10:30:41,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=944178.0, ans=0.0 2023-06-22 10:30:48,617 INFO [train.py:996] (2/4) Epoch 6, batch 4900, loss[loss=0.2386, simple_loss=0.3133, pruned_loss=0.08192, over 21305.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2989, pruned_loss=0.07709, over 4281495.84 frames. 
], batch size: 159, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:31:05,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=944238.0, ans=0.0 2023-06-22 10:31:43,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=944358.0, ans=0.125 2023-06-22 10:32:06,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=944358.0, ans=0.1 2023-06-22 10:32:13,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.505e+02 2.707e+02 3.125e+02 4.814e+02, threshold=5.414e+02, percent-clipped=0.0 2023-06-22 10:32:49,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=22.5 2023-06-22 10:33:20,831 INFO [train.py:996] (2/4) Epoch 6, batch 4950, loss[loss=0.1956, simple_loss=0.2929, pruned_loss=0.04916, over 21611.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3038, pruned_loss=0.07598, over 4275634.90 frames. ], batch size: 389, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:33:21,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=944538.0, ans=0.1 2023-06-22 10:33:24,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=944538.0, ans=0.125 2023-06-22 10:33:39,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=944538.0, ans=0.0 2023-06-22 10:33:48,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=944598.0, ans=0.09899494936611666 2023-06-22 10:34:06,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-22 10:35:31,665 INFO [train.py:996] (2/4) Epoch 6, batch 5000, loss[loss=0.224, simple_loss=0.2982, pruned_loss=0.07488, over 21469.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3034, pruned_loss=0.07269, over 4270519.34 frames. ], batch size: 211, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:36:12,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=944898.0, ans=0.2 2023-06-22 10:36:33,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=944958.0, ans=10.0 2023-06-22 10:36:52,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.481e+02 2.862e+02 3.375e+02 4.928e+02, threshold=5.725e+02, percent-clipped=0.0 2023-06-22 10:37:29,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=945078.0, ans=0.2 2023-06-22 10:37:30,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=945078.0, ans=0.125 2023-06-22 10:37:47,853 INFO [train.py:996] (2/4) Epoch 6, batch 5050, loss[loss=0.2221, simple_loss=0.2921, pruned_loss=0.076, over 21559.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3031, pruned_loss=0.07467, over 4276475.53 frames. 
], batch size: 212, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:38:15,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=945198.0, ans=0.125 2023-06-22 10:38:32,528 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-22 10:38:49,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=945258.0, ans=0.2 2023-06-22 10:39:05,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-22 10:39:31,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-22 10:39:40,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=945378.0, ans=0.025 2023-06-22 10:40:04,976 INFO [train.py:996] (2/4) Epoch 6, batch 5100, loss[loss=0.2555, simple_loss=0.3628, pruned_loss=0.07405, over 19796.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.302, pruned_loss=0.07418, over 4278022.66 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:40:36,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=945498.0, ans=0.125 2023-06-22 10:41:18,241 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.681e+02 3.101e+02 3.777e+02 6.060e+02, threshold=6.201e+02, percent-clipped=2.0 2023-06-22 10:41:24,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=945618.0, ans=0.0 2023-06-22 10:41:33,250 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:41:46,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=945618.0, ans=0.125 2023-06-22 10:42:14,431 INFO [train.py:996] (2/4) Epoch 6, batch 5150, loss[loss=0.2308, simple_loss=0.3107, pruned_loss=0.07547, over 21838.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3005, pruned_loss=0.07521, over 4286780.05 frames. ], batch size: 332, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:42:49,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=945798.0, ans=0.0 2023-06-22 10:43:07,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=945858.0, ans=0.125 2023-06-22 10:43:59,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=945918.0, ans=0.0 2023-06-22 10:44:08,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. 
limit=6.0 2023-06-22 10:44:25,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=945978.0, ans=0.125 2023-06-22 10:44:28,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=945978.0, ans=0.07 2023-06-22 10:44:33,434 INFO [train.py:996] (2/4) Epoch 6, batch 5200, loss[loss=0.2243, simple_loss=0.3108, pruned_loss=0.06893, over 21410.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3048, pruned_loss=0.07704, over 4287085.41 frames. ], batch size: 194, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:45:51,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=946158.0, ans=0.125 2023-06-22 10:45:52,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=946158.0, ans=0.125 2023-06-22 10:45:53,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=946158.0, ans=0.0 2023-06-22 10:45:55,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=946218.0, ans=0.0 2023-06-22 10:45:56,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.577e+02 3.076e+02 3.772e+02 6.113e+02, threshold=6.153e+02, percent-clipped=0.0 2023-06-22 10:46:50,509 INFO [train.py:996] (2/4) Epoch 6, batch 5250, loss[loss=0.2439, simple_loss=0.3328, pruned_loss=0.07746, over 21740.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3084, pruned_loss=0.07565, over 4272470.02 frames. ], batch size: 414, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:47:08,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=946338.0, ans=0.1 2023-06-22 10:47:25,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=946398.0, ans=0.125 2023-06-22 10:47:31,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=946398.0, ans=0.0 2023-06-22 10:47:41,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=946398.0, ans=0.125 2023-06-22 10:47:42,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-22 10:47:47,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=946398.0, ans=0.125 2023-06-22 10:48:01,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=946458.0, ans=0.0 2023-06-22 10:48:39,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=946518.0, ans=0.0 2023-06-22 10:49:19,193 INFO [train.py:996] (2/4) Epoch 6, batch 5300, loss[loss=0.2024, simple_loss=0.3062, pruned_loss=0.04931, over 19726.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.307, pruned_loss=0.0753, over 4266400.64 frames. 
], batch size: 703, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:50:30,807 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.544e+02 2.916e+02 3.415e+02 4.967e+02, threshold=5.832e+02, percent-clipped=0.0 2023-06-22 10:51:25,705 INFO [train.py:996] (2/4) Epoch 6, batch 5350, loss[loss=0.2191, simple_loss=0.2935, pruned_loss=0.07237, over 21552.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3044, pruned_loss=0.07608, over 4279172.22 frames. ], batch size: 131, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:52:20,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=946998.0, ans=0.025 2023-06-22 10:52:28,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=946998.0, ans=0.125 2023-06-22 10:52:31,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=947058.0, ans=0.125 2023-06-22 10:52:53,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=15.0 2023-06-22 10:53:23,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=947178.0, ans=0.02 2023-06-22 10:53:29,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=947178.0, ans=0.0 2023-06-22 10:54:00,185 INFO [train.py:996] (2/4) Epoch 6, batch 5400, loss[loss=0.2319, simple_loss=0.3066, pruned_loss=0.07859, over 21734.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.303, pruned_loss=0.07693, over 4285895.57 frames. ], batch size: 441, lr: 5.25e-03, grad_scale: 16.0 2023-06-22 10:55:20,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=947418.0, ans=0.125 2023-06-22 10:55:21,794 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.657e+02 2.997e+02 3.766e+02 7.720e+02, threshold=5.994e+02, percent-clipped=1.0 2023-06-22 10:55:26,588 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:55:30,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=947418.0, ans=0.125 2023-06-22 10:56:10,774 INFO [train.py:996] (2/4) Epoch 6, batch 5450, loss[loss=0.2128, simple_loss=0.2956, pruned_loss=0.06503, over 21177.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3027, pruned_loss=0.07534, over 4283614.62 frames. 
], batch size: 143, lr: 5.25e-03, grad_scale: 16.0 2023-06-22 10:56:14,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=947538.0, ans=0.0 2023-06-22 10:56:31,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=947538.0, ans=0.125 2023-06-22 10:56:44,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=947598.0, ans=0.1 2023-06-22 10:57:29,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=947658.0, ans=0.0 2023-06-22 10:57:30,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-22 10:57:39,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=947658.0, ans=0.0 2023-06-22 10:58:26,439 INFO [train.py:996] (2/4) Epoch 6, batch 5500, loss[loss=0.2086, simple_loss=0.3081, pruned_loss=0.0546, over 21706.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3078, pruned_loss=0.07308, over 4287807.81 frames. ], batch size: 298, lr: 5.25e-03, grad_scale: 16.0 2023-06-22 10:59:54,277 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.313e+02 2.665e+02 3.124e+02 5.281e+02, threshold=5.330e+02, percent-clipped=0.0 2023-06-22 11:00:09,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=948018.0, ans=0.125 2023-06-22 11:00:47,705 INFO [train.py:996] (2/4) Epoch 6, batch 5550, loss[loss=0.2315, simple_loss=0.3287, pruned_loss=0.06712, over 21434.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3059, pruned_loss=0.07005, over 4277136.42 frames. ], batch size: 471, lr: 5.25e-03, grad_scale: 16.0 2023-06-22 11:01:27,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=948198.0, ans=0.125 2023-06-22 11:02:51,124 INFO [train.py:996] (2/4) Epoch 6, batch 5600, loss[loss=0.3312, simple_loss=0.4156, pruned_loss=0.1234, over 21523.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3046, pruned_loss=0.06746, over 4280969.24 frames. ], batch size: 471, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:03:06,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=948438.0, ans=0.0 2023-06-22 11:03:06,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=948438.0, ans=0.125 2023-06-22 11:04:21,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.304e+02 2.857e+02 3.382e+02 5.869e+02, threshold=5.713e+02, percent-clipped=3.0 2023-06-22 11:05:01,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=948678.0, ans=0.0 2023-06-22 11:05:02,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=948678.0, ans=0.035 2023-06-22 11:05:07,826 INFO [train.py:996] (2/4) Epoch 6, batch 5650, loss[loss=0.2163, simple_loss=0.2932, pruned_loss=0.06969, over 21872.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3085, pruned_loss=0.07034, over 4280444.41 frames. 
], batch size: 298, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:05:34,684 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:05:37,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=948798.0, ans=0.1 2023-06-22 11:06:47,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=948918.0, ans=0.2 2023-06-22 11:07:46,666 INFO [train.py:996] (2/4) Epoch 6, batch 5700, loss[loss=0.2497, simple_loss=0.3358, pruned_loss=0.08178, over 21562.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3101, pruned_loss=0.07239, over 4282998.36 frames. ], batch size: 441, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:07:57,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-22 11:08:15,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=949098.0, ans=0.125 2023-06-22 11:08:37,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-22 11:08:50,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-22 11:09:03,670 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.632e+02 3.011e+02 3.527e+02 5.738e+02, threshold=6.022e+02, percent-clipped=1.0 2023-06-22 11:09:35,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-22 11:10:03,786 INFO [train.py:996] (2/4) Epoch 6, batch 5750, loss[loss=0.1481, simple_loss=0.228, pruned_loss=0.03407, over 21359.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3066, pruned_loss=0.07013, over 4280678.55 frames. ], batch size: 176, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:10:28,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=949398.0, ans=0.0 2023-06-22 11:11:05,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=949458.0, ans=0.0 2023-06-22 11:11:52,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=949518.0, ans=0.04949747468305833 2023-06-22 11:12:21,461 INFO [train.py:996] (2/4) Epoch 6, batch 5800, loss[loss=0.242, simple_loss=0.3398, pruned_loss=0.07207, over 21815.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3044, pruned_loss=0.06867, over 4274806.68 frames. ], batch size: 371, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:14:01,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.381e+02 2.868e+02 4.054e+02 6.693e+02, threshold=5.736e+02, percent-clipped=1.0 2023-06-22 11:14:58,430 INFO [train.py:996] (2/4) Epoch 6, batch 5850, loss[loss=0.1726, simple_loss=0.2689, pruned_loss=0.03814, over 21421.00 frames. ], tot_loss[loss=0.215, simple_loss=0.3012, pruned_loss=0.06446, over 4276204.55 frames. 
], batch size: 211, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:16:00,619 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-22 11:16:32,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=950118.0, ans=0.1 2023-06-22 11:17:07,183 INFO [train.py:996] (2/4) Epoch 6, batch 5900, loss[loss=0.2038, simple_loss=0.2827, pruned_loss=0.06239, over 21892.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2939, pruned_loss=0.05978, over 4277207.52 frames. ], batch size: 371, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:18:16,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950358.0, ans=0.1 2023-06-22 11:18:39,854 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.954e+02 2.379e+02 3.002e+02 5.426e+02, threshold=4.759e+02, percent-clipped=0.0 2023-06-22 11:19:04,191 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-22 11:19:22,076 INFO [train.py:996] (2/4) Epoch 6, batch 5950, loss[loss=0.2071, simple_loss=0.273, pruned_loss=0.07056, over 21408.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2925, pruned_loss=0.06293, over 4278837.14 frames. ], batch size: 194, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:21:26,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=950778.0, ans=0.1 2023-06-22 11:21:37,680 INFO [train.py:996] (2/4) Epoch 6, batch 6000, loss[loss=0.202, simple_loss=0.2629, pruned_loss=0.0705, over 21270.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2885, pruned_loss=0.06624, over 4265754.51 frames. ], batch size: 176, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:21:37,681 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 11:22:41,438 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2615, simple_loss=0.3543, pruned_loss=0.08434, over 1796401.00 frames. 2023-06-22 11:22:41,439 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-22 11:22:53,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. 
limit=15.0 2023-06-22 11:23:19,147 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:23:25,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=950958.0, ans=0.125 2023-06-22 11:23:47,008 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.609e+02 2.903e+02 3.362e+02 5.705e+02, threshold=5.807e+02, percent-clipped=2.0 2023-06-22 11:23:47,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=951018.0, ans=0.2 2023-06-22 11:24:15,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=951078.0, ans=0.0 2023-06-22 11:24:29,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=951138.0, ans=0.04949747468305833 2023-06-22 11:24:30,287 INFO [train.py:996] (2/4) Epoch 6, batch 6050, loss[loss=0.1973, simple_loss=0.2589, pruned_loss=0.06785, over 21616.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2841, pruned_loss=0.06702, over 4262869.82 frames. ], batch size: 298, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:25:42,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=951258.0, ans=0.2 2023-06-22 11:25:53,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=951258.0, ans=0.0 2023-06-22 11:25:53,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-22 11:26:01,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=951258.0, ans=0.0 2023-06-22 11:26:27,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=951378.0, ans=0.125 2023-06-22 11:26:33,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=951378.0, ans=0.125 2023-06-22 11:26:42,251 INFO [train.py:996] (2/4) Epoch 6, batch 6100, loss[loss=0.2258, simple_loss=0.3046, pruned_loss=0.07347, over 21712.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2825, pruned_loss=0.06573, over 4264412.34 frames. ], batch size: 389, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:27:26,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=951498.0, ans=0.125 2023-06-22 11:28:19,202 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 2.238e+02 2.454e+02 2.758e+02 3.934e+02, threshold=4.908e+02, percent-clipped=0.0 2023-06-22 11:28:21,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=951618.0, ans=0.2 2023-06-22 11:28:31,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=951678.0, ans=0.0 2023-06-22 11:28:59,898 INFO [train.py:996] (2/4) Epoch 6, batch 6150, loss[loss=0.2137, simple_loss=0.2906, pruned_loss=0.06838, over 21750.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.286, pruned_loss=0.06847, over 4276599.56 frames. 
], batch size: 124, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:30:17,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=951858.0, ans=0.0 2023-06-22 11:30:51,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=951978.0, ans=0.125 2023-06-22 11:31:17,606 INFO [train.py:996] (2/4) Epoch 6, batch 6200, loss[loss=0.2624, simple_loss=0.3422, pruned_loss=0.09133, over 21494.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2901, pruned_loss=0.06889, over 4277210.61 frames. ], batch size: 548, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:31:29,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=952038.0, ans=0.0 2023-06-22 11:32:50,651 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.387e+02 2.806e+02 3.164e+02 6.088e+02, threshold=5.612e+02, percent-clipped=2.0 2023-06-22 11:33:38,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-22 11:33:45,688 INFO [train.py:996] (2/4) Epoch 6, batch 6250, loss[loss=0.2452, simple_loss=0.3492, pruned_loss=0.07064, over 21769.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2967, pruned_loss=0.06881, over 4282814.25 frames. ], batch size: 332, lr: 5.24e-03, grad_scale: 16.0 2023-06-22 11:34:31,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-22 11:34:49,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=952458.0, ans=0.125 2023-06-22 11:36:01,121 INFO [train.py:996] (2/4) Epoch 6, batch 6300, loss[loss=0.2372, simple_loss=0.3088, pruned_loss=0.08278, over 21932.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2989, pruned_loss=0.06799, over 4284841.95 frames. ], batch size: 113, lr: 5.24e-03, grad_scale: 16.0 2023-06-22 11:36:21,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-22 11:36:23,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.56 vs. limit=10.0 2023-06-22 11:36:37,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=952698.0, ans=0.125 2023-06-22 11:37:15,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952818.0, ans=0.1 2023-06-22 11:37:18,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.436e+02 3.137e+02 3.711e+02 6.138e+02, threshold=6.275e+02, percent-clipped=3.0 2023-06-22 11:37:22,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=952818.0, ans=0.125 2023-06-22 11:37:31,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. 
limit=15.0 2023-06-22 11:38:05,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.10 vs. limit=6.0 2023-06-22 11:38:09,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=952938.0, ans=0.125 2023-06-22 11:38:10,540 INFO [train.py:996] (2/4) Epoch 6, batch 6350, loss[loss=0.2487, simple_loss=0.3142, pruned_loss=0.09166, over 21622.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3037, pruned_loss=0.07242, over 4291227.45 frames. ], batch size: 263, lr: 5.24e-03, grad_scale: 16.0 2023-06-22 11:38:37,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=952998.0, ans=0.2 2023-06-22 11:38:50,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=952998.0, ans=0.0 2023-06-22 11:39:00,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=953058.0, ans=0.0 2023-06-22 11:39:02,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=953058.0, ans=0.0 2023-06-22 11:39:35,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-22 11:39:46,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=953118.0, ans=0.125 2023-06-22 11:40:05,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=953178.0, ans=0.125 2023-06-22 11:40:14,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=953178.0, ans=0.04949747468305833 2023-06-22 11:40:25,776 INFO [train.py:996] (2/4) Epoch 6, batch 6400, loss[loss=0.2421, simple_loss=0.3101, pruned_loss=0.08707, over 21361.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3092, pruned_loss=0.07684, over 4294084.91 frames. ], batch size: 548, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:40:27,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=953238.0, ans=0.1 2023-06-22 11:41:03,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=953298.0, ans=0.125 2023-06-22 11:41:05,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=953298.0, ans=0.125 2023-06-22 11:41:09,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=953298.0, ans=0.125 2023-06-22 11:42:05,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.715e+02 2.941e+02 3.415e+02 4.411e+02, threshold=5.882e+02, percent-clipped=0.0 2023-06-22 11:42:10,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=953418.0, ans=0.07 2023-06-22 11:42:49,214 INFO [train.py:996] (2/4) Epoch 6, batch 6450, loss[loss=0.1907, simple_loss=0.2609, pruned_loss=0.06028, over 21829.00 frames. 
], tot_loss[loss=0.2318, simple_loss=0.3115, pruned_loss=0.07601, over 4295780.67 frames. ], batch size: 107, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:43:23,010 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2023-06-22 11:43:26,037 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-22 11:43:28,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=953598.0, ans=0.125 2023-06-22 11:44:02,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=953718.0, ans=0.125 2023-06-22 11:45:02,722 INFO [train.py:996] (2/4) Epoch 6, batch 6500, loss[loss=0.2644, simple_loss=0.371, pruned_loss=0.07892, over 19747.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3065, pruned_loss=0.07434, over 4281172.50 frames. ], batch size: 703, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:46:20,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=954018.0, ans=0.1 2023-06-22 11:46:23,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.489e+02 2.776e+02 3.304e+02 5.891e+02, threshold=5.553e+02, percent-clipped=1.0 2023-06-22 11:46:24,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-22 11:46:47,361 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-22 11:46:48,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=954018.0, ans=0.125 2023-06-22 11:47:14,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-22 11:47:16,597 INFO [train.py:996] (2/4) Epoch 6, batch 6550, loss[loss=0.2228, simple_loss=0.2983, pruned_loss=0.07364, over 21707.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3039, pruned_loss=0.07291, over 4267928.03 frames. ], batch size: 389, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:47:18,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=954138.0, ans=0.125 2023-06-22 11:48:07,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=954258.0, ans=0.1 2023-06-22 11:48:46,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=954318.0, ans=0.2 2023-06-22 11:48:58,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-22 11:49:16,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=954438.0, ans=0.0 2023-06-22 11:49:17,332 INFO [train.py:996] (2/4) Epoch 6, batch 6600, loss[loss=0.2113, simple_loss=0.2726, pruned_loss=0.07497, over 21563.00 frames. 
], tot_loss[loss=0.221, simple_loss=0.2977, pruned_loss=0.07214, over 4273299.91 frames. ], batch size: 391, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 11:50:33,986 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-22 11:50:36,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=954618.0, ans=0.125 2023-06-22 11:50:38,844 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.307e+02 2.568e+02 2.890e+02 5.547e+02, threshold=5.135e+02, percent-clipped=0.0 2023-06-22 11:50:50,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-22 11:51:25,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=954678.0, ans=0.125 2023-06-22 11:51:30,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=954738.0, ans=0.125 2023-06-22 11:51:31,384 INFO [train.py:996] (2/4) Epoch 6, batch 6650, loss[loss=0.1847, simple_loss=0.2546, pruned_loss=0.05741, over 21596.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2893, pruned_loss=0.06976, over 4273869.94 frames. ], batch size: 247, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 11:51:40,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=954738.0, ans=0.125 2023-06-22 11:52:46,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=954858.0, ans=0.125 2023-06-22 11:53:19,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=22.5 2023-06-22 11:53:52,924 INFO [train.py:996] (2/4) Epoch 6, batch 6700, loss[loss=0.2072, simple_loss=0.2791, pruned_loss=0.06763, over 21551.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2833, pruned_loss=0.06975, over 4272105.07 frames. ], batch size: 230, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 11:53:53,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=955038.0, ans=0.125 2023-06-22 11:53:53,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=955038.0, ans=0.0 2023-06-22 11:54:48,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=955158.0, ans=0.2 2023-06-22 11:54:53,684 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:54:58,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.52 vs. 
limit=10.0 2023-06-22 11:55:20,136 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.337e+02 2.565e+02 3.029e+02 4.063e+02, threshold=5.130e+02, percent-clipped=0.0 2023-06-22 11:55:47,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=955278.0, ans=0.0 2023-06-22 11:56:01,268 INFO [train.py:996] (2/4) Epoch 6, batch 6750, loss[loss=0.213, simple_loss=0.2827, pruned_loss=0.07166, over 21782.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2818, pruned_loss=0.07017, over 4278402.00 frames. ], batch size: 112, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 11:57:05,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=955458.0, ans=0.2 2023-06-22 11:57:24,924 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-22 11:58:11,988 INFO [train.py:996] (2/4) Epoch 6, batch 6800, loss[loss=0.2361, simple_loss=0.2955, pruned_loss=0.0883, over 21743.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2861, pruned_loss=0.0738, over 4275996.11 frames. ], batch size: 112, lr: 5.23e-03, grad_scale: 32.0 2023-06-22 11:58:38,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=955638.0, ans=0.125 2023-06-22 11:59:34,862 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:59:41,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=955818.0, ans=0.125 2023-06-22 11:59:49,861 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.516e+02 2.922e+02 3.493e+02 5.598e+02, threshold=5.845e+02, percent-clipped=3.0 2023-06-22 12:00:24,112 INFO [train.py:996] (2/4) Epoch 6, batch 6850, loss[loss=0.2185, simple_loss=0.2857, pruned_loss=0.07561, over 21895.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2849, pruned_loss=0.07436, over 4278071.01 frames. ], batch size: 316, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:00:37,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=955938.0, ans=0.0 2023-06-22 12:01:08,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=22.5 2023-06-22 12:02:41,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=956178.0, ans=0.07 2023-06-22 12:02:52,553 INFO [train.py:996] (2/4) Epoch 6, batch 6900, loss[loss=0.1891, simple_loss=0.2719, pruned_loss=0.05309, over 21234.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2859, pruned_loss=0.07417, over 4279977.20 frames. ], batch size: 159, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:04:50,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.414e+02 2.854e+02 3.721e+02 5.667e+02, threshold=5.709e+02, percent-clipped=0.0 2023-06-22 12:05:19,760 INFO [train.py:996] (2/4) Epoch 6, batch 6950, loss[loss=0.2036, simple_loss=0.2682, pruned_loss=0.06945, over 21272.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2893, pruned_loss=0.07085, over 4282046.08 frames. 
], batch size: 143, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:05:58,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=956598.0, ans=0.125 2023-06-22 12:06:58,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=956718.0, ans=0.0 2023-06-22 12:07:18,200 INFO [train.py:996] (2/4) Epoch 6, batch 7000, loss[loss=0.2154, simple_loss=0.29, pruned_loss=0.07039, over 15400.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2931, pruned_loss=0.07365, over 4264843.30 frames. ], batch size: 61, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:07:44,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=956838.0, ans=0.1 2023-06-22 12:08:41,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=957018.0, ans=0.035 2023-06-22 12:08:50,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.97 vs. limit=10.0 2023-06-22 12:08:59,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 2.490e+02 2.766e+02 3.246e+02 6.090e+02, threshold=5.532e+02, percent-clipped=1.0 2023-06-22 12:09:37,434 INFO [train.py:996] (2/4) Epoch 6, batch 7050, loss[loss=0.1912, simple_loss=0.2844, pruned_loss=0.04896, over 21833.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2909, pruned_loss=0.07258, over 4263443.51 frames. ], batch size: 371, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:11:44,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=957378.0, ans=0.2 2023-06-22 12:11:59,671 INFO [train.py:996] (2/4) Epoch 6, batch 7100, loss[loss=0.2874, simple_loss=0.3487, pruned_loss=0.1131, over 21457.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2963, pruned_loss=0.07564, over 4258140.06 frames. ], batch size: 509, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:12:07,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=957438.0, ans=0.125 2023-06-22 12:12:28,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=957498.0, ans=0.125 2023-06-22 12:12:30,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=957498.0, ans=0.0 2023-06-22 12:13:00,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=957558.0, ans=0.0 2023-06-22 12:13:01,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=957558.0, ans=0.125 2023-06-22 12:13:26,828 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.296e+02 2.660e+02 3.134e+02 4.737e+02, threshold=5.321e+02, percent-clipped=0.0 2023-06-22 12:14:13,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=957738.0, ans=0.0 2023-06-22 12:14:15,110 INFO [train.py:996] (2/4) Epoch 6, batch 7150, loss[loss=0.2314, simple_loss=0.3252, pruned_loss=0.06887, over 16663.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.292, pruned_loss=0.07226, over 4249969.12 frames. 
], batch size: 60, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:14:37,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-22 12:15:11,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=957858.0, ans=0.125 2023-06-22 12:15:26,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957858.0, ans=0.1 2023-06-22 12:15:32,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=957918.0, ans=0.0 2023-06-22 12:16:21,673 INFO [train.py:996] (2/4) Epoch 6, batch 7200, loss[loss=0.2255, simple_loss=0.2892, pruned_loss=0.08085, over 21416.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2956, pruned_loss=0.07506, over 4261993.31 frames. ], batch size: 131, lr: 5.23e-03, grad_scale: 32.0 2023-06-22 12:16:30,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=958038.0, ans=0.125 2023-06-22 12:16:49,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=958038.0, ans=0.125 2023-06-22 12:17:25,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=958098.0, ans=0.125 2023-06-22 12:17:51,029 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.534e+02 2.859e+02 3.479e+02 6.830e+02, threshold=5.718e+02, percent-clipped=3.0 2023-06-22 12:18:29,939 INFO [train.py:996] (2/4) Epoch 6, batch 7250, loss[loss=0.2241, simple_loss=0.2952, pruned_loss=0.0765, over 21798.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2913, pruned_loss=0.07464, over 4267092.42 frames. ], batch size: 107, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:18:49,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=958398.0, ans=0.125 2023-06-22 12:19:13,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=958398.0, ans=0.07 2023-06-22 12:19:17,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=958398.0, ans=0.125 2023-06-22 12:20:00,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=958518.0, ans=0.125 2023-06-22 12:20:33,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-22 12:20:38,077 INFO [train.py:996] (2/4) Epoch 6, batch 7300, loss[loss=0.189, simple_loss=0.256, pruned_loss=0.061, over 21575.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2855, pruned_loss=0.07348, over 4269992.64 frames. 
], batch size: 247, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:22:08,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.524e+02 2.909e+02 3.495e+02 5.392e+02, threshold=5.818e+02, percent-clipped=0.0 2023-06-22 12:22:28,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=958878.0, ans=0.09899494936611666 2023-06-22 12:22:45,835 INFO [train.py:996] (2/4) Epoch 6, batch 7350, loss[loss=0.2738, simple_loss=0.3513, pruned_loss=0.09815, over 21860.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2843, pruned_loss=0.07341, over 4261629.26 frames. ], batch size: 124, lr: 5.22e-03, grad_scale: 16.0 2023-06-22 12:23:25,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=958998.0, ans=0.125 2023-06-22 12:23:38,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-22 12:23:38,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.23 vs. limit=22.5 2023-06-22 12:24:11,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=959118.0, ans=0.0 2023-06-22 12:24:52,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=959178.0, ans=0.0 2023-06-22 12:25:08,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=959178.0, ans=0.025 2023-06-22 12:25:17,252 INFO [train.py:996] (2/4) Epoch 6, batch 7400, loss[loss=0.2168, simple_loss=0.3151, pruned_loss=0.05927, over 21827.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2899, pruned_loss=0.07561, over 4269952.93 frames. ], batch size: 372, lr: 5.22e-03, grad_scale: 16.0 2023-06-22 12:25:38,757 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.19 vs. limit=15.0 2023-06-22 12:26:17,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=959358.0, ans=0.0 2023-06-22 12:26:47,313 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.566e+02 3.026e+02 3.551e+02 6.030e+02, threshold=6.052e+02, percent-clipped=1.0 2023-06-22 12:26:52,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-22 12:27:22,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.64 vs. limit=15.0 2023-06-22 12:27:30,397 INFO [train.py:996] (2/4) Epoch 6, batch 7450, loss[loss=0.2147, simple_loss=0.2801, pruned_loss=0.07467, over 21624.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2887, pruned_loss=0.07468, over 4270921.84 frames. ], batch size: 298, lr: 5.22e-03, grad_scale: 16.0 2023-06-22 12:27:36,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=959538.0, ans=0.125 2023-06-22 12:27:57,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.01 vs. 
limit=22.5 2023-06-22 12:28:01,870 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:28:03,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=959598.0, ans=0.125 2023-06-22 12:28:31,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=959658.0, ans=0.0 2023-06-22 12:28:33,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=22.5 2023-06-22 12:28:35,932 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:28:42,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=959718.0, ans=0.0 2023-06-22 12:28:43,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=959718.0, ans=0.125 2023-06-22 12:28:53,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=959718.0, ans=0.125 2023-06-22 12:28:54,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-22 12:29:03,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=959718.0, ans=0.125 2023-06-22 12:29:04,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=959718.0, ans=0.0 2023-06-22 12:29:34,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=959778.0, ans=0.125 2023-06-22 12:29:38,269 INFO [train.py:996] (2/4) Epoch 6, batch 7500, loss[loss=0.2424, simple_loss=0.3368, pruned_loss=0.07396, over 21443.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2933, pruned_loss=0.07647, over 4270623.51 frames. ], batch size: 211, lr: 5.22e-03, grad_scale: 16.0 2023-06-22 12:30:33,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-22 12:30:48,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=959958.0, ans=0.125 2023-06-22 12:31:28,375 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.801e+02 3.373e+02 4.203e+02 7.469e+02, threshold=6.746e+02, percent-clipped=3.0 2023-06-22 12:31:50,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=960078.0, ans=0.125 2023-06-22 12:32:01,775 INFO [train.py:996] (2/4) Epoch 6, batch 7550, loss[loss=0.1575, simple_loss=0.2326, pruned_loss=0.04123, over 15842.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2986, pruned_loss=0.0749, over 4266324.97 frames. 
], batch size: 62, lr: 5.22e-03, grad_scale: 16.0 2023-06-22 12:32:36,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=960198.0, ans=0.125 2023-06-22 12:33:57,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=960378.0, ans=0.125 2023-06-22 12:34:12,401 INFO [train.py:996] (2/4) Epoch 6, batch 7600, loss[loss=0.221, simple_loss=0.2853, pruned_loss=0.07833, over 21584.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2992, pruned_loss=0.07444, over 4271234.11 frames. ], batch size: 548, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:34:49,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-06-22 12:35:00,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960498.0, ans=0.1 2023-06-22 12:35:07,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-22 12:35:50,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=960618.0, ans=0.0 2023-06-22 12:35:59,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.444e+02 2.746e+02 3.359e+02 4.824e+02, threshold=5.491e+02, percent-clipped=0.0 2023-06-22 12:36:01,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=960618.0, ans=0.0 2023-06-22 12:36:39,356 INFO [train.py:996] (2/4) Epoch 6, batch 7650, loss[loss=0.247, simple_loss=0.3412, pruned_loss=0.07641, over 20087.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2979, pruned_loss=0.07527, over 4276350.81 frames. ], batch size: 703, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:36:53,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=15.0 2023-06-22 12:36:57,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=960798.0, ans=0.125 2023-06-22 12:37:07,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=960798.0, ans=0.0 2023-06-22 12:37:27,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960858.0, ans=0.1 2023-06-22 12:38:15,951 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:38:24,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=960978.0, ans=0.0 2023-06-22 12:38:24,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=960978.0, ans=0.0 2023-06-22 12:38:35,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.17 vs. limit=22.5 2023-06-22 12:38:40,601 INFO [train.py:996] (2/4) Epoch 6, batch 7700, loss[loss=0.2699, simple_loss=0.3393, pruned_loss=0.1003, over 21622.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.302, pruned_loss=0.07864, over 4279192.25 frames. 
], batch size: 389, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:39:20,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=961098.0, ans=0.0 2023-06-22 12:39:42,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=961158.0, ans=0.2 2023-06-22 12:39:47,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=961158.0, ans=0.2 2023-06-22 12:40:00,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=961158.0, ans=0.1 2023-06-22 12:40:09,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=961218.0, ans=0.125 2023-06-22 12:40:18,780 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.597e+02 2.996e+02 3.499e+02 4.592e+02, threshold=5.993e+02, percent-clipped=0.0 2023-06-22 12:40:28,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=961218.0, ans=0.2 2023-06-22 12:41:02,099 INFO [train.py:996] (2/4) Epoch 6, batch 7750, loss[loss=0.2552, simple_loss=0.3545, pruned_loss=0.07797, over 21762.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3089, pruned_loss=0.07903, over 4277776.60 frames. ], batch size: 282, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:43:18,985 INFO [train.py:996] (2/4) Epoch 6, batch 7800, loss[loss=0.2315, simple_loss=0.306, pruned_loss=0.07849, over 21761.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3135, pruned_loss=0.08086, over 4275136.43 frames. ], batch size: 282, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:43:54,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=961698.0, ans=0.1 2023-06-22 12:43:56,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-22 12:44:21,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.76 vs. limit=15.0 2023-06-22 12:44:25,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=961758.0, ans=0.0 2023-06-22 12:44:25,424 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:44:40,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.813e+02 3.314e+02 4.119e+02 8.453e+02, threshold=6.627e+02, percent-clipped=5.0 2023-06-22 12:44:53,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=22.5 2023-06-22 12:45:23,678 INFO [train.py:996] (2/4) Epoch 6, batch 7850, loss[loss=0.2063, simple_loss=0.2696, pruned_loss=0.0715, over 21616.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3045, pruned_loss=0.07947, over 4263205.78 frames. 
], batch size: 298, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 12:46:07,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=962058.0, ans=0.1 2023-06-22 12:46:28,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-22 12:46:35,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=962118.0, ans=0.125 2023-06-22 12:46:57,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=962118.0, ans=0.1 2023-06-22 12:47:34,152 INFO [train.py:996] (2/4) Epoch 6, batch 7900, loss[loss=0.1966, simple_loss=0.2619, pruned_loss=0.0657, over 21143.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2992, pruned_loss=0.07802, over 4254980.05 frames. ], batch size: 143, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 12:48:00,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=962238.0, ans=0.125 2023-06-22 12:49:17,364 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.834e+02 3.343e+02 3.831e+02 7.219e+02, threshold=6.686e+02, percent-clipped=3.0 2023-06-22 12:49:52,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=962478.0, ans=0.125 2023-06-22 12:50:10,557 INFO [train.py:996] (2/4) Epoch 6, batch 7950, loss[loss=0.2651, simple_loss=0.3427, pruned_loss=0.09377, over 21750.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3028, pruned_loss=0.07716, over 4256890.40 frames. ], batch size: 441, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 12:50:17,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=962538.0, ans=0.125 2023-06-22 12:50:33,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=962598.0, ans=0.035 2023-06-22 12:51:53,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-22 12:52:48,111 INFO [train.py:996] (2/4) Epoch 6, batch 8000, loss[loss=0.2472, simple_loss=0.322, pruned_loss=0.08621, over 21445.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3073, pruned_loss=0.07914, over 4250965.10 frames. ], batch size: 194, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 12:52:50,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-22 12:52:58,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. 
limit=15.0 2023-06-22 12:53:05,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=962838.0, ans=0.1 2023-06-22 12:53:12,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=962898.0, ans=0.0 2023-06-22 12:53:25,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.29 vs. limit=15.0 2023-06-22 12:54:34,223 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.700e+02 3.556e+02 4.480e+02 7.069e+02, threshold=7.112e+02, percent-clipped=3.0 2023-06-22 12:54:35,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.16 vs. limit=22.5 2023-06-22 12:55:13,308 INFO [train.py:996] (2/4) Epoch 6, batch 8050, loss[loss=0.2699, simple_loss=0.3572, pruned_loss=0.09128, over 21692.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3134, pruned_loss=0.07992, over 4258715.89 frames. ], batch size: 389, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 12:55:28,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=963198.0, ans=0.125 2023-06-22 12:55:29,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=963198.0, ans=0.125 2023-06-22 12:56:08,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=963258.0, ans=0.1 2023-06-22 12:56:24,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=963258.0, ans=0.0 2023-06-22 12:56:54,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=963318.0, ans=0.125 2023-06-22 12:57:34,232 INFO [train.py:996] (2/4) Epoch 6, batch 8100, loss[loss=0.2438, simple_loss=0.3097, pruned_loss=0.08898, over 21538.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3107, pruned_loss=0.08039, over 4264046.73 frames. ], batch size: 548, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 12:58:46,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=963558.0, ans=0.1 2023-06-22 12:59:45,220 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 2.718e+02 3.272e+02 3.962e+02 1.016e+03, threshold=6.543e+02, percent-clipped=3.0 2023-06-22 12:59:54,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=963678.0, ans=0.0 2023-06-22 13:00:20,322 INFO [train.py:996] (2/4) Epoch 6, batch 8150, loss[loss=0.2199, simple_loss=0.3113, pruned_loss=0.06426, over 21597.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3188, pruned_loss=0.08335, over 4258187.03 frames. ], batch size: 263, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 13:02:16,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=963978.0, ans=0.0 2023-06-22 13:02:25,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=963978.0, ans=0.1 2023-06-22 13:02:29,435 INFO [train.py:996] (2/4) Epoch 6, batch 8200, loss[loss=0.2111, simple_loss=0.2699, pruned_loss=0.07618, over 21135.00 frames. 
], tot_loss[loss=0.2344, simple_loss=0.3098, pruned_loss=0.07954, over 4260605.17 frames. ], batch size: 159, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 13:03:03,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=964098.0, ans=0.1 2023-06-22 13:03:11,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=964098.0, ans=0.125 2023-06-22 13:03:26,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-22 13:03:32,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.87 vs. limit=15.0 2023-06-22 13:04:16,274 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.425e+02 2.871e+02 3.494e+02 6.098e+02, threshold=5.742e+02, percent-clipped=0.0 2023-06-22 13:04:59,540 INFO [train.py:996] (2/4) Epoch 6, batch 8250, loss[loss=0.1989, simple_loss=0.2971, pruned_loss=0.05038, over 19924.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3088, pruned_loss=0.07966, over 4268444.63 frames. ], batch size: 703, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 13:05:01,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=964338.0, ans=0.125 2023-06-22 13:05:10,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=964338.0, ans=0.07 2023-06-22 13:05:19,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=964398.0, ans=0.2 2023-06-22 13:05:24,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-22 13:06:50,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-22 13:07:14,484 INFO [train.py:996] (2/4) Epoch 6, batch 8300, loss[loss=0.2519, simple_loss=0.3339, pruned_loss=0.08497, over 21639.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3073, pruned_loss=0.07664, over 4265998.47 frames. ], batch size: 389, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 13:07:51,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=964698.0, ans=0.0 2023-06-22 13:08:39,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=964818.0, ans=0.2 2023-06-22 13:08:39,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=964818.0, ans=0.025 2023-06-22 13:08:40,411 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.450e+02 2.816e+02 3.478e+02 6.310e+02, threshold=5.632e+02, percent-clipped=2.0 2023-06-22 13:08:52,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=964878.0, ans=0.125 2023-06-22 13:09:32,578 INFO [train.py:996] (2/4) Epoch 6, batch 8350, loss[loss=0.2071, simple_loss=0.2881, pruned_loss=0.06305, over 21540.00 frames. 
], tot_loss[loss=0.2283, simple_loss=0.3066, pruned_loss=0.07498, over 4272805.94 frames. ], batch size: 212, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 13:10:07,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=964998.0, ans=0.0 2023-06-22 13:11:37,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=965178.0, ans=0.05 2023-06-22 13:11:45,371 INFO [train.py:996] (2/4) Epoch 6, batch 8400, loss[loss=0.174, simple_loss=0.255, pruned_loss=0.04647, over 21199.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3047, pruned_loss=0.07265, over 4273076.23 frames. ], batch size: 143, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 13:13:14,775 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.764e+02 2.326e+02 2.586e+02 3.002e+02 4.637e+02, threshold=5.171e+02, percent-clipped=0.0 2023-06-22 13:13:50,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.96 vs. limit=22.5 2023-06-22 13:13:50,942 INFO [train.py:996] (2/4) Epoch 6, batch 8450, loss[loss=0.231, simple_loss=0.3007, pruned_loss=0.08064, over 21866.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3021, pruned_loss=0.072, over 4277288.60 frames. ], batch size: 118, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:13:55,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=965538.0, ans=0.125 2023-06-22 13:14:23,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-22 13:14:43,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=965658.0, ans=0.1 2023-06-22 13:16:01,016 INFO [train.py:996] (2/4) Epoch 6, batch 8500, loss[loss=0.1985, simple_loss=0.253, pruned_loss=0.07202, over 21265.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2975, pruned_loss=0.07294, over 4280791.40 frames. ], batch size: 548, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:16:29,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=965838.0, ans=0.2 2023-06-22 13:16:42,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=965898.0, ans=0.0 2023-06-22 13:17:12,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=965958.0, ans=0.0 2023-06-22 13:17:48,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.784e+02 3.161e+02 3.760e+02 5.772e+02, threshold=6.322e+02, percent-clipped=2.0 2023-06-22 13:17:52,386 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:18:27,235 INFO [train.py:996] (2/4) Epoch 6, batch 8550, loss[loss=0.2625, simple_loss=0.3527, pruned_loss=0.08614, over 21766.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3011, pruned_loss=0.07543, over 4273045.98 frames. 
], batch size: 351, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:18:27,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=966138.0, ans=0.0 2023-06-22 13:18:27,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=966138.0, ans=0.0 2023-06-22 13:18:28,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=966138.0, ans=0.125 2023-06-22 13:18:42,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.84 vs. limit=10.0 2023-06-22 13:19:37,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-22 13:19:38,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-22 13:20:31,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=966378.0, ans=0.0 2023-06-22 13:20:51,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=966378.0, ans=0.1 2023-06-22 13:21:01,120 INFO [train.py:996] (2/4) Epoch 6, batch 8600, loss[loss=0.2397, simple_loss=0.3184, pruned_loss=0.08048, over 21601.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3092, pruned_loss=0.07762, over 4273102.95 frames. ], batch size: 263, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:21:07,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=966438.0, ans=0.125 2023-06-22 13:21:14,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-22 13:22:45,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=966618.0, ans=0.025 2023-06-22 13:22:56,129 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.812e+02 3.242e+02 4.124e+02 6.124e+02, threshold=6.484e+02, percent-clipped=0.0 2023-06-22 13:23:27,231 INFO [train.py:996] (2/4) Epoch 6, batch 8650, loss[loss=0.2534, simple_loss=0.3447, pruned_loss=0.08106, over 21537.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3158, pruned_loss=0.07783, over 4278993.85 frames. ], batch size: 471, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:23:50,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=966798.0, ans=0.1 2023-06-22 13:23:57,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=966798.0, ans=0.0 2023-06-22 13:24:25,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=966858.0, ans=0.0 2023-06-22 13:25:24,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=967038.0, ans=0.0 2023-06-22 13:25:25,312 INFO [train.py:996] (2/4) Epoch 6, batch 8700, loss[loss=0.1977, simple_loss=0.2638, pruned_loss=0.06582, over 21647.00 frames. 
], tot_loss[loss=0.2281, simple_loss=0.3072, pruned_loss=0.07454, over 4277923.65 frames. ], batch size: 282, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:26:02,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=967098.0, ans=0.0 2023-06-22 13:26:20,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=967158.0, ans=0.1 2023-06-22 13:26:56,673 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 2.253e+02 2.607e+02 2.953e+02 4.706e+02, threshold=5.214e+02, percent-clipped=0.0 2023-06-22 13:27:05,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=967278.0, ans=0.125 2023-06-22 13:27:26,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=967338.0, ans=0.2 2023-06-22 13:27:27,310 INFO [train.py:996] (2/4) Epoch 6, batch 8750, loss[loss=0.2524, simple_loss=0.3597, pruned_loss=0.07252, over 20829.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.304, pruned_loss=0.07545, over 4279757.19 frames. ], batch size: 608, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:27:59,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-22 13:28:04,276 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-22 13:28:06,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=967398.0, ans=0.0 2023-06-22 13:29:43,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=967578.0, ans=0.1 2023-06-22 13:29:46,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=967578.0, ans=0.09899494936611666 2023-06-22 13:29:48,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=967578.0, ans=0.1 2023-06-22 13:29:51,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=967578.0, ans=0.2 2023-06-22 13:29:59,530 INFO [train.py:996] (2/4) Epoch 6, batch 8800, loss[loss=0.2747, simple_loss=0.3697, pruned_loss=0.08985, over 19945.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3117, pruned_loss=0.07792, over 4285215.40 frames. 
], batch size: 702, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:30:06,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=967638.0, ans=0.125 2023-06-22 13:30:32,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=967698.0, ans=0.125 2023-06-22 13:31:17,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=967758.0, ans=0.125 2023-06-22 13:31:46,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 2.702e+02 3.094e+02 3.585e+02 5.689e+02, threshold=6.187e+02, percent-clipped=2.0 2023-06-22 13:32:17,676 INFO [train.py:996] (2/4) Epoch 6, batch 8850, loss[loss=0.2411, simple_loss=0.3358, pruned_loss=0.07324, over 21630.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.318, pruned_loss=0.07941, over 4286350.14 frames. ], batch size: 389, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:32:47,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=967998.0, ans=0.125 2023-06-22 13:32:52,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-22 13:34:32,307 INFO [train.py:996] (2/4) Epoch 6, batch 8900, loss[loss=0.2163, simple_loss=0.2766, pruned_loss=0.07797, over 21308.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3126, pruned_loss=0.07864, over 4288922.86 frames. ], batch size: 177, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:34:37,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=968238.0, ans=0.125 2023-06-22 13:36:12,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=968418.0, ans=0.0 2023-06-22 13:36:25,711 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 2.633e+02 3.165e+02 3.746e+02 7.673e+02, threshold=6.331e+02, percent-clipped=6.0 2023-06-22 13:36:33,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=968478.0, ans=0.0 2023-06-22 13:36:37,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-22 13:36:50,922 INFO [train.py:996] (2/4) Epoch 6, batch 8950, loss[loss=0.2104, simple_loss=0.2731, pruned_loss=0.07387, over 21395.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3101, pruned_loss=0.07835, over 4277306.14 frames. ], batch size: 194, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:36:51,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=968538.0, ans=0.0 2023-06-22 13:36:53,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=968538.0, ans=0.125 2023-06-22 13:38:47,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-22 13:38:49,269 INFO [train.py:996] (2/4) Epoch 6, batch 9000, loss[loss=0.1889, simple_loss=0.251, pruned_loss=0.06344, over 21620.00 frames. 
], tot_loss[loss=0.2301, simple_loss=0.3042, pruned_loss=0.07801, over 4274833.65 frames. ], batch size: 282, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:38:49,269 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 13:39:41,076 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2635, simple_loss=0.3541, pruned_loss=0.08643, over 1796401.00 frames. 2023-06-22 13:39:41,078 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-22 13:40:33,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-22 13:40:46,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=968958.0, ans=0.125 2023-06-22 13:40:59,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=969018.0, ans=0.125 2023-06-22 13:41:02,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=969018.0, ans=0.2 2023-06-22 13:41:02,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=969018.0, ans=0.125 2023-06-22 13:41:04,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.727e+02 3.183e+02 3.602e+02 6.441e+02, threshold=6.367e+02, percent-clipped=1.0 2023-06-22 13:41:08,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=15.0 2023-06-22 13:41:36,430 INFO [train.py:996] (2/4) Epoch 6, batch 9050, loss[loss=0.2319, simple_loss=0.3112, pruned_loss=0.0763, over 21283.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3006, pruned_loss=0.07511, over 4278224.11 frames. ], batch size: 549, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:42:01,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=969138.0, ans=0.125 2023-06-22 13:42:42,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=969258.0, ans=0.0 2023-06-22 13:43:22,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=969318.0, ans=0.04949747468305833 2023-06-22 13:43:56,168 INFO [train.py:996] (2/4) Epoch 6, batch 9100, loss[loss=0.2489, simple_loss=0.3403, pruned_loss=0.07877, over 21639.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.308, pruned_loss=0.07748, over 4275665.54 frames. 
], batch size: 441, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 13:43:56,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=969438.0, ans=0.0 2023-06-22 13:45:08,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=969558.0, ans=0.2 2023-06-22 13:45:16,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=969558.0, ans=0.0 2023-06-22 13:45:21,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=969618.0, ans=0.125 2023-06-22 13:45:51,243 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 2.427e+02 2.866e+02 3.272e+02 6.065e+02, threshold=5.732e+02, percent-clipped=0.0 2023-06-22 13:46:03,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=969678.0, ans=0.125 2023-06-22 13:46:10,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=969678.0, ans=0.035 2023-06-22 13:46:21,659 INFO [train.py:996] (2/4) Epoch 6, batch 9150, loss[loss=0.2019, simple_loss=0.2877, pruned_loss=0.05806, over 21501.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3119, pruned_loss=0.07588, over 4267804.10 frames. ], batch size: 131, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 13:46:27,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-22 13:46:45,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=969738.0, ans=0.125 2023-06-22 13:47:16,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-22 13:47:29,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=969858.0, ans=0.0 2023-06-22 13:48:25,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=969978.0, ans=0.125 2023-06-22 13:48:25,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=969978.0, ans=0.0 2023-06-22 13:48:31,183 INFO [train.py:996] (2/4) Epoch 6, batch 9200, loss[loss=0.2783, simple_loss=0.3493, pruned_loss=0.1036, over 21324.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3122, pruned_loss=0.07437, over 4261536.03 frames. ], batch size: 548, lr: 5.19e-03, grad_scale: 32.0 2023-06-22 13:49:34,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=970158.0, ans=0.125 2023-06-22 13:49:43,940 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:50:20,609 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.602e+02 3.064e+02 3.617e+02 6.755e+02, threshold=6.128e+02, percent-clipped=8.0 2023-06-22 13:50:39,026 INFO [train.py:996] (2/4) Epoch 6, batch 9250, loss[loss=0.2294, simple_loss=0.2971, pruned_loss=0.08079, over 21448.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3149, pruned_loss=0.0782, over 4268854.13 frames. 
], batch size: 131, lr: 5.19e-03, grad_scale: 32.0 2023-06-22 13:52:33,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=970578.0, ans=0.125 2023-06-22 13:52:47,226 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:52:47,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=970578.0, ans=0.0 2023-06-22 13:52:49,822 INFO [train.py:996] (2/4) Epoch 6, batch 9300, loss[loss=0.2603, simple_loss=0.3251, pruned_loss=0.09775, over 21281.00 frames. ], tot_loss[loss=0.232, simple_loss=0.308, pruned_loss=0.078, over 4276833.24 frames. ], batch size: 471, lr: 5.19e-03, grad_scale: 32.0 2023-06-22 13:54:30,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=970758.0, ans=0.0 2023-06-22 13:54:31,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=970818.0, ans=0.0 2023-06-22 13:54:33,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=970818.0, ans=0.0 2023-06-22 13:54:44,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.639e+02 3.001e+02 3.479e+02 6.527e+02, threshold=6.003e+02, percent-clipped=1.0 2023-06-22 13:54:51,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=970878.0, ans=0.0 2023-06-22 13:55:03,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.60 vs. limit=22.5 2023-06-22 13:55:04,095 INFO [train.py:996] (2/4) Epoch 6, batch 9350, loss[loss=0.2204, simple_loss=0.2994, pruned_loss=0.07066, over 21900.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3144, pruned_loss=0.0786, over 4280631.17 frames. ], batch size: 98, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 13:57:41,553 INFO [train.py:996] (2/4) Epoch 6, batch 9400, loss[loss=0.1924, simple_loss=0.2667, pruned_loss=0.05899, over 21664.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3155, pruned_loss=0.07884, over 4282819.78 frames. ], batch size: 282, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 13:58:31,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-06-22 13:58:35,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=971358.0, ans=0.125 2023-06-22 13:58:48,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=971418.0, ans=0.2 2023-06-22 13:58:56,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=971418.0, ans=0.125 2023-06-22 13:59:15,655 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.533e+02 2.854e+02 3.549e+02 7.944e+02, threshold=5.708e+02, percent-clipped=6.0 2023-06-22 13:59:31,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=971478.0, ans=0.125 2023-06-22 13:59:45,789 INFO [train.py:996] (2/4) Epoch 6, batch 9450, loss[loss=0.2048, simple_loss=0.2691, pruned_loss=0.07028, over 21601.00 frames. 
], tot_loss[loss=0.2317, simple_loss=0.3075, pruned_loss=0.07795, over 4277004.96 frames. ], batch size: 298, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 14:00:01,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.04 vs. limit=22.5 2023-06-22 14:00:23,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-22 14:00:23,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-22 14:01:38,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=971778.0, ans=0.125 2023-06-22 14:02:06,235 INFO [train.py:996] (2/4) Epoch 6, batch 9500, loss[loss=0.2147, simple_loss=0.2758, pruned_loss=0.07681, over 21537.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2987, pruned_loss=0.07574, over 4276975.73 frames. ], batch size: 414, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 14:03:24,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=972018.0, ans=0.2 2023-06-22 14:03:52,203 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.465e+02 2.826e+02 3.384e+02 5.228e+02, threshold=5.653e+02, percent-clipped=0.0 2023-06-22 14:04:25,790 INFO [train.py:996] (2/4) Epoch 6, batch 9550, loss[loss=0.2647, simple_loss=0.3621, pruned_loss=0.08363, over 21638.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3033, pruned_loss=0.07771, over 4279302.29 frames. ], batch size: 414, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 14:04:27,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=972138.0, ans=0.0 2023-06-22 14:04:49,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=972138.0, ans=0.125 2023-06-22 14:06:44,195 INFO [train.py:996] (2/4) Epoch 6, batch 9600, loss[loss=0.2122, simple_loss=0.2886, pruned_loss=0.0679, over 21775.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3066, pruned_loss=0.07893, over 4286356.21 frames. 
], batch size: 112, lr: 5.19e-03, grad_scale: 32.0 2023-06-22 14:06:50,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=972438.0, ans=0.2 2023-06-22 14:07:06,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=972438.0, ans=0.025 2023-06-22 14:07:06,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=972438.0, ans=0.125 2023-06-22 14:07:52,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=972558.0, ans=0.125 2023-06-22 14:07:54,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=972618.0, ans=0.2 2023-06-22 14:08:31,432 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.618e+02 2.907e+02 3.359e+02 5.518e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-22 14:09:09,933 INFO [train.py:996] (2/4) Epoch 6, batch 9650, loss[loss=0.1971, simple_loss=0.271, pruned_loss=0.06155, over 21491.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3089, pruned_loss=0.07882, over 4287235.78 frames. ], batch size: 211, lr: 5.19e-03, grad_scale: 32.0 2023-06-22 14:09:19,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=972738.0, ans=0.125 2023-06-22 14:10:58,810 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=22.5 2023-06-22 14:11:29,906 INFO [train.py:996] (2/4) Epoch 6, batch 9700, loss[loss=0.2237, simple_loss=0.2984, pruned_loss=0.07451, over 21911.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3116, pruned_loss=0.07923, over 4287076.66 frames. ], batch size: 316, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:13:01,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.601e+02 2.900e+02 3.367e+02 7.337e+02, threshold=5.800e+02, percent-clipped=1.0 2023-06-22 14:13:41,732 INFO [train.py:996] (2/4) Epoch 6, batch 9750, loss[loss=0.2574, simple_loss=0.3445, pruned_loss=0.0852, over 21877.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3049, pruned_loss=0.07816, over 4286691.19 frames. ], batch size: 118, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:14:09,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=973398.0, ans=0.09899494936611666 2023-06-22 14:14:52,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=973518.0, ans=0.125 2023-06-22 14:14:54,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=973518.0, ans=0.0 2023-06-22 14:14:57,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=973518.0, ans=0.125 2023-06-22 14:15:38,400 INFO [train.py:996] (2/4) Epoch 6, batch 9800, loss[loss=0.2324, simple_loss=0.2978, pruned_loss=0.08347, over 21786.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3061, pruned_loss=0.07828, over 4287619.45 frames. 
], batch size: 441, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:16:02,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=973638.0, ans=0.0 2023-06-22 14:16:12,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=973698.0, ans=0.125 2023-06-22 14:17:28,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.469e+02 2.952e+02 3.754e+02 9.468e+02, threshold=5.905e+02, percent-clipped=4.0 2023-06-22 14:17:38,971 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:17:42,960 INFO [train.py:996] (2/4) Epoch 6, batch 9850, loss[loss=0.2134, simple_loss=0.2779, pruned_loss=0.07448, over 21442.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3021, pruned_loss=0.07773, over 4292847.95 frames. ], batch size: 131, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:18:13,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=973998.0, ans=0.125 2023-06-22 14:18:34,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=974058.0, ans=0.0 2023-06-22 14:19:07,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=974118.0, ans=0.125 2023-06-22 14:19:18,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=974178.0, ans=0.0 2023-06-22 14:19:44,033 INFO [train.py:996] (2/4) Epoch 6, batch 9900, loss[loss=0.2955, simple_loss=0.3583, pruned_loss=0.1164, over 21381.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2978, pruned_loss=0.07691, over 4282809.23 frames. ], batch size: 471, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:19:44,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=974238.0, ans=0.125 2023-06-22 14:20:19,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=974298.0, ans=0.1 2023-06-22 14:20:31,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=974298.0, ans=0.0 2023-06-22 14:20:59,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=974358.0, ans=0.1 2023-06-22 14:21:11,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-22 14:21:12,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=974418.0, ans=0.0 2023-06-22 14:21:36,179 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.526e+02 2.876e+02 3.339e+02 4.860e+02, threshold=5.753e+02, percent-clipped=0.0 2023-06-22 14:21:46,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=974478.0, ans=0.125 2023-06-22 14:21:55,174 INFO [train.py:996] (2/4) Epoch 6, batch 9950, loss[loss=0.2371, simple_loss=0.3087, pruned_loss=0.08274, over 21424.00 frames. 
], tot_loss[loss=0.229, simple_loss=0.2992, pruned_loss=0.07937, over 4259726.64 frames. ], batch size: 131, lr: 5.18e-03, grad_scale: 16.0 2023-06-22 14:22:34,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=974598.0, ans=0.0 2023-06-22 14:23:40,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-22 14:23:42,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=974718.0, ans=0.125 2023-06-22 14:23:44,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=974718.0, ans=0.0 2023-06-22 14:24:22,815 INFO [train.py:996] (2/4) Epoch 6, batch 10000, loss[loss=0.2183, simple_loss=0.2958, pruned_loss=0.07039, over 21762.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2949, pruned_loss=0.07787, over 4262430.51 frames. ], batch size: 352, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:24:32,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=974838.0, ans=0.025 2023-06-22 14:24:37,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=974898.0, ans=0.125 2023-06-22 14:25:07,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=22.5 2023-06-22 14:25:16,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=974958.0, ans=0.5 2023-06-22 14:25:16,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=974958.0, ans=0.1 2023-06-22 14:25:38,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=974958.0, ans=0.95 2023-06-22 14:25:56,000 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 2.535e+02 2.869e+02 3.521e+02 5.167e+02, threshold=5.738e+02, percent-clipped=0.0 2023-06-22 14:26:30,278 INFO [train.py:996] (2/4) Epoch 6, batch 10050, loss[loss=0.1922, simple_loss=0.2634, pruned_loss=0.06051, over 21717.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2969, pruned_loss=0.07829, over 4264590.86 frames. ], batch size: 282, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:26:47,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=975198.0, ans=0.2 2023-06-22 14:26:48,164 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.63 vs. limit=10.0 2023-06-22 14:28:37,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=975378.0, ans=0.2 2023-06-22 14:28:50,958 INFO [train.py:996] (2/4) Epoch 6, batch 10100, loss[loss=0.2275, simple_loss=0.2972, pruned_loss=0.07894, over 21609.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2945, pruned_loss=0.07645, over 4258583.48 frames. 
], batch size: 230, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:29:02,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-06-22 14:29:29,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=975498.0, ans=0.04949747468305833 2023-06-22 14:30:06,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-22 14:30:41,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.495e+02 2.944e+02 3.852e+02 6.344e+02, threshold=5.889e+02, percent-clipped=1.0 2023-06-22 14:30:55,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=975738.0, ans=0.125 2023-06-22 14:30:56,708 INFO [train.py:996] (2/4) Epoch 6, batch 10150, loss[loss=0.2258, simple_loss=0.2988, pruned_loss=0.07639, over 21655.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3019, pruned_loss=0.07911, over 4263474.51 frames. ], batch size: 332, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:31:12,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=15.0 2023-06-22 14:32:53,302 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:33:06,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=976038.0, ans=0.125 2023-06-22 14:33:07,626 INFO [train.py:996] (2/4) Epoch 6, batch 10200, loss[loss=0.2176, simple_loss=0.2957, pruned_loss=0.06973, over 21001.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3002, pruned_loss=0.07645, over 4266831.00 frames. ], batch size: 607, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:33:08,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=976038.0, ans=0.95 2023-06-22 14:33:16,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=976038.0, ans=0.125 2023-06-22 14:33:41,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=976098.0, ans=0.0 2023-06-22 14:35:02,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=976278.0, ans=0.0 2023-06-22 14:35:04,010 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 2.209e+02 2.586e+02 3.021e+02 4.237e+02, threshold=5.173e+02, percent-clipped=0.0 2023-06-22 14:35:10,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=976278.0, ans=0.125 2023-06-22 14:35:18,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=976338.0, ans=0.1 2023-06-22 14:35:19,327 INFO [train.py:996] (2/4) Epoch 6, batch 10250, loss[loss=0.1933, simple_loss=0.2839, pruned_loss=0.05136, over 21378.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2955, pruned_loss=0.07123, over 4257419.94 frames. 
], batch size: 211, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:35:24,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=976338.0, ans=0.125 2023-06-22 14:36:33,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=976458.0, ans=0.1 2023-06-22 14:37:31,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=976578.0, ans=0.125 2023-06-22 14:37:41,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-06-22 14:37:49,250 INFO [train.py:996] (2/4) Epoch 6, batch 10300, loss[loss=0.2475, simple_loss=0.3329, pruned_loss=0.08105, over 21426.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2995, pruned_loss=0.07293, over 4267384.18 frames. ], batch size: 194, lr: 5.18e-03, grad_scale: 32.0 2023-06-22 14:39:11,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=976818.0, ans=0.125 2023-06-22 14:39:23,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=976878.0, ans=0.0 2023-06-22 14:39:23,999 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 2.460e+02 2.831e+02 3.476e+02 5.397e+02, threshold=5.661e+02, percent-clipped=1.0 2023-06-22 14:39:24,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=976878.0, ans=0.125 2023-06-22 14:39:45,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=976878.0, ans=0.125 2023-06-22 14:40:01,748 INFO [train.py:996] (2/4) Epoch 6, batch 10350, loss[loss=0.2147, simple_loss=0.301, pruned_loss=0.06422, over 21838.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2995, pruned_loss=0.07278, over 4270947.99 frames. ], batch size: 372, lr: 5.17e-03, grad_scale: 16.0 2023-06-22 14:41:29,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-06-22 14:42:16,431 INFO [train.py:996] (2/4) Epoch 6, batch 10400, loss[loss=0.1606, simple_loss=0.2105, pruned_loss=0.0554, over 21107.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2929, pruned_loss=0.07132, over 4271711.24 frames. ], batch size: 143, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 14:43:11,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-06-22 14:43:12,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. 
limit=6.0 2023-06-22 14:44:03,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=977478.0, ans=0.04949747468305833 2023-06-22 14:44:05,936 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.727e+02 3.248e+02 3.919e+02 5.926e+02, threshold=6.497e+02, percent-clipped=3.0 2023-06-22 14:44:14,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=977478.0, ans=0.125 2023-06-22 14:44:37,198 INFO [train.py:996] (2/4) Epoch 6, batch 10450, loss[loss=0.2365, simple_loss=0.3204, pruned_loss=0.07631, over 21825.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2974, pruned_loss=0.07401, over 4262513.52 frames. ], batch size: 316, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 14:45:21,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=977658.0, ans=0.125 2023-06-22 14:46:28,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=977778.0, ans=0.0 2023-06-22 14:46:35,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=977778.0, ans=0.125 2023-06-22 14:46:46,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=977778.0, ans=0.0 2023-06-22 14:46:55,129 INFO [train.py:996] (2/4) Epoch 6, batch 10500, loss[loss=0.2254, simple_loss=0.2991, pruned_loss=0.07592, over 21811.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2961, pruned_loss=0.07278, over 4254879.75 frames. ], batch size: 98, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 14:47:18,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=977838.0, ans=0.125 2023-06-22 14:47:22,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=977898.0, ans=0.2 2023-06-22 14:47:28,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=977898.0, ans=0.125 2023-06-22 14:48:31,848 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.295e+02 2.560e+02 3.007e+02 4.435e+02, threshold=5.120e+02, percent-clipped=0.0 2023-06-22 14:49:11,524 INFO [train.py:996] (2/4) Epoch 6, batch 10550, loss[loss=0.2045, simple_loss=0.2685, pruned_loss=0.07022, over 21662.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.292, pruned_loss=0.07283, over 4250192.67 frames. ], batch size: 333, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 14:49:13,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-22 14:49:17,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=978138.0, ans=0.125 2023-06-22 14:49:18,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=15.0 2023-06-22 14:50:16,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=978258.0, ans=0.2 2023-06-22 14:50:59,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=978378.0, ans=0.125 2023-06-22 14:51:24,203 INFO [train.py:996] (2/4) Epoch 6, batch 10600, loss[loss=0.1975, simple_loss=0.2721, pruned_loss=0.06146, over 21737.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2872, pruned_loss=0.07107, over 4244613.28 frames. ], batch size: 316, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 14:51:29,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. limit=10.0 2023-06-22 14:51:39,567 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:52:42,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=978618.0, ans=0.125 2023-06-22 14:53:21,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.439e+02 2.926e+02 3.580e+02 7.545e+02, threshold=5.851e+02, percent-clipped=4.0 2023-06-22 14:53:38,942 INFO [train.py:996] (2/4) Epoch 6, batch 10650, loss[loss=0.1802, simple_loss=0.2639, pruned_loss=0.04823, over 21804.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2893, pruned_loss=0.06954, over 4255971.64 frames. ], batch size: 317, lr: 5.17e-03, grad_scale: 16.0 2023-06-22 14:53:50,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=978738.0, ans=0.125 2023-06-22 14:54:11,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=978798.0, ans=0.2 2023-06-22 14:54:36,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=978858.0, ans=0.0 2023-06-22 14:55:04,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=978918.0, ans=0.1 2023-06-22 14:55:17,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=978918.0, ans=0.0 2023-06-22 14:55:47,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=978978.0, ans=0.0 2023-06-22 14:55:49,533 INFO [train.py:996] (2/4) Epoch 6, batch 10700, loss[loss=0.2724, simple_loss=0.3369, pruned_loss=0.1039, over 21698.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2886, pruned_loss=0.06946, over 4256547.43 frames. ], batch size: 441, lr: 5.17e-03, grad_scale: 16.0 2023-06-22 14:57:05,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-06-22 14:58:02,680 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.582e+02 2.877e+02 3.268e+02 5.588e+02, threshold=5.755e+02, percent-clipped=0.0 2023-06-22 14:58:08,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=979278.0, ans=0.125 2023-06-22 14:58:17,282 INFO [train.py:996] (2/4) Epoch 6, batch 10750, loss[loss=0.2525, simple_loss=0.3486, pruned_loss=0.07824, over 21407.00 frames. 
], tot_loss[loss=0.2244, simple_loss=0.2999, pruned_loss=0.07442, over 4255852.92 frames. ], batch size: 211, lr: 5.17e-03, grad_scale: 16.0 2023-06-22 14:58:28,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-22 14:58:38,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=979338.0, ans=0.2 2023-06-22 14:58:39,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=979338.0, ans=0.2 2023-06-22 14:58:45,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=979398.0, ans=0.125 2023-06-22 14:59:24,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=979458.0, ans=0.125 2023-06-22 15:00:03,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-22 15:00:52,824 INFO [train.py:996] (2/4) Epoch 6, batch 10800, loss[loss=0.2742, simple_loss=0.3922, pruned_loss=0.07805, over 19848.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3041, pruned_loss=0.07527, over 4252356.86 frames. ], batch size: 702, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 15:00:59,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=979638.0, ans=0.0 2023-06-22 15:01:02,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=979638.0, ans=0.0 2023-06-22 15:01:17,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=979638.0, ans=0.125 2023-06-22 15:01:37,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-22 15:02:00,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=979758.0, ans=0.125 2023-06-22 15:02:06,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=979758.0, ans=0.125 2023-06-22 15:02:44,265 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.781e+02 3.238e+02 4.047e+02 6.056e+02, threshold=6.476e+02, percent-clipped=1.0 2023-06-22 15:03:16,962 INFO [train.py:996] (2/4) Epoch 6, batch 10850, loss[loss=0.1862, simple_loss=0.2538, pruned_loss=0.05931, over 20700.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.307, pruned_loss=0.07616, over 4255394.81 frames. 
], batch size: 608, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 15:03:17,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=979938.0, ans=0.05 2023-06-22 15:03:21,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=979938.0, ans=0.125 2023-06-22 15:04:21,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=980058.0, ans=0.125 2023-06-22 15:04:31,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=980118.0, ans=0.2 2023-06-22 15:04:59,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=980178.0, ans=0.2 2023-06-22 15:05:23,925 INFO [train.py:996] (2/4) Epoch 6, batch 10900, loss[loss=0.1986, simple_loss=0.2697, pruned_loss=0.06371, over 21212.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3001, pruned_loss=0.07464, over 4253417.20 frames. ], batch size: 143, lr: 5.17e-03, grad_scale: 32.0 2023-06-22 15:06:21,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=980358.0, ans=0.0 2023-06-22 15:06:27,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=980358.0, ans=0.125 2023-06-22 15:06:32,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=980358.0, ans=0.125 2023-06-22 15:07:09,954 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.391e+02 2.681e+02 3.118e+02 5.164e+02, threshold=5.361e+02, percent-clipped=0.0 2023-06-22 15:07:34,780 INFO [train.py:996] (2/4) Epoch 6, batch 10950, loss[loss=0.1791, simple_loss=0.2508, pruned_loss=0.05374, over 21618.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2938, pruned_loss=0.07215, over 4247972.27 frames. ], batch size: 298, lr: 5.16e-03, grad_scale: 32.0 2023-06-22 15:07:53,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=980538.0, ans=0.125 2023-06-22 15:07:53,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=980538.0, ans=15.0 2023-06-22 15:09:00,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=980718.0, ans=0.1 2023-06-22 15:09:20,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-22 15:09:27,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=980778.0, ans=0.07 2023-06-22 15:09:32,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=980778.0, ans=0.0 2023-06-22 15:10:01,902 INFO [train.py:996] (2/4) Epoch 6, batch 11000, loss[loss=0.2261, simple_loss=0.29, pruned_loss=0.08115, over 21534.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2923, pruned_loss=0.07219, over 4253687.28 frames. 
], batch size: 195, lr: 5.16e-03, grad_scale: 16.0 2023-06-22 15:10:10,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-22 15:11:17,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=981018.0, ans=0.0 2023-06-22 15:11:39,798 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.522e+02 2.854e+02 3.360e+02 5.643e+02, threshold=5.707e+02, percent-clipped=1.0 2023-06-22 15:11:55,626 INFO [train.py:996] (2/4) Epoch 6, batch 11050, loss[loss=0.2151, simple_loss=0.2735, pruned_loss=0.07836, over 21673.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2898, pruned_loss=0.07345, over 4260608.56 frames. ], batch size: 393, lr: 5.16e-03, grad_scale: 16.0 2023-06-22 15:13:04,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-22 15:13:17,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=981258.0, ans=0.125 2023-06-22 15:13:23,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=981318.0, ans=0.0 2023-06-22 15:13:30,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=981318.0, ans=0.2 2023-06-22 15:13:39,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=981378.0, ans=0.07 2023-06-22 15:13:43,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=981378.0, ans=0.1 2023-06-22 15:14:05,803 INFO [train.py:996] (2/4) Epoch 6, batch 11100, loss[loss=0.2108, simple_loss=0.2783, pruned_loss=0.07164, over 21503.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2892, pruned_loss=0.07361, over 4251156.21 frames. ], batch size: 195, lr: 5.16e-03, grad_scale: 16.0 2023-06-22 15:15:19,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=981558.0, ans=0.05 2023-06-22 15:15:19,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=981558.0, ans=0.1 2023-06-22 15:15:19,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-22 15:15:32,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=981618.0, ans=0.125 2023-06-22 15:15:43,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=981618.0, ans=0.2 2023-06-22 15:15:51,519 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.464e+02 2.836e+02 3.391e+02 6.300e+02, threshold=5.672e+02, percent-clipped=1.0 2023-06-22 15:16:19,349 INFO [train.py:996] (2/4) Epoch 6, batch 11150, loss[loss=0.2476, simple_loss=0.3348, pruned_loss=0.08023, over 20656.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2885, pruned_loss=0.07394, over 4258760.13 frames. 
], batch size: 607, lr: 5.16e-03, grad_scale: 16.0 2023-06-22 15:17:43,931 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.75 vs. limit=6.0 2023-06-22 15:17:57,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=981918.0, ans=0.0 2023-06-22 15:18:20,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=981978.0, ans=0.125 2023-06-22 15:18:36,073 INFO [train.py:996] (2/4) Epoch 6, batch 11200, loss[loss=0.2127, simple_loss=0.2769, pruned_loss=0.07424, over 21541.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2877, pruned_loss=0.07346, over 4267268.16 frames. ], batch size: 414, lr: 5.16e-03, grad_scale: 32.0 2023-06-22 15:20:09,205 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-22 15:20:17,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=982278.0, ans=0.09899494936611666 2023-06-22 15:20:21,519 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.424e+02 2.651e+02 3.050e+02 4.953e+02, threshold=5.302e+02, percent-clipped=0.0 2023-06-22 15:20:44,986 INFO [train.py:996] (2/4) Epoch 6, batch 11250, loss[loss=0.2228, simple_loss=0.2897, pruned_loss=0.07795, over 20167.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2871, pruned_loss=0.07312, over 4266156.25 frames. ], batch size: 702, lr: 5.16e-03, grad_scale: 32.0 2023-06-22 15:20:47,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=982338.0, ans=0.0 2023-06-22 15:20:51,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=982338.0, ans=0.1 2023-06-22 15:21:47,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=982398.0, ans=0.09899494936611666 2023-06-22 15:22:14,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=982518.0, ans=0.125 2023-06-22 15:22:53,994 INFO [train.py:996] (2/4) Epoch 6, batch 11300, loss[loss=0.2085, simple_loss=0.2827, pruned_loss=0.06721, over 21918.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2892, pruned_loss=0.07357, over 4271016.10 frames. ], batch size: 316, lr: 5.16e-03, grad_scale: 32.0 2023-06-22 15:22:57,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=982638.0, ans=0.125 2023-06-22 15:23:22,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=982638.0, ans=0.125 2023-06-22 15:24:23,110 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:24:48,792 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.408e+02 2.703e+02 3.080e+02 4.144e+02, threshold=5.407e+02, percent-clipped=0.0 2023-06-22 15:25:19,002 INFO [train.py:996] (2/4) Epoch 6, batch 11350, loss[loss=0.1872, simple_loss=0.2696, pruned_loss=0.05237, over 21457.00 frames. 
], tot_loss[loss=0.2177, simple_loss=0.2902, pruned_loss=0.07261, over 4268196.25 frames. ], batch size: 195, lr: 5.16e-03, grad_scale: 32.0 2023-06-22 15:25:35,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-22 15:25:44,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=982938.0, ans=0.125 2023-06-22 15:25:55,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=982998.0, ans=0.125 2023-06-22 15:25:57,960 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-22 15:26:14,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-06-22 15:26:20,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983058.0, ans=0.1 2023-06-22 15:26:56,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=983178.0, ans=0.0 2023-06-22 15:26:57,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=22.5 2023-06-22 15:27:42,752 INFO [train.py:996] (2/4) Epoch 6, batch 11400, loss[loss=0.2168, simple_loss=0.2943, pruned_loss=0.06961, over 21359.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.297, pruned_loss=0.07599, over 4269911.14 frames. ], batch size: 194, lr: 5.16e-03, grad_scale: 32.0 2023-06-22 15:28:27,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-22 15:28:44,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=983418.0, ans=0.125 2023-06-22 15:28:47,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=983418.0, ans=0.125 2023-06-22 15:29:09,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=983418.0, ans=0.0 2023-06-22 15:29:37,574 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.455e+02 2.814e+02 3.249e+02 4.711e+02, threshold=5.629e+02, percent-clipped=0.0 2023-06-22 15:29:54,451 INFO [train.py:996] (2/4) Epoch 6, batch 11450, loss[loss=0.2249, simple_loss=0.3015, pruned_loss=0.0741, over 21585.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2979, pruned_loss=0.07466, over 4271486.24 frames. ], batch size: 263, lr: 5.16e-03, grad_scale: 32.0 2023-06-22 15:30:21,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983538.0, ans=0.1 2023-06-22 15:30:22,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=983538.0, ans=0.125 2023-06-22 15:30:41,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.17 vs. 
limit=15.0 2023-06-22 15:31:23,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=983718.0, ans=0.125 2023-06-22 15:31:29,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=983718.0, ans=0.1 2023-06-22 15:32:07,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=983778.0, ans=0.125 2023-06-22 15:32:10,119 INFO [train.py:996] (2/4) Epoch 6, batch 11500, loss[loss=0.2465, simple_loss=0.3293, pruned_loss=0.08186, over 21755.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3013, pruned_loss=0.07612, over 4276571.39 frames. ], batch size: 124, lr: 5.16e-03, grad_scale: 32.0 2023-06-22 15:32:55,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=983958.0, ans=0.0 2023-06-22 15:33:43,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-22 15:34:04,705 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.591e+02 3.004e+02 3.598e+02 5.267e+02, threshold=6.007e+02, percent-clipped=0.0 2023-06-22 15:34:39,612 INFO [train.py:996] (2/4) Epoch 6, batch 11550, loss[loss=0.3949, simple_loss=0.477, pruned_loss=0.1564, over 21489.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3089, pruned_loss=0.07677, over 4276883.37 frames. ], batch size: 507, lr: 5.16e-03, grad_scale: 32.0 2023-06-22 15:35:57,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=984258.0, ans=0.125 2023-06-22 15:35:57,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=984258.0, ans=0.1 2023-06-22 15:37:01,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=984378.0, ans=0.07 2023-06-22 15:37:01,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=984378.0, ans=0.125 2023-06-22 15:37:12,899 INFO [train.py:996] (2/4) Epoch 6, batch 11600, loss[loss=0.2442, simple_loss=0.3399, pruned_loss=0.07426, over 21343.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3218, pruned_loss=0.07822, over 4271786.49 frames. ], batch size: 131, lr: 5.15e-03, grad_scale: 32.0 2023-06-22 15:37:47,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-06-22 15:39:22,406 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.936e+02 3.532e+02 4.287e+02 8.204e+02, threshold=7.063e+02, percent-clipped=1.0 2023-06-22 15:39:24,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=984678.0, ans=0.2 2023-06-22 15:39:33,493 INFO [train.py:996] (2/4) Epoch 6, batch 11650, loss[loss=0.2468, simple_loss=0.3272, pruned_loss=0.08315, over 21448.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3284, pruned_loss=0.07912, over 4270908.34 frames. 
], batch size: 211, lr: 5.15e-03, grad_scale: 32.0 2023-06-22 15:39:59,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=984798.0, ans=0.0 2023-06-22 15:40:26,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=984858.0, ans=0.07 2023-06-22 15:41:20,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=22.5 2023-06-22 15:41:29,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=984978.0, ans=0.0 2023-06-22 15:41:40,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=984978.0, ans=0.125 2023-06-22 15:41:46,241 INFO [train.py:996] (2/4) Epoch 6, batch 11700, loss[loss=0.2266, simple_loss=0.2819, pruned_loss=0.08566, over 21881.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3192, pruned_loss=0.07849, over 4271322.84 frames. ], batch size: 373, lr: 5.15e-03, grad_scale: 32.0 2023-06-22 15:43:28,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-22 15:43:30,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-22 15:43:37,769 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0 2023-06-22 15:43:45,798 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.631e+02 2.879e+02 3.567e+02 8.503e+02, threshold=5.757e+02, percent-clipped=1.0 2023-06-22 15:43:54,598 INFO [train.py:996] (2/4) Epoch 6, batch 11750, loss[loss=0.2796, simple_loss=0.33, pruned_loss=0.1146, over 21367.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3107, pruned_loss=0.07744, over 4257301.91 frames. ], batch size: 471, lr: 5.15e-03, grad_scale: 16.0 2023-06-22 15:44:15,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=985338.0, ans=0.125 2023-06-22 15:45:22,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=985458.0, ans=0.1 2023-06-22 15:46:36,584 INFO [train.py:996] (2/4) Epoch 6, batch 11800, loss[loss=0.2547, simple_loss=0.3295, pruned_loss=0.08994, over 21504.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3111, pruned_loss=0.07895, over 4258798.94 frames. ], batch size: 131, lr: 5.15e-03, grad_scale: 16.0 2023-06-22 15:46:57,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=22.5 2023-06-22 15:48:03,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=985758.0, ans=0.2 2023-06-22 15:48:37,797 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.436e+02 2.794e+02 3.118e+02 5.023e+02, threshold=5.587e+02, percent-clipped=0.0 2023-06-22 15:48:58,512 INFO [train.py:996] (2/4) Epoch 6, batch 11850, loss[loss=0.2149, simple_loss=0.3123, pruned_loss=0.05868, over 21818.00 frames. 
], tot_loss[loss=0.2346, simple_loss=0.313, pruned_loss=0.07815, over 4259325.47 frames. ], batch size: 371, lr: 5.15e-03, grad_scale: 16.0 2023-06-22 15:48:59,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.37 vs. limit=10.0 2023-06-22 15:50:27,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=986058.0, ans=0.125 2023-06-22 15:50:37,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=986118.0, ans=0.125 2023-06-22 15:51:24,772 INFO [train.py:996] (2/4) Epoch 6, batch 11900, loss[loss=0.2447, simple_loss=0.3418, pruned_loss=0.07381, over 19726.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3146, pruned_loss=0.07604, over 4258106.51 frames. ], batch size: 702, lr: 5.15e-03, grad_scale: 16.0 2023-06-22 15:52:21,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=986298.0, ans=0.125 2023-06-22 15:52:52,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-22 15:53:05,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=986418.0, ans=0.025 2023-06-22 15:53:09,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=986478.0, ans=0.125 2023-06-22 15:53:13,470 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.309e+02 2.613e+02 2.997e+02 4.619e+02, threshold=5.227e+02, percent-clipped=0.0 2023-06-22 15:53:17,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=986478.0, ans=0.1 2023-06-22 15:53:41,841 INFO [train.py:996] (2/4) Epoch 6, batch 11950, loss[loss=0.2308, simple_loss=0.3324, pruned_loss=0.06459, over 21668.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3125, pruned_loss=0.0725, over 4262468.76 frames. ], batch size: 247, lr: 5.15e-03, grad_scale: 16.0 2023-06-22 15:55:35,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=986778.0, ans=0.1 2023-06-22 15:55:52,248 INFO [train.py:996] (2/4) Epoch 6, batch 12000, loss[loss=0.2055, simple_loss=0.2692, pruned_loss=0.0709, over 21570.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3077, pruned_loss=0.07111, over 4255638.53 frames. ], batch size: 263, lr: 5.15e-03, grad_scale: 32.0 2023-06-22 15:55:52,248 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 15:56:35,832 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2631, simple_loss=0.3525, pruned_loss=0.08686, over 1796401.00 frames. 
2023-06-22 15:56:35,833 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24283MB 2023-06-22 15:57:23,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=986898.0, ans=0.0 2023-06-22 15:57:46,679 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:58:12,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=987078.0, ans=0.1 2023-06-22 15:58:18,394 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.546e+02 3.142e+02 3.620e+02 6.312e+02, threshold=6.283e+02, percent-clipped=4.0 2023-06-22 15:58:41,703 INFO [train.py:996] (2/4) Epoch 6, batch 12050, loss[loss=0.2253, simple_loss=0.2935, pruned_loss=0.07855, over 21646.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3046, pruned_loss=0.07331, over 4251723.74 frames. ], batch size: 195, lr: 5.15e-03, grad_scale: 32.0 2023-06-22 15:58:53,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=987138.0, ans=0.125 2023-06-22 15:58:58,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-22 15:59:23,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-22 15:59:24,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=987198.0, ans=0.1 2023-06-22 15:59:30,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=987258.0, ans=0.1 2023-06-22 15:59:36,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=987258.0, ans=0.125 2023-06-22 15:59:36,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=987258.0, ans=0.2 2023-06-22 16:00:53,305 INFO [train.py:996] (2/4) Epoch 6, batch 12100, loss[loss=0.2767, simple_loss=0.3582, pruned_loss=0.09763, over 21369.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3082, pruned_loss=0.07746, over 4258523.11 frames. ], batch size: 131, lr: 5.15e-03, grad_scale: 32.0 2023-06-22 16:01:28,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=987498.0, ans=0.125 2023-06-22 16:03:01,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=987678.0, ans=0.2 2023-06-22 16:03:13,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=987678.0, ans=0.125 2023-06-22 16:03:14,373 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.726e+02 3.141e+02 3.562e+02 5.633e+02, threshold=6.281e+02, percent-clipped=0.0 2023-06-22 16:03:33,319 INFO [train.py:996] (2/4) Epoch 6, batch 12150, loss[loss=0.2278, simple_loss=0.3176, pruned_loss=0.06895, over 21852.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3114, pruned_loss=0.07691, over 4254861.51 frames. 
], batch size: 316, lr: 5.15e-03, grad_scale: 32.0 2023-06-22 16:04:46,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=987858.0, ans=0.125 2023-06-22 16:04:54,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=987858.0, ans=0.125 2023-06-22 16:05:00,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=987918.0, ans=0.0 2023-06-22 16:05:50,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=987978.0, ans=0.125 2023-06-22 16:05:52,881 INFO [train.py:996] (2/4) Epoch 6, batch 12200, loss[loss=0.2626, simple_loss=0.3056, pruned_loss=0.1098, over 21353.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3073, pruned_loss=0.07664, over 4260300.67 frames. ], batch size: 508, lr: 5.15e-03, grad_scale: 32.0 2023-06-22 16:05:56,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=988038.0, ans=0.0 2023-06-22 16:05:57,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=988038.0, ans=0.125 2023-06-22 16:06:39,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=988158.0, ans=0.125 2023-06-22 16:06:42,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=988158.0, ans=0.125 2023-06-22 16:07:46,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=988278.0, ans=0.125 2023-06-22 16:07:56,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 2.303e+02 2.808e+02 3.526e+02 6.344e+02, threshold=5.616e+02, percent-clipped=1.0 2023-06-22 16:08:03,853 INFO [train.py:996] (2/4) Epoch 6, batch 12250, loss[loss=0.1566, simple_loss=0.2295, pruned_loss=0.04187, over 21757.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2992, pruned_loss=0.07363, over 4265622.86 frames. ], batch size: 112, lr: 5.14e-03, grad_scale: 16.0 2023-06-22 16:08:04,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=988338.0, ans=0.125 2023-06-22 16:09:26,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=988518.0, ans=0.125 2023-06-22 16:09:46,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=988578.0, ans=0.125 2023-06-22 16:10:13,424 INFO [train.py:996] (2/4) Epoch 6, batch 12300, loss[loss=0.1699, simple_loss=0.2432, pruned_loss=0.0483, over 21153.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2913, pruned_loss=0.06786, over 4266571.24 frames. 
], batch size: 143, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:10:15,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=988638.0, ans=0.125
2023-06-22 16:11:02,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=988758.0, ans=10.0
2023-06-22 16:11:30,209 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 16:12:13,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=988878.0, ans=0.125
2023-06-22 16:12:28,479 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.976e+02 2.467e+02 3.000e+02 5.737e+02, threshold=4.934e+02, percent-clipped=1.0
2023-06-22 16:12:34,333 INFO [train.py:996] (2/4) Epoch 6, batch 12350, loss[loss=0.259, simple_loss=0.3377, pruned_loss=0.09017, over 21720.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2982, pruned_loss=0.0695, over 4275489.01 frames. ], batch size: 389, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:12:49,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=988998.0, ans=0.0
2023-06-22 16:13:37,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=989058.0, ans=0.0
2023-06-22 16:14:01,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=989118.0, ans=0.0
2023-06-22 16:14:36,938 INFO [train.py:996] (2/4) Epoch 6, batch 12400, loss[loss=0.2108, simple_loss=0.2903, pruned_loss=0.06568, over 21832.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3003, pruned_loss=0.07304, over 4276374.97 frames. ], batch size: 298, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:14:44,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=989238.0, ans=0.125
2023-06-22 16:15:12,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=989298.0, ans=0.125
2023-06-22 16:16:01,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=989358.0, ans=0.125
2023-06-22 16:16:45,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.525e+02 2.978e+02 3.587e+02 4.939e+02, threshold=5.956e+02, percent-clipped=1.0
2023-06-22 16:16:51,791 INFO [train.py:996] (2/4) Epoch 6, batch 12450, loss[loss=0.2638, simple_loss=0.3337, pruned_loss=0.09692, over 21605.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3031, pruned_loss=0.07613, over 4280735.64 frames. ], batch size: 389, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:17:51,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=989658.0, ans=0.125
2023-06-22 16:18:56,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=989718.0, ans=0.125
2023-06-22 16:18:58,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.67 vs. limit=15.0
2023-06-22 16:19:14,377 INFO [train.py:996] (2/4) Epoch 6, batch 12500, loss[loss=0.2589, simple_loss=0.3568, pruned_loss=0.08054, over 21643.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3146, pruned_loss=0.07974, over 4279321.59 frames. ], batch size: 263, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:21:38,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5
2023-06-22 16:21:38,558 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.650e+02 2.940e+02 3.345e+02 4.530e+02, threshold=5.879e+02, percent-clipped=0.0
2023-06-22 16:22:04,951 INFO [train.py:996] (2/4) Epoch 6, batch 12550, loss[loss=0.1837, simple_loss=0.2188, pruned_loss=0.07429, over 19991.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3173, pruned_loss=0.08149, over 4276479.30 frames. ], batch size: 703, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:22:07,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=990138.0, ans=0.125
2023-06-22 16:23:01,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=990258.0, ans=0.125
2023-06-22 16:23:06,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0
2023-06-22 16:24:22,783 INFO [train.py:996] (2/4) Epoch 6, batch 12600, loss[loss=0.2049, simple_loss=0.2911, pruned_loss=0.05929, over 21621.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3155, pruned_loss=0.07824, over 4274486.93 frames. ], batch size: 230, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:25:00,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=990498.0, ans=0.05
2023-06-22 16:25:34,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=990618.0, ans=0.125
2023-06-22 16:26:02,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=990678.0, ans=0.0
2023-06-22 16:26:21,259 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.398e+02 2.729e+02 3.427e+02 5.536e+02, threshold=5.458e+02, percent-clipped=0.0
2023-06-22 16:26:33,021 INFO [train.py:996] (2/4) Epoch 6, batch 12650, loss[loss=0.2115, simple_loss=0.2845, pruned_loss=0.0693, over 21883.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3067, pruned_loss=0.07412, over 4275531.37 frames. ], batch size: 316, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:26:34,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=990738.0, ans=15.0
2023-06-22 16:27:09,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=990798.0, ans=0.2
2023-06-22 16:27:43,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=990918.0, ans=0.1
2023-06-22 16:28:44,050 INFO [train.py:996] (2/4) Epoch 6, batch 12700, loss[loss=0.2243, simple_loss=0.2944, pruned_loss=0.07711, over 21438.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3076, pruned_loss=0.07673, over 4278279.60 frames. ], batch size: 211, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:29:04,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=991038.0, ans=0.125
2023-06-22 16:29:59,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.89 vs. limit=22.5
2023-06-22 16:30:48,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.637e+02 2.986e+02 3.360e+02 4.638e+02, threshold=5.971e+02, percent-clipped=0.0
2023-06-22 16:31:04,989 INFO [train.py:996] (2/4) Epoch 6, batch 12750, loss[loss=0.2238, simple_loss=0.3013, pruned_loss=0.07314, over 20075.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3089, pruned_loss=0.07718, over 4274532.68 frames. ], batch size: 702, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:32:09,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=991458.0, ans=0.0
2023-06-22 16:33:00,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=991578.0, ans=0.125
2023-06-22 16:33:13,754 INFO [train.py:996] (2/4) Epoch 6, batch 12800, loss[loss=0.2916, simple_loss=0.3408, pruned_loss=0.1212, over 21618.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3078, pruned_loss=0.07754, over 4276612.25 frames. ], batch size: 508, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:33:45,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=991698.0, ans=0.025
2023-06-22 16:34:26,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=991758.0, ans=0.125
2023-06-22 16:35:20,956 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.507e+02 2.757e+02 3.011e+02 5.577e+02, threshold=5.515e+02, percent-clipped=0.0
2023-06-22 16:35:25,517 INFO [train.py:996] (2/4) Epoch 6, batch 12850, loss[loss=0.235, simple_loss=0.3069, pruned_loss=0.08155, over 20674.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3103, pruned_loss=0.07868, over 4278044.41 frames. ], batch size: 607, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:35:25,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=991938.0, ans=0.2
2023-06-22 16:35:26,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=991938.0, ans=0.2
2023-06-22 16:36:26,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=992058.0, ans=0.0
2023-06-22 16:36:28,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=992058.0, ans=0.125
2023-06-22 16:37:40,704 INFO [train.py:996] (2/4) Epoch 6, batch 12900, loss[loss=0.1954, simple_loss=0.2708, pruned_loss=0.06002, over 21184.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3086, pruned_loss=0.0751, over 4278754.87 frames. ], batch size: 159, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:38:17,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992298.0, ans=0.1
2023-06-22 16:38:18,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=992298.0, ans=0.125
2023-06-22 16:38:48,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0
2023-06-22 16:39:08,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992418.0, ans=0.1
2023-06-22 16:39:47,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=992478.0, ans=0.125
2023-06-22 16:39:50,238 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.250e+02 2.553e+02 2.829e+02 4.968e+02, threshold=5.106e+02, percent-clipped=0.0
2023-06-22 16:39:54,887 INFO [train.py:996] (2/4) Epoch 6, batch 12950, loss[loss=0.2001, simple_loss=0.2821, pruned_loss=0.05901, over 21696.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3067, pruned_loss=0.07385, over 4278331.54 frames. ], batch size: 332, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:40:21,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=992538.0, ans=0.125
2023-06-22 16:40:26,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=992598.0, ans=0.125
2023-06-22 16:40:45,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=992598.0, ans=0.2
2023-06-22 16:41:21,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992718.0, ans=0.1
2023-06-22 16:41:38,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992718.0, ans=0.1
2023-06-22 16:41:55,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=992778.0, ans=0.0
2023-06-22 16:42:11,632 INFO [train.py:996] (2/4) Epoch 6, batch 13000, loss[loss=0.1952, simple_loss=0.2801, pruned_loss=0.05512, over 21828.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3071, pruned_loss=0.07472, over 4282793.46 frames. ], batch size: 372, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:42:33,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=992838.0, ans=0.125
2023-06-22 16:42:39,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992898.0, ans=0.1
2023-06-22 16:43:11,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.62 vs. limit=8.0
2023-06-22 16:43:52,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=993018.0, ans=0.125
2023-06-22 16:43:55,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993018.0, ans=0.1
2023-06-22 16:44:03,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=993018.0, ans=0.0
2023-06-22 16:44:05,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=993018.0, ans=0.125
2023-06-22 16:44:19,927 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.566e+02 2.996e+02 3.463e+02 5.052e+02, threshold=5.993e+02, percent-clipped=0.0
2023-06-22 16:44:24,265 INFO [train.py:996] (2/4) Epoch 6, batch 13050, loss[loss=0.2453, simple_loss=0.3142, pruned_loss=0.08826, over 21932.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3045, pruned_loss=0.07336, over 4287774.86 frames. ], batch size: 415, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:45:15,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=993198.0, ans=0.1
2023-06-22 16:45:23,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=993258.0, ans=0.125
2023-06-22 16:46:42,708 INFO [train.py:996] (2/4) Epoch 6, batch 13100, loss[loss=0.277, simple_loss=0.4222, pruned_loss=0.06591, over 19634.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3057, pruned_loss=0.0733, over 4288348.82 frames. ], batch size: 702, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:47:27,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=993498.0, ans=0.125
2023-06-22 16:48:19,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=993618.0, ans=0.125
2023-06-22 16:48:30,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=993618.0, ans=0.1
2023-06-22 16:49:00,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.692e+02 3.327e+02 4.339e+02 6.233e+02, threshold=6.654e+02, percent-clipped=2.0
2023-06-22 16:49:18,894 INFO [train.py:996] (2/4) Epoch 6, batch 13150, loss[loss=0.2, simple_loss=0.26, pruned_loss=0.07005, over 21188.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3093, pruned_loss=0.07568, over 4281809.53 frames. ], batch size: 143, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:49:26,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=993738.0, ans=0.0
2023-06-22 16:50:56,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=993918.0, ans=0.125
2023-06-22 16:51:34,369 INFO [train.py:996] (2/4) Epoch 6, batch 13200, loss[loss=0.2277, simple_loss=0.3006, pruned_loss=0.0774, over 21400.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3077, pruned_loss=0.07534, over 4280037.36 frames. ], batch size: 549, lr: 5.13e-03, grad_scale: 32.0
2023-06-22 16:51:38,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=12.0
2023-06-22 16:53:16,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=994218.0, ans=0.09899494936611666
2023-06-22 16:53:43,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=994278.0, ans=0.125
2023-06-22 16:53:45,958 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.879e+02 3.252e+02 3.649e+02 5.858e+02, threshold=6.504e+02, percent-clipped=0.0
2023-06-22 16:53:56,125 INFO [train.py:996] (2/4) Epoch 6, batch 13250, loss[loss=0.2288, simple_loss=0.3258, pruned_loss=0.06589, over 21800.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3071, pruned_loss=0.07686, over 4282072.68 frames. ], batch size: 332, lr: 5.13e-03, grad_scale: 32.0
2023-06-22 16:54:34,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=994398.0, ans=0.2